ReaperSMS wrote:
The chip can usually do a better job of scheduling than the compiler, as it has a lot more information regarding what addresses everything is dealing with, so aliasing is not much of an issue.
Technically the compiler can see a lot more, but language semantics usually restrict what it can do. C99 is supposed to help with this by adding a new keyword,
restrict, which lets the programmer promise that a pointer's target isn't aliased by any other pointer in scope.
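A minimal sketch of what that promise buys (function and names here are made up for illustration): with restrict on all three pointers, the compiler may load *k once and keep it in a register, instead of reloading it after every store through dst in case the store changed it.

```c
#include <stddef.h>

/* Hypothetical example: restrict tells the compiler that dst, src,
   and k never alias, so *k can be hoisted out of the loop rather
   than reloaded on every iteration. */
void add_scaled(float *restrict dst, const float *restrict src,
                const float *restrict k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * *k;
}
```

Without the qualifiers, the compiler must assume a store to dst[i] might have modified *k and re-read it each pass.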
ReaperSMS wrote:
and it can't really look outside of the current function most of the time.
It can if the other function is in the same source code file and
static. That's at least part of how it decides to consider something for inlining.
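Something like this sketch (names are hypothetical): because the helper has internal linkage, the compiler knows every call site is in this translation unit, so it's free to inline the body and may not even emit a standalone copy.

```c
/* helper is static, so all of its callers are visible to the compiler
   in this source file -- a prime candidate for inlining. */
static int helper(int x)
{
    return x * x + 1;
}

int caller(int x)
{
    return helper(x) + helper(x + 1);  /* both calls can be inlined */
}
```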
mozz wrote:
But what I meant by "modern CPU designs" was pretty much all x86-based chips made since the Pentium Pro in the mid-90's.
You and blargg are both right. The P6 architecture as amended is comparatively modern (even if a decade old), but the underlying 8086 bytecode is ancient.
mozz wrote:
They use register renaming to a file of 40 or more internal registers, and break all of the quirky and complex x86 instructions up into simple, RISC micro-ops before executing them.
On the other hand, the MIPS, PowerPC, and ARM bytecodes are designed to be decoded to micro-ops using less circuitry. That's pretty much the big difference between CISC and RISC, but at a slight cost in code density for RISC. The RISC people have made various compromises to improve code density, such as the Thumb bytecode.
mozz wrote:
As for OOE, current x86 chips can have like 96 micro-ops in flight and issue 3 or more instructions every cycle, so they are able to start lots of new instructions while waiting for already-running instructions to finish, even across things like call/return boundaries.
I can see how handling out-of-order across basic blocks might be a win on a system with few interrupts. But is this true of typical applications?
mozz wrote:
It's hard to impossible for a compiler to match that level of performance on an in-order execution chip. But I'd say the three cores of the 360 more than makes up for the difference!
That and especially the two threads per core.
blargg wrote:
The only hard information is how fast a particular pair of algorithms execute on a particular processor
What do you want to see? A space-time-accuracy comparison of CORDIC and LUT approaches to cosine calculation on a Nintendo DS, perhaps?
Nessie wrote:
I believe [accessing a buffer allocated statically] uses absolute addressing (something like 0x00010000+0x4000) whereas [accessing a buffer created with malloc()] would have to use a register as a pointer (esi+0x4000). So, the first option would probably be faster since it's easier to pipeline when no pointer register is involved.
On ARM CPUs such as that of the DS, an immediate operand must be an 8-bit value rotated right by an even number of bits (0, 2, 4, ..., 30). Any other constant must be stored in a literal pool after the return instruction and loaded relative to the program counter. RAM is at 0x02000000 through 0x023FFFFF, so the absolute address of a statically allocated variable might be 0x0201acc4, which doesn't fit the immediate rule.
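The rule is easy to check mechanically. This sketch (the function name is mine) tests whether a 32-bit constant is expressible as an 8-bit value rotated by an even amount; rotating left or right by even amounts generates the same set of constants.

```c
#include <stdint.h>

/* Return 1 if v can be encoded as an ARM data-processing immediate:
   an 8-bit value rotated by an even number of bit positions. */
int is_arm_immediate(uint32_t v)
{
    for (int r = 0; r < 32; r += 2) {
        /* Undo a rotate-right-by-r: if the result fits in 8 bits,
           the encoder could have produced v from it. */
        uint32_t undone = (v << r) | (v >> ((32 - r) & 31));
        if (undone <= 0xFF)
            return 1;
    }
    return 0;
}
```

So 0x000000FF and 0xFF000000 encode fine, but an address like 0x0201ACC4 needs too many significant bits and has to come from a literal pool.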
blargg wrote:
If you write to the array then access it again, some compilers won't be smart enough to determine that the a and b pointers didn't change, so it will re-read them before the next access; in some cases, the compiler can't prove that they won't change, so it must re-read them.
I thought that's what the
volatile keyword was for: without it, the compiler may assume memory doesn't change behind its back except across a function call; the keyword marks objects that can change at any time. There's also the tradeoff of rereading the pointer from the data section vs. rereading it from the code.
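A small sketch of the re-read blargg describes (names are hypothetical): because a and b may point into the same storage, every store through a forces a reload of *b on the next iteration; restrict-qualifying the pointers, or caching *b in a local, would let the compiler keep the value in a register. volatile is the opposite tool, forcing re-reads even where the compiler could prove nothing changed.

```c
/* Because a and b may alias, the compiler must assume the store to
   a[i] could have modified *b, and reload *b on each iteration. */
void fill(int *a, const int *b, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = *b;
}
```

And the aliasing case is real: calling fill(c + 1, c, n) smears c[0] through the array, which is exactly why the compiler can't cache *b here.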
Split.