ReaperSMS wrote:
The chip can usually do a better job of scheduling than the compiler, as it has a lot more information regarding what addresses everything is dealing with, so aliasing is not much of an issue.
Technically the compiler can see a lot more, but language semantics usually restrict what it can do. C99 is supposed to help with this by adding a new keyword,
restrict, which lets the programmer promise that a pointer's target isn't aliased by any other pointer in scope.
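A minimal sketch of what that promise buys (function and names here are made up for illustration): with restrict on all three pointers, the compiler may load *k once and keep it in a register, instead of reloading it after every store through dst in case the store changed it.

```c
#include <stddef.h>

/* Hypothetical example: restrict tells the compiler that dst, src,
   and k never alias, so *k can be hoisted out of the loop rather
   than reloaded on every iteration. */
void add_scaled(float *restrict dst, const float *restrict src,
                const float *restrict k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * *k;
}
```

Without the qualifiers, the compiler must assume a store to dst[i] might have modified *k and re-read it each pass.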
ReaperSMS wrote:
and it can't really look outside of the current function most of the time.
It can if the other function is in the same source code file and
static. That's at least part of how it decides to consider something for inlining.
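Something like this sketch (names are hypothetical): because the helper has internal linkage, the compiler knows every call site is in this translation unit, so it's free to inline the body and may not even emit a standalone copy.

```c
/* helper is static, so all of its callers are visible to the compiler
   in this source file -- a prime candidate for inlining. */
static int helper(int x)
{
    return x * x + 1;
}

int caller(int x)
{
    return helper(x) + helper(x + 1);  /* both calls can be inlined */
}
```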
mozz wrote:
But what I meant by "modern CPU designs" was pretty much all x86-based chips made since the Pentium Pro in the mid-90's.
You and blargg are both right. The P6 architecture as amended is comparatively modern (even if a decade old), but the underlying 8086 bytecode is ancient.
mozz wrote:
They use register renaming to a file of 40 or more internal registers, and break all of the quirky and complex x86 instructions up into simple, RISC micro-ops before executing them.
On the other hand, the MIPS, PowerPC, and ARM bytecodes are designed to be decoded to micro-ops using less circuitry. That's pretty much the big difference between CISC and RISC, but at a slight cost in code density for RISC. The RISC people have made various compromises to improve code density, such as the Thumb bytecode.
mozz wrote:
As for OOE, current x86 chips can have like 96 micro-ops in flight and issue 3 or more instructions every cycle, so they are able to start lots of new instructions while waiting for already-running instructions to finish, even across things like call/return boundaries.
I can see how handling out-of-order across basic blocks might be a win on a system with few interrupts. But is this true of typical applications?
mozz wrote:
It's hard to impossible for a compiler to match that level of performance on an in-order execution chip. But I'd say the three cores of the 360 more than makes up for the difference!
That and especially the two threads per core.
blargg wrote:
The only hard information is how fast a particular pair of algorithms execute on a particular processor
What do you want to see? A space-time-accuracy comparison of CORDIC and LUT approaches to cosine calculation on a Nintendo DS, perhaps?
Nessie wrote:
I believe [accessing a buffer allocated statically] uses absolute addressing (something like 0x00010000+0x4000) whereas [accessing a buffer created with malloc()] would have to use a register as a pointer (esi+0x4000). So, the first option would probably be faster since it's easier to pipeline when no pointer register is involved.
On ARM CPUs such as that of the DS, an immediate operand must be an 8-bit value rotated right by an even number of bits (0, 2, 4, ..., 30). Any other constant must be stored in a literal pool after the return instruction and loaded relative to the program counter. RAM is at 0x02000000 through 0x023FFFFF, so the absolute address of a statically allocated variable might be 0x0201acc4, which doesn't fit the immediate rule.
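The rule is easy to check mechanically. This sketch (the function name is mine) tests whether a 32-bit constant is expressible as an 8-bit value rotated by an even amount; rotating left or right by even amounts generates the same set of constants.

```c
#include <stdint.h>

/* Return 1 if v can be encoded as an ARM data-processing immediate:
   an 8-bit value rotated by an even number of bit positions. */
int is_arm_immediate(uint32_t v)
{
    for (int r = 0; r < 32; r += 2) {
        /* Undo a rotate-right-by-r: if the result fits in 8 bits,
           the encoder could have produced v from it. */
        uint32_t undone = (v << r) | (v >> ((32 - r) & 31));
        if (undone <= 0xFF)
            return 1;
    }
    return 0;
}
```

So 0x000000FF and 0xFF000000 encode fine, but an address like 0x0201ACC4 needs too many significant bits and has to come from a literal pool.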
blargg wrote:
If you write to the array then access it again, some compilers won't be smart enough to determine that the a and b pointers didn't change, so it will re-read them before the next access; in some cases, the compiler can't prove that they won't change, so it must re-read them.
I thought that's what the
volatile keyword was for: without it, the compiler may assume memory doesn't change behind its back except across a function call; the keyword marks objects that can change at any time. There's also the tradeoff of rereading the pointer from the data section vs. rereading it from the code.
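A small sketch of the re-read blargg describes (names are hypothetical): because a and b may point into the same storage, every store through a forces a reload of *b on the next iteration; restrict-qualifying the pointers, or caching *b in a local, would let the compiler keep the value in a register. volatile is the opposite tool, forcing re-reads even where the compiler could prove nothing changed.

```c
/* Because a and b may alias, the compiler must assume the store to
   a[i] could have modified *b, and reload *b on each iteration. */
void fill(int *a, const int *b, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = *b;
}
```

And the aliasing case is real: calling fill(c + 1, c, n) smears c[0] through the array, which is exactly why the compiler can't cache *b here.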
Split.