...is something that is almost always mentioned in any discussion about assembly programming these days. I never see any code comparisons, statistics, or citations; I assume it's just supposed to be obvious enough that it doesn't require an in-depth explanation. Maybe this is due to my being biased (even after learning C and much of C++), but I find it difficult to believe. I don't see how a compiler could generate more efficient code when it has no real intuition of how any specific program operates, and (at least from what I've seen of x86) function calls will always be expensive no matter what, because seemingly no assumptions are made about where or how functions are called.
I thought then that maybe, with how complex instruction timing has become and the other headaches of modern computing, a compiler might be able to write more efficient code. But that specific level of optimization should change with just about every processor, or at least every microarchitecture, although I don't know how drastically.
A compiler can have intimate knowledge of how to order instructions so that various latencies inside modern CPUs won't add up but get overlapped instead. I know I'll be able to beat a compiler when writing 486 and Pentium code, since there's very little that gets in the way, but with a P4 and most current x86 CPUs I definitely won't be. As far as other modern high-performance architectures go, I don't know.
Drew Sebastino wrote:
Maybe this is due to me being biased (even after learning C and much of C++), but I find this difficult to believe. I don't see how a compiler could generate more efficient code when it has no real intuition of how any specific program operates, and (at least for what I've seen for x86), function calls will always be expensive no matter what, because seemingly no assumptions are made about where / how functions are called.
Generating optimal code often requires knowing some things that compilers can know but programmers generally won't, but also requires knowing some things that programmers will know but compilers can't. Further, it requires recognizing that any "optimizations" predicated upon the notion that a programmer won't do X will be counter-productive if a programmer needs to do X.
Unfortunately, development on C compilers like clang and gcc has been taken over by people who mistakenly believe "clever" and "stupid" are antonyms. The authors of the C Standard didn't think it necessary to say "This Standard makes no attempt to forbid obtuse compiler writers from doing stupid things that make their compilers unsuitable for many purposes". They recognize that principle in the published Rationale, but not within the Standard itself. Consequently, there's an increasing divergence between the language that became popular in the 1980s and the dialect that clang and gcc process with optimizations enabled, with the latter being suitable only for a few specialized applications that do not involve processing any potentially invalid input.
TmEE wrote:
A compiler can have intimate knowledge of how to order instructions so that various latencies inside modern CPUs won't add up but get overlapped instead. I know I'll be able to beat a compiler when writing 486 and Pentium code since there's very little that gets in the way but with a P4 and most current x86 CPU I definitely won't be. As far as other modern high performance architectures go, I don't know.
While it's possible for modern compilers to exploit things about architectures that programmers wouldn't be able to with any reasonable amount of effort, programmers know more than compilers about what needs to be done. Consider, for example, a function which is supposed to accept pointers to two 4096-byte blocks of memory and return 0 if they contain matching contents or any non-zero value if they don't. One approach would be to simply compare the two blocks a byte at a time, stopping after all 4096 bytes have been compared or a mismatch has been found. Another would be to check the pointers for alignment, select one of 64 comparison methods based upon the pointer alignments, and then process eight bytes at a time, using whatever shifts are required. The first approach may be more than double the speed of almost anything else in cases where it's necessary to compare many blocks that match except for the first byte. The second approach may be almost eight times as fast as the first in cases where the two blocks match. If the Standard were to include alternate memory-comparison functions which implementations could simply chain to `memcmp` if desired, but which implementations would be encouraged to optimize for different cases, then picking the right routine in various situations could do more to improve performance than anything a clever implementation might hope to accomplish.
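As a rough sketch of the two approaches (not from the post itself; the 64-way alignment dispatch described above is reduced here to an assumption that `memcpy` handles alignment):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 4096

/* Byte-at-a-time: stops at the first mismatch, so it can win big when
   the blocks usually differ near the start. */
int blocks_differ_bytewise(const unsigned char *a, const unsigned char *b)
{
    for (size_t i = 0; i < BLOCK_SIZE; i++)
        if (a[i] != b[i])
            return 1;
    return 0;
}

/* Eight-bytes-at-a-time: fastest when the blocks usually match.
   memcpy sidesteps alignment and aliasing problems; the hand-tuned
   version described above would instead dispatch on the actual
   pointer alignments. */
int blocks_differ_wordwise(const unsigned char *a, const unsigned char *b)
{
    for (size_t i = 0; i < BLOCK_SIZE; i += sizeof(uint64_t)) {
        uint64_t wa, wb;
        memcpy(&wa, a + i, sizeof wa);
        memcpy(&wb, b + i, sizeof wb);
        if (wa != wb)
            return 1;
    }
    return 0;
}
```

Neither version is "the" right one: which wins depends entirely on the data, which is exactly the information the programmer has and the compiler doesn't.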
While some compiler writers might suggest "profile-driven optimization" as a solution to such issues, that fails to consider the fact that a compiler will generally have no way of distinguishing cases where performance matters from those where it doesn't. If an operation needs to be performed 1000 times, it may be better to have it take 10us 99% of the time and 1000us 1% of the time than to have it take 100us every time, provided the time for all 1000 operations never exceeds 200ms. If the time for 1000 operations could sometimes reach a full second, however, that may be unacceptable. Such issues can be especially important in cases where performance can depend upon the data a program receives from potentially malicious sources. If a language has no way of expressing such constraints, compilers cannot be expected to generate optimal code that meets them.
I suspect the idea that compilers can outperform handwritten assembly comes from modern 32- and 64-bit multi-core high-end processors that have super-complex instructions, deep pipelines that must be kept full, multi-level cache, out-of-order execution, and things that simply don't apply to the 6502. I've never heard of a 6502 compiler that could remotely compete with hand assembly. Perhaps a great compiler could be written if there were the commercial interest to invest the necessary man-hours, but I still don't think it would produce code that could match, let alone exceed, the performance of code from any skilled 6502 assembly-language programmer.
The idea post-dates most of the 32-bit era.
Compilers started getting competitive with hand-written code in the 486/Pentium era, and have gotten better as the commonly used subset of x86 shrinks and the chips become a bit more regular as far as timing goes. A hefty amount of that is due to the microarchitecture doing a lot of the work for them.
For modern stuff, the only real reason to drop to inline assembly anymore is when you're trying to express concepts that C/C++ really don't want to do, or possibly SIMD. But those areas are shrinking rapidly, and SIMD is far, far easier to deal with via intrinsics.
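As a small illustration of the intrinsics route (my sketch, assuming an x86 target with SSE2; the function name is made up):

```c
#include <emmintrin.h>  /* SSE2 intrinsics; x86 target assumed */
#include <stdint.h>

/* Adds four 32-bit ints in one SIMD operation.  The compiler takes
   care of register allocation and instruction scheduling around the
   intrinsics, which is most of why this is far easier to live with
   than hand-written SIMD assembly. */
void add4(int32_t *dst, const int32_t *a, const int32_t *b)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)dst, _mm_add_epi32(va, vb));
}
```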
Compilers are generally pretty good about tedious, mechanical, predictable, repetitive operations like register allocation, loop unrolling, and inlining (though unrolling is a lot more situational now, and arguably has been since branch prediction became a thing). They're at best tied on subexpression elimination, but their hands are tied a bit there by language semantics.
It's been a rather long while since I last ran into a situation where the compiler output wasn't mostly what I would have written anyway, for anything where it mattered, and even in those cases there was usually a way to tweak the C/C++ code to make the compiler do the better thing without having to go to assembly.
Garth wrote:
I suspect the idea that compilers can outperform handwritten assembly comes from modern 32- and 64-bit multi-core high-end processors that have super-complex instructions, deep pipelines that must be kept full, multi-level cache, out-of order execution, and things that simply don't apply to the 6502. I've never heard of a 6502 compiler that could remotely compete with hand assembly. Perhaps a great compiler could be written if there were the commercial interest to invest the necessary man-hours, but I still don't think it would produce code that could match, let alone exceed, the performance of code from any skilled 6502 assembly-language programmer.
Actually, the idea substantially predates the 6502. Many optimization techniques would result in code that was very brittle, and would thus need to be substantially reworked if requirements changed. If one were trying to write an assembly-language program to process a two-dimensional array, one would likely have to either write code that would only work for a power-of-two array size, or write less efficient code that would work with any size. If one were instead writing the code in FORTRAN, however, one wouldn't have to change anything other than the array dimension to make the compiler automatically generate whatever code was appropriate for the specified size. Hand-written assembly code that was specifically tailored for a particular array size might outperform the compiler-generated machine code for that size, but assembly code that needed to be easily maintainable could not outperform the machine code generated by a compiler that was not thus encumbered.
Fair 'nuff. My experience is almost entirely on 486+ PC hardware, retro consoles, and PS2/GC-present consoles.
I could certainly see the promise of a sufficiently smart compiler being closer to reality on the sort of machines people wrote FORTRAN code for. It definitely wasn't the case for most of the PC compilers (Watcom excepted).
Console compilers have been a mixed bag, with a particularly painful era of false promises around the 360/PS3, when IBM managed to convince everyone that out-of-order execution was a passing fad and the compiler would totally be able to schedule code properly.
supercat wrote:
Garth wrote:
I suspect the idea that compilers can outperform handwritten assembly comes from modern 32- and 64-bit multi-core high-end processors that have super-complex instructions, deep pipelines that must be kept full, multi-level cache, out-of order execution, and things that simply don't apply to the 6502. I've never heard of a 6502 compiler that could remotely compete with hand assembly. Perhaps a great compiler could be written if there were the commercial interest to invest the necessary man-hours, but I still don't think it would produce code that could match, let alone exceed, the performance of code from any skilled 6502 assembly-language programmer.
Actually, the idea substantially predates the 6502. Many optimization techniques would result in code that was very brittle, and would thus need to be substantially reworked if requirements changed. If one were trying to write an assembly-language program to process a two-dimensional array, one would likely have to either write code that would only work for a power-of-two array size, or write less efficient code that would work with any size. If one were instead writing the code in FORTRAN, however, one wouldn't have to change anything other than the array dimension to make the compiler automatically generate whatever code was appropriate for the specified size. Hand-written assembly code that was specifically tailored for a particular array size might outperform the compiler-generated machine code for that size, but assembly code that needed to be easily maintainable could not outperform the machine code generated by a compiler that was not thus encumbered.
I think that's a very small part of programming though; and even then, you could write assembly macros that would generate the optimum code for the needed array size. If you decide to change the array size, there's no need to re-write the code that calls the macros. I've had macro definitions that were 50 lines or more yet only laid down two ML instructions, depending on all the conditions the macros watched for.
Garth wrote:
I think that's a very small part of programming though; and even then, you could write assembly macros that would generate the optimum code for the needed array size. If you decide to change the array size, there's no need to re-write the code that calls the macros. I've had macro definitions that were 50 lines or more yet only laid down two ML instructions, depending on all the conditions the macros watched for.
A macro assembler that can let the programmer specify and handle all of the proper cases will generally be more complicated than a compiler that can simply generate the proper code for the proper cases, and even when using such tools the level of programmer effort required to handle such cases will be greater than would have been needed in a higher-level language.
I gave array sizes as a simple example of something I know compilers could do even in the 1960s. Register usage and constant reuse would actually be even more significant issues, but I'm not sure what the state of the art was when the 6502 was introduced. If one writes something in FORTRAN equivalent to the C code:
Code:
#define SCALE1 2.5f
#define SCALE2 1.5f

double adjust(float x, float y)
{
    return x*SCALE1 + y*(SCALE2+1.0f);
}
a compiler could exploit the fact that the same value 2.5f would be used in both places, but exploiting that in assembly code while still having the program work if the constants changed would be harder. And given something equivalent to:
Code:
int array1[1000];

int test(int n)
{
    for (int i=0; i<100; i++)
    {
        array1[i] += 1;
        someFunction();
        array1[i]--;
    }
}
it would likely be worthwhile to keep the address of array1[i] in a "marching pointer" register if it wouldn't get disturbed by the call to someFunction(), but it wouldn't be worth putting it into a register that would need to be saved before the function call and restored afterward. I don't think compilers of the 1960s would have spent the effort to examine the callers of someFunction() to decide what registers should be left alone, but I would expect they could had earlier-compiled functions report any registers that they would "naturally" leave undisturbed.
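The hand transformation being described might look like this (a sketch of mine; `someFunction` here is just a hypothetical stand-in for the external call):

```c
int array1[1000];

void someFunction(void)
{
    /* hypothetical stand-in for the external call in the example */
}

/* "Marching pointer" version of the loop above: the indexed access
   array1[i] is replaced by a pointer that advances each iteration,
   so the address isn't recomputed from scratch every time.  Whether
   the pointer actually survives in a register hinges on which
   registers the call to someFunction() leaves undisturbed. */
void test_marching(void)
{
    int *p = array1;
    for (int i = 0; i < 100; i++, p++) {
        *p += 1;
        someFunction();
        *p -= 1;
    }
}
```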
"Modern compilers will outperform handwritten assembly." I think it depends on the program being written, on the compiler, on the computer (or VM) being targeted, and on the programmer. (In some cases I know, because I have tried it, that the handwritten assembly code will work better.)
I think the cases where people even try to hand-write assembly, and to optimize it, are now so rare that basically no engineers are able to do that anymore. The trade-off is not just the code's performance but the time it takes to actually hand-write highly performing assembly code. In the industry it's not worth paying an engineer to take two weeks optimizing an assembly routine when you could just use sub-optimal code and better hardware instead. The hardware might be more expensive, but the shorter time to market ultimately makes this the best-performing solution.
In the case of retro consoles it's different, because the hardware is fixed and we have to push it to its limit.
Bregalad wrote:
I think the cases where people even try to hand-write assembly, and to optimize it, are now so rare that basically no engineers are able to do that anymore. The performance gain is not just the code but the time it takes to actually hand-write highly performing assembly code. In the industry it's not worth paying an engineer to take two weeks optimizing his assembly routine when you could just use sub-optimal code and better, hardware instead. The hardware might be more expensive but the lower time to market finally makes this solution the most performing.
Even on modern microcontrollers, there are many cases where spending a few hours on hand-written assembly may offer some major performance improvements. Most of the improvements that would be achievable within a couple weeks wouldn't take more than a few hours to achieve. A bigger issue is determining what dialect of C makes the most sense, and ensuring that one's compiler will actually process it without jumping the rails.
For example, if one needs a function that takes a pointer to a 32-bit-aligned byte address and interprets the four bytes there as a big-endian 32-bit unsigned int, such a function might be written a couple of ways for, e.g., the Cortex-M0:
Code:
uint32_t fetch32le_v1(void *p)
{
    uint8_t *pp = p;
    return ((uint32_t)pp[0]<<24) | ((uint32_t)pp[1]<<16) | (pp[2]<<8) | pp[3];
}

uint32_t fetch32le_v2(void *p)
{
    uint32_t result = *(uint32_t*)p;
    return (result<<24) | ((result<<8) & 0xFF0000) | ((result>>8) & 0xFF00) | (result >> 24);
}
A compiler that has no way of knowing that `p` will be word-aligned couldn't process fetch32le_v1 using fewer than four byte reads, three shifts, and three "or" instructions. GCC's optimizer uses those operations to assemble the value in little-endian order and then uses a byte-reverse instruction. In the -fno-strict-aliasing dialect, the fetch32le_v2 function could achieve the required behavior using two instructions plus the return rather than ten (or eleven), but in gcc's "preferred" dialect the latter function won't be reliable.
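For what it's worth, there is a third idiom (not from the post above; sketched here on the assumption of gcc or clang, whose `__builtin_bswap32` extension it uses) that avoids both traps: the `memcpy` carries no alignment or aliasing claims the compiler can punish, yet gcc and clang recognize the pattern and typically emit a single word load plus a byte-reverse:

```c
#include <stdint.h>
#include <string.h>

/* memcpy avoids the strict-aliasing problem of fetch32le_v2 and the
   byte-by-byte assembly of fetch32le_v1; modern gcc/clang compile
   this down to one load plus a REV/bswap instruction.
   __builtin_bswap32 is a gcc/clang extension; a fully portable build
   could substitute the shift-and-mask expression from v2. */
uint32_t fetch32_v3(const void *p)
{
    uint32_t result;
    memcpy(&result, p, sizeof result);
    return __builtin_bswap32(result);
}
```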
This seems more a problem with the C language being inappropriate to describe the program, rather than the compiler failing at optimizing the assembly.
Bregalad wrote:
This seems more a problem with the C language being inappropriate to describe the program, rather than the compiler failing at optimizing the assembly.
Standard C maybe, but that's also why most compilers have extensions like __declspec(align) or __attribute__((aligned)). There are a lot of very practical ways to address domain-specific problems when you actually allow domain-specific solutions.
Also, if you wait long enough, things that everybody implements anyway do usually make it into the standard spec eventually.
rainwarrior wrote:
Also, if you wait long enough, things that everybody implements anyway do usually make it into the standard spec eventually.
Only if it's something that the authors of the Standard hadn't expected to be supported by any compiler writers who weren't being deliberately obtuse. Unfortunately, the authors of the Standard are unwilling to make clear that the Standard was never intended to forbid all of the stupid things compilers might do that would make them unsuitable for many purposes, but instead relied upon compiler writers being capable of exercising sound judgment in such things. The authors of the Standard have said that they'd expect most compilers to process something like the following as if the multiplication were performed on unsigned values:
Code:
unsigned mul_mod_65536(unsigned short x, unsigned short y) { return (x*y) & 0xFFFF; }
There was no need to mandate that the multiplication be processed as unsigned because implementations where it would make sense to process it that way would behave in that fashion with or without a mandate. The only situations where the authors would have expected a mandate would matter would be those where some other treatment would be more useful for some reason (e.g. on ones'-complement hardware, where making the result be accurate for values larger than INT_MAX would require an extra correction step), and in those cases the authors of the Standard saw no reason to prohibit the more useful treatment.
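To make the hazard concrete (an illustration of mine, not from the post above): with 16-bit unsigned short and 32-bit int, both operands promote to *signed* int, so the multiplication can overflow. Forcing one operand to unsigned keeps the arithmetic defined on every implementation:

```c
/* On a typical implementation with 16-bit unsigned short and 32-bit
   int, x and y promote to signed int, so 0xFFFF * 0xFFFF overflows
   INT_MAX -- formally undefined behavior, even though quiet
   wraparound would give exactly the answer the masking expects. */
unsigned mul_mod_65536_safe(unsigned short x, unsigned short y)
{
    /* Converting one operand makes the whole multiplication unsigned,
       which is defined to wrap, at no cost on two's-complement CPUs. */
    return ((unsigned)x * y) & 0xFFFFu;
}
```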
rainwarrior wrote:
Standard C maybe, but that's also why most compilers have extensions like __declspec(align) or __attribute__((aligned)). There are a lot of very practical ways to address domain-specific problems when you actually allow domain-specific solutions.
Also, if you wait long enough, things that everybody implements anyway do usually make it into the standard spec eventually.
This is also extremely ugly and inelegant.
Bregalad wrote:
This is also extremely ugly and inelegant.
In practice, not really? You put the declspec/attribute in a single typedef and reuse it as much as you want with whatever name you choose. It's one line of code, probably in some header you don't need to look at much.
Code:
// define once
typedef __declspec(align(8)) unsigned int auint;
// use everywhere
auint x;
auint y,z;
__attribute__ can certainly get pretty damn ugly. Here's one you have to use to squelch warnings on newer gcc and clang when presented with functions like vfprintf()/vsprintf() and friends, when using -Wformat-nonliteral (which I believe is brought in via -Wall or -Wextra, I forget which):
Code:
__attribute__((__format__ (__printf__, 1, 0)))
void status(const char *fmt, ...)
{
    va_list argp;
    va_start(argp, fmt);
    vfprintf(stdout, fmt, argp);
    va_end(argp);
    fflush(stdout);
}
The 1 there is critical: it's the positional location of the formatting parameter. So the number has to change depending on where that fmt argument happens to be, order-wise. Eventually I ceased writing/using "helper functions" of this sort and used the libc ones directly, solely to avoid this madness. (Oh, also, tools like Valgrind tend to get really pissy about all of this -- and rightfully so.)
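To illustrate that positional bookkeeping (a hypothetical helper of mine, gcc/clang attribute syntax assumed, not koitsu's actual code):

```c
#include <stdarg.h>
#include <stdio.h>

/* With the format string as parameter 2 and the variadic arguments
   starting at parameter 3, the attribute's numbers shift to (2, 3).
   A third number of 0, as in status() above, instead means "check the
   format string but not the arguments" -- the usual choice for
   functions that forward to a vprintf-style routine. */
__attribute__((format(printf, 2, 3)))
void log_to(FILE *out, const char *fmt, ...)
{
    va_list argp;
    va_start(argp, fmt);
    vfprintf(out, fmt, argp);
    va_end(argp);
}
```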
I certainly wish there were a... I don't know how to put it -- cleaner and more "simply declared"? -- way to request alignment on data types (esp. within structs). I imagine Bregalad feels similarly. Ah well, no PL is perfect; at least we have *something*.
koitsu wrote:
I certainly wish there was a... I don't know how to put it -- cleaner and more "simply declared"? -- way to request alignment on data types (esp. within structs).
The third link in my post above was to the C++11 alignas keyword, which is exactly that. Wish granted.
I brought up __declspec and __attribute__ to point out that compiler authors have been supplementing the standard to address needs like this since the very beginning.
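For reference, C11 later standardized the same facility on the C side as _Alignas, with an alignas macro in <stdalign.h>, which covers the struct-member case directly. A small sketch (the names here are hypothetical):

```c
#include <stdalign.h>
#include <stdint.h>

/* C11's standardized spelling of the same request: alignas (a macro
   for _Alignas from <stdalign.h>) replaces the vendor-specific
   __declspec(align(...)) and __attribute__((aligned(...))) forms.
   Note C11 forbids _Alignas inside a typedef, so it goes on the
   member or object declaration itself. */
struct packet_buffer {
    uint8_t header;
    alignas(8) uint8_t payload[64];      /* member aligned to 8 bytes */
};

alignas(16) static uint8_t dma_area[256]; /* object aligned to 16 bytes */
```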