DRW wrote:
I'm just a bit confused why even the assigned value makes a difference in calculating the array itself.
Well, honestly it's not like the operation of cc65's optimizer is at all obvious. It generates the same initial assembly code whether or not -O is used, and then, if -O was given, it runs a series of pattern-matching steps to refactor the already generated assembly code. It's a bit of a backward approach to the problem, and really weird. It would be much better to start optimizing at a higher level, but I'm not going to fantasize too much about that. This is what the compiler we have does.
If you want to understand what it does, like I said just above, you can use --debug-opt-output to observe the process. It will show you exactly what the initial generated assembly code is, and every step it takes to optimize it.
If you want an example:
Code:
// starting C line
(oam+3)[oam_pos] = x+5;
; 1. initial generated assembly:
lda #<(_oam+3)
ldx #>(_oam+3)
clc
adc _oam_pos
bcc L02DF
inx
L02DF: jsr pushax
ldy #$04
ldx #$00
lda (sp),y
jsr incax5 ; this is the +5 operation, which critically breaks up the optimization pattern (see below)
ldx #$00
ldy #$00
jsr staspidx
; 2. OptAdd5
lda #<(_oam+3)
ldx #>(_oam+3)
clc
adc _oam_pos
bcc L02DF
inx
L02DF: jsr pushax
ldy #$04
ldx #$00
lda (sp),y
clc
adc #$05 ; jsr incax5 replaced with an inline add
ldx #$00
ldy #$00
jsr staspidx
; 3. OptUnusedLoads
lda #<(_oam+3)
ldx #>(_oam+3)
clc
adc _oam_pos
bcc L02DF
inx
L02DF: jsr pushax
ldy #$04 ; unused ldx #00 eliminated
lda (sp),y
clc
adc #$05
ldy #$00 ; unused ldx #00 eliminated
jsr staspidx
; 4. OptStackOps
lda #<(_oam+3)
ldx #>(_oam+3)
clc
adc _oam_pos
bcc L02DF
inx
L02DF: sta ptr1 ; jsr pushax / staspidx (temporary pointer on stack) replaced by keeping it in ptr1
stx ptr1+1
ldy #$02
lda (sp),y
clc
adc #$05
ldy #$00
sta (ptr1),y
At that point, it's out of patterns to match and stops optimizing.
For comparison, without the intervening +5:
Code:
// starting C line
(oam+3)[oam_pos] = x;
; 1. initial generated assembly
lda #<(_oam+3)
ldx #>(_oam+3)
clc
adc _oam_pos
bcc L02DF
inx
L02DF: jsr pushax
ldy #$04
ldx #$00
lda (sp),y
ldy #$00
jsr staspidx
; 2. OptPtrStore2
lda #<(_oam+3)
ldx #>(_oam+3)
ldy #$02
ldx #$00
lda (sp),y
ldy _oam_pos
sta _oam+3,y ; the temporary pointer on stack is eliminated, and the index add is replaced with Y index
; 3. OptUnusedLoads
ldy #$02 ; unused ldx #00 eliminated
lda (sp),y
ldy _oam_pos
sta _oam+3,y
So you can see the pattern that failed to apply is called OptPtrStore2.
The optimization patterns are each a function in the cc65 source. They tend to have a good explanation of the operation. Looking up OptPtrStore2:
Code:
unsigned OptPtrStore2 (CodeSeg* S)
/* Search for the sequence:
**
** clc
** adc xxx
** bcc L
** inx
** L: jsr pushax
** ldy yyy
** ldx #$00
** lda (sp),y
** ldy #$00
** jsr staspidx
**
** and replace it by:
**
** sta ptr1
** stx ptr1+1
** ldy yyy-2
** ldx #$00
** lda (sp),y
** ldy xxx
** sta (ptr1),y
**
** or by
**
** ldy yyy-2
** ldx #$00
** lda (sp),y
** ldy xxx
** sta (zp),y
**
** or by
**
** ldy yyy-2
** ldx #$00
** lda (sp),y
** ldy xxx
** sta label,y
**
** or by
**
** ldy yyy-2
** ldx #$00
** lda (sp),y
** ldy xxx
** sta $xxxx,y
**
** depending on the code preceeding the sequence above.
*/
As you can see, this pattern was written specifically to match a simple fetch from the stack followed by a store. There are two other variations of this type of pattern; you can read about them all in the header.
The problem is simply that each of them can only match a very simple pattern. It's looking for a specific beginning, middle, and end. You can't just start adding arbitrary extra code in the middle (i.e. the expression to be resolved); anything new in the middle has to fit a pattern that's known to be safe to optimize. It's probably possible to write such a thing, but it's complicated and difficult... so instead only this bunch of simpler cases was written.
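To make the "specific beginning, middle, and end" point concrete, here is a toy sketch in plain C of a fixed-sequence matcher. It is not cc65's actual code: find_pattern, kPattern and the string-per-instruction representation are all made up for illustration, and the real OptPtrStore2 also treats the ldy operand as a wildcard and checks the lines before the sequence. The only thing it is meant to show is that one extra instruction in the middle makes the whole match fail.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy model: an "instruction" is just its text.
** This is the tail of the OptPtrStore2 sequence, with the ldy
** operand hardcoded instead of being a wildcard. */
static const char *const kPattern[] = {
    "jsr pushax",
    "ldy #$04",
    "ldx #$00",
    "lda (sp),y",
    "ldy #$00",
    "jsr staspidx",
};
#define PATTERN_LEN (sizeof kPattern / sizeof kPattern[0])

/* Return the index where the whole pattern occurs contiguously
** in code[0..n), or -1 if it does not occur. */
int find_pattern(const char *const *code, size_t n)
{
    size_t i, j;
    for (i = 0; i + PATTERN_LEN <= n; ++i) {
        for (j = 0; j < PATTERN_LEN; ++j) {
            if (strcmp(code[i + j], kPattern[j]) != 0) {
                break;          /* one mismatch kills this attempt */
            }
        }
        if (j == PATTERN_LEN) {
            return (int)i;
        }
    }
    return -1;
}

/* (oam+3)[oam_pos] = x;  (tail of the initial assembly) */
static const char *const kPlainStore[] = {
    "jsr pushax", "ldy #$04", "ldx #$00", "lda (sp),y",
    "ldy #$00", "jsr staspidx",
};

/* (oam+3)[oam_pos] = x+5;  (tail of the initial assembly) */
static const char *const kStorePlus5[] = {
    "jsr pushax", "ldy #$04", "ldx #$00", "lda (sp),y",
    "jsr incax5", "ldx #$00",
    "ldy #$00", "jsr staspidx",
};

/* Usage sketch: the plain store matches, the +5 version does not. */
void demo_match(void)
{
    assert(find_pattern(kPlainStore, 6) == 0);
    assert(find_pattern(kStorePlus5, 8) == -1);
}
```

The two extra lines from the +5 sit between the fetch and the store, so the matcher never sees the six lines it wants in a row.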
When you put an expression in a temporary variable first, that code gets generated before the array access portion, so it doesn't interfere with the simple "generate address, fetch, store" pattern that these are capable of matching.
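In C terms, the workaround looks something like the sketch below. The names store_inline, store_via_temp, t and demo_store are made up for the example, and oam is declared here as a plain 256-byte array as a stand-in for whatever the real project uses. Both functions store the same value; the difference is only in where the +5 code ends up relative to the store. Whether a given cc65 version actually produces the shorter code for the second form is something to confirm with --debug-opt-output rather than take on faith.

```c
#include <assert.h>

unsigned char oam[256];   /* stand-in for the sprite buffer (assumption) */
unsigned char oam_pos;    /* current index into oam */

/* Expression inside the store: the +5 code is generated between
** the address calculation and the store, breaking up the pattern. */
void store_inline(unsigned char x)
{
    (oam + 3)[oam_pos] = x + 5;
}

/* Expression resolved into a temporary first: the store itself is
** now a plain "generate address, fetch, store". */
void store_via_temp(unsigned char x)
{
    unsigned char t = x + 5;
    (oam + 3)[oam_pos] = t;
}

/* Usage sketch: both versions write the same byte. */
void demo_store(void)
{
    oam_pos = 4;
    store_inline(10);
    assert(oam[7] == 15);

    oam[7] = 0;
    store_via_temp(10);
    assert(oam[7] == 15);
}
```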
Anyhow, that's just one example of how to analyze cc65's optimizer. It's really all spelled out by --debug-opt-output, so if you want to know about any specific cases you're dealing with, that's the tool to use.
Still probably easier to just rewrite offending code in assembly as needed* than it is to try and understand whatever this byzantine optimizer is doing, though.
* ...and don't worry about it when it's not needed.