Introduction to PPC assembly
 
Preface

The following notes show how to start from simple C code and, step by step, end up with assembly code.

When I started writing these notes I had a PPC 7457; today I have some 7410s. This is not a big issue: the code that follows can be built with most of the available compilers, linkers and assemblers, although a few changes may be needed. For the sake of generality, I try not to refer explicitly to any specific compiler, assembler or linker.

By the way, the above-mentioned processors are derivatives of other processors. The three main progenitors referenced in this document are the MPC750 (derivatives include the MPC740 and MPC755), the MPC7400 (for example the MPC7410), and the MPC7450 (for example the MPC7457).

STEP 1: A Plain C Example in one module

Let's start with a simple example in C 
            		
#include <stdio.h> 

int dosomething(int,int);
					
int main(int argc, char ** argv){
    int a,b,r;
	
    a=1;
    b=4;
    
    r=dosomething(a,b);
    printf("\ndosomething(%d,%d)=%d\n",a,b,r);
    return 0;
    		
}

int dosomething(int x, int y){
    int z;

    z=x+y;
    return z;
}
				

Now I can build the code 
                    

                    ${CC} test.c -o test01.ppc

                
I used ${CC} so that everyone can substitute their own compiler.
Running the application the output is 

                    

                    dosomething(1,4)=5   
             

				
STEP 2: A Plain C Example in two modules

Now we split the code into two modules. In the first one I place the following 

				
#include <stdio.h>

extern int dosomething(int,int);
					
int main(int argc, char ** argv){
    
    int a,b,r;
    
    a=1;
    b=4;

    r=dosomething(a,b);
    printf("\ndosomething(%d,%d)=%d\n",a,b,r);
    return 0;
}       		
				
While in the second module I place 

					    
int dosomething(int x, int y){
    int z;
    
    z=x+y;
    return z;
}
            	
				
Building the application 

                    
                    ${CC} dosomething.c test.c -o test02.ppc
                
				

and running the application, the result doesn't change. What is worth pointing out is that we now have a function that can be rewritten in assembly while leaving the main module in plain C.

STEP 3: A Mixed C-Assembly Example in two modules

I will leave the main module of the previous example in plain C. For the dosomething() function I will create a dosomething.s (or .asm) file with the following code 

.text                  ; code section
    .global dosomething
dosomething:
    or   r7, r3, r3
    or   r8, r4, r4
    add  r9, r7, r8
    mr   r3, r9
#exit
    bclr 20, 0         ; ( exit )
            	
				
I will come back to the meaning of this code shortly. For now, the important point is that it can be built and that it works in the same way.

                
                    ${CC} dosomething.s test.c -o test03.ppc
            	
				

Now it's time to understand what we wrote, and what the difference is between the executable of the second step and this last one.

STEP 4: Understanding the differences

Let me start with test02.ppc, the test built from two plain C modules. I'm going to use a debugger, let's call it ${GDB}, which can be any gdb-like tool.

The command is something like:

                    
                    ${GDB} test02.ppc
            	
            	

After some version and copyright information we are ready to disassemble our function with the following command:

                    
disas dosomething
Dump of assembler code for function stext:
0x40000000 <stext+ 0>:   or   r11, r3, r3
0x40000004 <stext+ 4>:   or   r12, r4, r4
0x40000008 <stext+ 8>:  add   r12,r11,r12
0x4000000c <stext+12>:   or    r3,r12,r12
0x40000010 <stext+16>: bclr    20, 0
0x40000014 <stext+20>:  .long 0
0x40000018 <stext+24>:  .long 0
0x4000001c <stext+28>:  .long 0
End of assembler dump.					
            	
            	

Then we do the same for the one written in Assembly. 

                   
${GDB} test03.ppc

...
                
disas dosomething
Dump of assembler code for function stext:
0x40000000 <stext+ 0>:   or    r7, r3, r3
0x40000004 <stext+ 4>:   or    r8, r4, r4
0x40000008 <stext+ 8>:  add    r9, r7, r8
0x4000000c <stext+12>:   or    r3, r9, r9
0x40000010 <stext+16>: bclr    20, 0
0x40000014 <stext+20>:  .long 0
0x40000018 <stext+24>:  .long 0
0x4000001c <stext+28>:  .long 0
End of assembler dump.					
					

Only the registers used differ; the code is otherwise the same. Interestingly, however, the executable sizes are different. I will come back to this later. For now, we need to understand the code.
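
To make the two listings easier to follow, here is a minimal C sketch that mimics them instruction by instruction. The Regs structure and the helper names are made up for this illustration and are not part of any toolchain. The only fact it relies on is that `or rD,rS,rS` copies rS into rD (it is what the simplified mnemonic `mr rD,rS` stands for), because a value OR-ed with itself is unchanged.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the 32 general purpose registers. */
typedef struct { uint32_t gpr[32]; } Regs;

/* or rD,rA,rB : bitwise OR; with rA == rB it is a plain register copy. */
static void op_or(Regs *r, int d, int a, int b) {
    assert(d >= 0 && d < 32 && a >= 0 && a < 32 && b >= 0 && b < 32);
    r->gpr[d] = r->gpr[a] | r->gpr[b];
}

/* add rD,rA,rB : 32-bit addition that wraps around on overflow. */
static void op_add(Regs *r, int d, int a, int b) {
    assert(d >= 0 && d < 32 && a >= 0 && a < 32 && b >= 0 && b < 32);
    r->gpr[d] = r->gpr[a] + r->gpr[b];
}

/* Mirrors the hand-written dosomething.s (r7, r8, r9 as scratch). */
int dosomething_asm_model(int x, int y) {
    Regs r = { {0} };
    r.gpr[3] = (uint32_t)x;      /* first argument arrives in r3  */
    r.gpr[4] = (uint32_t)y;      /* second argument arrives in r4 */
    op_or(&r, 7, 3, 3);          /* or  r7, r3, r3                */
    op_or(&r, 8, 4, 4);          /* or  r8, r4, r4                */
    op_add(&r, 9, 7, 8);         /* add r9, r7, r8                */
    op_or(&r, 3, 9, 9);          /* mr  r3, r9                    */
    return (int)(int32_t)r.gpr[3];   /* result is returned in r3  */
}

/* Mirrors the compiler output (r11, r12 as scratch). */
int dosomething_cc_model(int x, int y) {
    Regs r = { {0} };
    r.gpr[3] = (uint32_t)x;
    r.gpr[4] = (uint32_t)y;
    op_or(&r, 11, 3, 3);         /* or  r11, r3, r3               */
    op_or(&r, 12, 4, 4);         /* or  r12, r4, r4               */
    op_add(&r, 12, 11, 12);      /* add r12, r11, r12             */
    op_or(&r, 3, 12, 12);        /* or  r3, r12, r12              */
    return (int)(int32_t)r.gpr[3];
}
```

Calling either function with (1,4) gives the same value as the C dosomething(), which is the point: the two listings differ only in the scratch registers they pick.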

EABI

To understand how to use the registers we need to keep in mind that we are dealing with a RISC architecture, so we have a lot of registers available. To write portable code we need a convention for register usage, parameter passing, stack organization, small data areas, and other things. This set of conventions is known as the Embedded Application Binary Interface (EABI).

First of all, let's see which data types are available.


Data Types
              
Byte        1 byte
HalfWord    2 bytes
Word        4 bytes
DoubleWord  8 bytes
QuadWord   16 bytes
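
As a rough guide, these sizes can be matched to the fixed-width types of C's <stdint.h>. The type names below are invented for this sketch, and the quadword is modelled as a pair of 64-bit halves only because standard C has no 16-byte integer type.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names, chosen only to mirror the table above. */
typedef uint8_t  ppc_byte;         /* Byte        1 byte  */
typedef uint16_t ppc_halfword;     /* HalfWord    2 bytes */
typedef uint32_t ppc_word;         /* Word        4 bytes */
typedef uint64_t ppc_doubleword;   /* DoubleWord  8 bytes */

/* Assumption: a quadword represented as two doublewords. */
typedef struct {
    uint64_t hi;
    uint64_t lo;
} ppc_quadword;                    /* QuadWord   16 bytes */

/* Aborts if the host compiler does not give the sizes listed in the table. */
void check_ppc_type_sizes(void) {
    assert(sizeof(ppc_byte) == 1);
    assert(sizeof(ppc_halfword) == 2);
    assert(sizeof(ppc_word) == 4);
    assert(sizeof(ppc_doubleword) == 8);
    assert(sizeof(ppc_quadword) == 16);
}
```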
                     

Register Usage

There are mainly two classes of registers: volatile and nonvolatile. Volatile registers don't have to be preserved across function calls, while nonvolatile registers must be preserved. Among the nonvolatile registers there is a set of dedicated registers.

These classes apply to the following kinds of registers: there are 32 general purpose registers (GPRs) and 32 floating point registers (FPRs). There are also special purpose registers (LR, CTR, XER), the condition register (CR) with its fields CR0 to CR7, and the floating point status and control register (FPSCR). All of them are 32 bits wide, with the exception of the floating point registers (64 bits), each CR field (4 bits), and some of the special purpose registers (32 or 64 bits depending on the implementation).
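
The relation between the 32-bit condition register and its 4-bit fields can be sketched in C as follows. The function and constant names are made up for this example. It assumes the usual PowerPC layout: CR0 is the most significant nibble, and inside a field written by an integer compare the bits are, from most to least significant, LT, GT, EQ and SO.

```c
#include <assert.h>
#include <stdint.h>

/* Bit masks inside one 4-bit CR field (integer compare meaning). */
enum { CR_LT = 8, CR_GT = 4, CR_EQ = 2, CR_SO = 1 };

/* Extract field CRn (n = 0..7) from a 32-bit condition register image. */
unsigned cr_field(uint32_t cr, unsigned n) {
    assert(n < 8);
    return (unsigned)((cr >> (28u - 4u * n)) & 0xFu);
}
```

For instance, a CR image of 0x80000000 has only the LT bit of CR0 set, so cr_field(0x80000000, 0) yields CR_LT.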

The following table refers to the EABI, but care must be taken because there are other ABIs as well. For instance, IBM has defined three ABIs for the PowerPC architecture (the AIX ABI for big-endian 32-bit PowerPC processors, which is nearly the same as the PowerOpen ABI, and the Windows NT and Workplace ABIs for little-endian 32-bit PowerPC processors). Other ABIs have been defined for other operating systems.

GPR0      Volatile            Depends on the context
GPR1   Nonvolatile Dedicated  Stack pointer (SP)
GPR2   Nonvolatile Dedicated  Read-only small data area anchor
GPR3      Volatile            Argument passed and/or returned value
GPR4      Volatile            Argument passed and/or returned value
GPR5      Volatile            Argument passed 
...
GPR10     Volatile            Argument passed 
GPR11     Volatile            
GPR12     Volatile            
GPR13  Nonvolatile Dedicated  Read-write small data area anchor
GPR14  Nonvolatile         
...
GPR31  Nonvolatile         

FPR0      Volatile            Depends on the context
FPR1      Volatile            Argument passed and/or returned value
FPR2      Volatile            Argument passed
...
FPR8      Volatile            Argument passed
FPR9      Volatile            
...
FPR13     Volatile            
FPR14  Nonvolatile         
...
FPR31  Nonvolatile         

CR0       Volatile            
CR1       Volatile            
CR2    Nonvolatile         
CR3    Nonvolatile         
CR4    Nonvolatile         
CR5       Volatile            
...
CR7       Volatile             

            	
All the others are volatile registers. 
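
The table can also be expressed as a small lookup, which may be handy when checking by hand which registers a routine is allowed to clobber. This is only a sketch: the enum and function names are hypothetical, and GPR1, GPR2 and GPR13 are reported simply as "dedicated" because application code should not use them as scratch registers in any case.

```c
#include <assert.h>

typedef enum { REG_VOLATILE, REG_NONVOLATILE, REG_DEDICATED } RegClass;

/* Class of general purpose register GPRn, following the table above. */
RegClass gpr_class(unsigned n) {
    assert(n < 32);
    if (n == 1 || n == 2 || n == 13)
        return REG_DEDICATED;      /* SP and small data area anchors */
    if (n >= 14)
        return REG_NONVOLATILE;    /* GPR14..GPR31 */
    return REG_VOLATILE;           /* GPR0, GPR3..GPR12 */
}

/* Class of floating point register FPRn. */
RegClass fpr_class(unsigned n) {
    assert(n < 32);
    return (n >= 14) ? REG_NONVOLATILE : REG_VOLATILE;
}

/* Class of condition register field CRn. */
RegClass cr_field_class(unsigned n) {
    assert(n < 8);
    return (n >= 2 && n <= 4) ? REG_NONVOLATILE : REG_VOLATILE;
}
```

With this lookup, the hand-written dosomething above is easy to check: r7, r8 and r9 are all volatile, so the routine needs to save nothing.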

Stack Frame

There is no push/pop instruction for the stack. Each function that calls another function (i.e. is not a leaf function) or that modifies a nonvolatile register should create a stack frame. The stack frame is created by the function's prologue code and destroyed by its epilogue code. An example of a function prologue could be the following:
          
dosomething:
    mflr  r0          ; Get Link register
    stwu  r1,-88(r1)  ; Save back chain and move SP
    stw   r0,+92(r1)  ; Save Link register
    stmw r28,+72(r1)  ; Save 4 nonvolatiles r28-r31
...
				

And here is its epilogue

            		
...
lwz   r0,+92(r1) ; Get saved Link register
mtlr  r0         ; Restore Link register
lmw  r28,+72(r1) ; Restore non-volatiles
addi  r1,  r1,88 ; Remove frame from stack
bclr  20,0
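
To see where each value of the prologue and epilogue ends up, the following C sketch replays them on a toy machine. Everything here (the Machine structure, the 512-byte memory, the helper names) is an assumption made for illustration. The offsets are the ones used in the listings: the frame is 88 bytes, r28-r31 occupy its top 16 bytes starting at offset 72, and the link register goes at offset 92, which is 4 bytes above the old stack pointer, i.e. in the caller's frame.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { FRAME_SIZE = 88, GPR_SAVE_OFFSET = 72, LR_SAVE_OFFSET = FRAME_SIZE + 4 };

/* Hypothetical toy machine: registers plus a small byte-addressed memory. */
typedef struct {
    uint32_t gpr[32];
    uint32_t lr;
    uint8_t  mem[512];
} Machine;

static void store_word(Machine *m, uint32_t addr, uint32_t value) {
    assert(addr <= sizeof m->mem - 4);
    memcpy(&m->mem[addr], &value, 4);   /* host byte order is fine for a model */
}

static uint32_t load_word(const Machine *m, uint32_t addr) {
    uint32_t value;
    assert(addr <= sizeof m->mem - 4);
    memcpy(&value, &m->mem[addr], 4);
    return value;
}

void prologue(Machine *m) {
    int i;
    m->gpr[0] = m->lr;                                   /* mflr r0          */
    store_word(m, m->gpr[1] - FRAME_SIZE, m->gpr[1]);    /* stwu r1,-88(r1): */
    m->gpr[1] -= FRAME_SIZE;                             /* back chain + SP  */
    store_word(m, m->gpr[1] + LR_SAVE_OFFSET, m->gpr[0]);/* stw r0,+92(r1)   */
    for (i = 0; i < 4; i++)                              /* stmw r28,+72(r1) */
        store_word(m, m->gpr[1] + GPR_SAVE_OFFSET + 4u * (uint32_t)i,
                   m->gpr[28 + i]);
}

void epilogue(Machine *m) {
    int i;
    m->gpr[0] = load_word(m, m->gpr[1] + LR_SAVE_OFFSET);/* lwz r0,+92(r1)   */
    m->lr = m->gpr[0];                                   /* mtlr r0          */
    for (i = 0; i < 4; i++)                              /* lmw r28,+72(r1)  */
        m->gpr[28 + i] = load_word(m, m->gpr[1] + GPR_SAVE_OFFSET
                                      + 4u * (uint32_t)i);
    m->gpr[1] += FRAME_SIZE;                             /* addi r1,r1,88    */
}
```

Starting with the stack pointer at 256, the prologue leaves it at 168 with the old value (the back chain) stored at that address; after the epilogue the stack pointer, the link register and r28-r31 are back to their original values even if the function body changed them.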

				
Another Example

Let's see how the next C function, which swaps two floats passed by pointer, is translated into assembly.

void floatSwap(float* f1, float* f2){
    float tmp;

    tmp=*f1;
    *f1=*f2;
    *f2=tmp;
}	

  	
Looking at the assembly code generated by the compiler for this plain C function we have 

                
0x00000140 <floatSwap+0>:    or  r11, r3, r3
0x00000144 <floatSwap+4>:    or  r12, r4, r4
0x00000148 <floatSwap+8>:   lfs  fr0, 0(r11)
0x0000014c <floatSwap+12>:   lfs fr13, 0(r12)
0x00000150 <floatSwap+16>:  stfs fr13, 0(r11)
0x00000154 <floatSwap+20>:  stfs  fr0, 0(r12)
0x00000158 <floatSwap+24>:  bclr   20, 0
0x0000015c <floatSwap+28>: .long 0
 
            	

Now I can build a two-module application to run the above function. All that is needed is a main that builds two arrays of floats to be swapped. The main should measure the time spent swapping a fixed amount of floating point data. That way I can then compare the time against the same swapping function written in assembly.

The main will have an extern declaration for the swapping function and a prototype for a function needed to show the time. 

            	
...
extern void floatSwap(float *, float *);
static void show_times(int *,int *,char *,int);

            	

Then a main body that creates the two arrays to be swapped.

...
int main(int argc, char ** argv){

    float a[1024], b[1024];
    ...

    for(i=0;i<1024;i++){
        a[i]=(i+1)/(float)1000;
        b[i]=-a[i];
    }

    for(i=0;i<10;i++)
        printf("\n a[%d]=%f b[%d]=%f",i,a[i],i,b[i]);

    GET_TIME(time_start[0]);
    for(i=0;i<1024;i++)
        floatSwap(&a[i],&b[i]);
    GET_TIME(time_end[0]);

    printf("\n ");

    for(i=0;i<10;i++)
        printf("\n a[%d]=%f b[%d]=%f",i,a[i],i,b[i]);

    show_times(time_start, time_end," ",1);
    return 0;
}
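
The GET_TIME macro and show_times() are not shown in these notes. For readers who want to reproduce the measurement, one possible stand-in is sketched below. It assumes a POSIX system providing clock_gettime(); the names floatSwap_c and time_swaps_ns are hypothetical, and the swap is written in C here only so that the sketch is self-contained.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* C version of the swap, standing in for the routine under test. */
void floatSwap_c(float *f1, float *f2) {
    float tmp = *f1;
    *f1 = *f2;
    *f2 = tmp;
}

/* Swap a[i] with b[i] for i = 0..n-1 and return the elapsed time in ns. */
int64_t time_swaps_ns(float *a, float *b, int n) {
    struct timespec t0, t1;
    int i;

    assert(a != 0 && b != 0 && n >= 0);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < n; i++)
        floatSwap_c(&a[i], &b[i]);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    return (int64_t)(t1.tv_sec - t0.tv_sec) * 1000000000
         + (int64_t)(t1.tv_nsec - t0.tv_nsec);
}
```

A single pass over 1024 elements is very short, so the figure it returns is sensitive to cache state and timer resolution; repeating the pass several times and keeping the minimum is a common way to get more stable numbers.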
				

Running the application the output is 

				
a[0]=0.001000 b[0]=-0.001000
a[1]=0.002000 b[1]=-0.002000
a[2]=0.003000 b[2]=-0.003000
a[3]=0.004000 b[3]=-0.004000
a[4]=0.005000 b[4]=-0.005000
a[5]=0.006000 b[5]=-0.006000
a[6]=0.007000 b[6]=-0.007000
a[7]=0.008000 b[7]=-0.008000
a[8]=0.009000 b[8]=-0.009000
a[9]=0.010000 b[9]=-0.010000
 				
a[0]=-0.001000 b[0]=0.001000
a[1]=-0.002000 b[1]=0.002000
a[2]=-0.003000 b[2]=0.003000
a[3]=-0.004000 b[3]=0.004000
a[4]=-0.005000 b[4]=0.005000
a[5]=-0.006000 b[5]=0.006000
a[6]=-0.007000 b[6]=0.007000
a[7]=-0.008000 b[7]=0.008000
a[8]=-0.009000 b[8]=0.009000
a[9]=-0.010000 b[9]=0.010000  in 1 st pass : 25714 nanoseconds

				

Now let's replace the function performing the swap with the following one, coded in assembly

.text
    .global floatSwap
floatSwap:
    or    r8, r3,r3
    or    r9, r4,r4
    lfs   fr0,0(r8)
    lfs  fr13,0(r9)
    stfs fr13,0(r8)
    stfs  fr0,0(r9)
    bclr  20,0

				

Running this latter application, the output is 

 			    
a[0]=0.001000 b[0]=-0.001000
a[1]=0.002000 b[1]=-0.002000
a[2]=0.003000 b[2]=-0.003000
a[3]=0.004000 b[3]=-0.004000
a[4]=0.005000 b[4]=-0.005000
a[5]=0.006000 b[5]=-0.006000
a[6]=0.007000 b[6]=-0.007000
a[7]=0.008000 b[7]=-0.008000
a[8]=0.009000 b[8]=-0.009000
a[9]=0.010000 b[9]=-0.010000

a[0]=-0.001000 b[0]=0.001000
a[1]=-0.002000 b[1]=0.002000
a[2]=-0.003000 b[2]=0.003000
a[3]=-0.004000 b[3]=0.004000
a[4]=-0.005000 b[4]=0.005000
a[5]=-0.006000 b[5]=0.006000
a[6]=-0.007000 b[6]=0.007000
a[7]=-0.008000 b[7]=0.008000
a[8]=-0.009000 b[8]=0.009000
a[9]=-0.010000 b[9]=0.010000  in 1 th pass : 25714 nanoseconds

             

Again the same results, with the same timing. Now I try to reduce the code to see whether I can speed up its execution. The following assembly code is a quick and dirty version of the same function.

.text
    .global floatSwap
floatSwap:
    lfs   fr0,0(r3)
    lfs  fr13,0(r4)
    stfs fr13,0(r3)
    stfs  fr0,0(r4)
    bclr  20,0
 				            		
             

Actually, the only thing I did was remove the copy of the arguments, i.e. the pointers to float, into registers r8 and r9. I did it just to see whether performance improves. Here are the results.

 			    
a[0]=0.001000 b[0]=-0.001000
a[1]=0.002000 b[1]=-0.002000
a[2]=0.003000 b[2]=-0.003000
a[3]=0.004000 b[3]=-0.004000
a[4]=0.005000 b[4]=-0.005000
a[5]=0.006000 b[5]=-0.006000
a[6]=0.007000 b[6]=-0.007000
a[7]=0.008000 b[7]=-0.008000
a[8]=0.009000 b[8]=-0.009000
a[9]=0.010000 b[9]=-0.010000

a[0]=-0.001000 b[0]=0.001000
a[1]=-0.002000 b[1]=0.002000
a[2]=-0.003000 b[2]=0.003000
a[3]=-0.004000 b[3]=0.004000
a[4]=-0.005000 b[4]=0.005000
a[5]=-0.006000 b[5]=0.006000
a[6]=-0.007000 b[6]=0.007000
a[7]=-0.008000 b[7]=0.008000
a[8]=-0.009000 b[8]=0.009000
a[9]=-0.010000 b[9]=0.010000  in 1 th pass : 23308 nanoseconds
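
To put the two timings side by side, the small helpers below (hypothetical names, plain arithmetic) compute the relative gain and the average cost per call. With the figures reported above, 25714 ns against 23308 ns for 1024 calls, the gain is a little over 9%, or roughly 25.1 ns against 22.8 ns per call including the loop and call overhead. These are single-run numbers, so they should be read as an indication rather than a precise measurement.

```c
#include <assert.h>

/* Percentage of time saved going from before_ns to after_ns. */
double percent_faster(double before_ns, double after_ns) {
    assert(before_ns > 0.0);
    return 100.0 * (before_ns - after_ns) / before_ns;
}

/* Average time per call for a pass of `calls` invocations. */
double ns_per_call(double total_ns, int calls) {
    assert(calls > 0);
    return total_ns / calls;
}
```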

            
Notes

A few notes are worth mentioning here. First, the names of the registers can differ depending on the assembler used. For instance, you may find either fp or fr as the prefix for floating point registers. A quick way to find out how your compiler and assembler work is to build a simple function in C and then look at the disassembled code using gdb or any other debugger. By the way, remember that assembly is a language, while the assembler is the tool that translates assembly code into machine code.

A note for those who were used to working with the Motorola 68K: the operand order here is more like the Intel style, with the destination first and then the sources.

A note about optimisation. The processor contains independent execution units; for instance, the integer units and the floating point unit are independent. Suppose you have some code such as

lfsx   fr0, r7,r20 ; load 1st FP32 value 
lfsx  fr12, r7,r21 ; load 2nd one 
stfsx fr12, r7,r20 ; store latter value on 1st addr
stfsx  fr0, r7,r21 ; store former value on 2nd addr             

mullw r15, r9, r3
divw  r16,r15, r4
mullw r16,r16, r4

			     

Some optimisation is possible here: the first four lines handle floating point values, while the last three lines work on general purpose registers. Moreover, the results of the first four lines are not needed by the other lines. Since the two groups are handled by execution units that work independently, I can interleave the lines so that the units work in parallel. In this way I speed up the code.

lfsx   fr0, r7,r20 ; load 1st FP32 value 
mullw r15, r9, r3
lfsx  fr12, r7,r21 ; load 2nd one 
divw  r16,r15, r4
stfsx fr12, r7,r20 ; store latter value on 1st addr
mullw r16,r16, r4
stfsx  fr0, r7,r21 ; store former value on 2nd addr             
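
The reordering is legal only because the two groups share no results. The C sketch below, with made-up names and plain variables standing in for registers and memory, runs both orderings and lets you check that they produce the same values. It says nothing about speed, which depends on the processor; it only illustrates that the interleaving does not change what is computed.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { float first, second; int32_t r15, r16; } Result;

/* mullw: low 32 bits of the product. */
static int32_t mullw(int32_t a, int32_t b) {
    return (int32_t)((uint32_t)a * (uint32_t)b);
}

/* divw: signed 32-bit division (undefined cases rejected). */
static int32_t divw(int32_t a, int32_t b) {
    assert(b != 0 && !(a == INT32_MIN && b == -1));
    return a / b;
}

/* Original order: the four float moves, then the three integer operations. */
Result run_grouped(float first, float second, int32_t r9, int32_t r3, int32_t r4) {
    Result s = { first, second, 0, 0 };
    float fr0, fr12;
    fr0 = s.first;                 /* lfsx  fr0  */
    fr12 = s.second;               /* lfsx  fr12 */
    s.first = fr12;                /* stfsx fr12 */
    s.second = fr0;                /* stfsx fr0  */
    s.r15 = mullw(r9, r3);         /* mullw r15, r9, r3  */
    s.r16 = divw(s.r15, r4);       /* divw  r16, r15, r4 */
    s.r16 = mullw(s.r16, r4);      /* mullw r16, r16, r4 */
    return s;
}

/* Interleaved order: same operations, alternating between the two groups. */
Result run_interleaved(float first, float second, int32_t r9, int32_t r3, int32_t r4) {
    Result s = { first, second, 0, 0 };
    float fr0, fr12;
    fr0 = s.first;                 /* lfsx  fr0  */
    s.r15 = mullw(r9, r3);         /* mullw r15, r9, r3  */
    fr12 = s.second;               /* lfsx  fr12 */
    s.r16 = divw(s.r15, r4);       /* divw  r16, r15, r4 */
    s.first = fr12;                /* stfsx fr12 */
    s.r16 = mullw(s.r16, r4);      /* mullw r16, r16, r4 */
    s.second = fr0;                /* stfsx fr0  */
    return s;
}
```

Note that within each group the order is unchanged: the divw still needs the result of the first mullw, and each store still comes after both loads.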

			     

...TO BE FINISHED: WORK IN PROGRESS...