PPC Assembly

Preface

The following notes are quite old. What is worth showing here, however, is a method that can be applied to learn any assembly language. What follows is the original note I wrote.

The following notes show how to start with a simple C code and, step by step, end up with assembly code.

When I started writing these notes I had a PPC 7457; later I had some 7410s. This was not a big issue: the code that follows can be built with most of the available compilers, linkers and assemblers, although a few changes may need to be applied to the code. For the sake of generality, I try not to make explicit reference to any specific compiler, assembler or linker.

By the way, the processors mentioned above are just "children" of other processors. The three main progenitors referenced in this document are the MPC750 (examples of derivatives are the MPC740 and MPC755), the MPC7400 (an example of a derivative is the MPC7410), and the MPC7450 (an example of a derivative is the MPC7457).

STEP 1: A Plain C Example in one module

Let's start with a simple example in C.

#include <stdio.h>

int dosomething(int,int);

int main(int argc, char ** argv){
int a,b,r;

a=1;
b=4;

r=dosomething(a,b);
printf("\ndosomething(%d,%d)=%d\n",a,b,r);
return 0;

}
int dosomething(int x, int y){

int z;

z=x+y;
return z;
}

Now I can build the code

${CC} test.c -o test01.ppc

I used ${CC} so that everyone can substitute their own compiler.
Running the application, the output is

dosomething(1,4)=5

STEP 2: A Plain C Example in two modules

Now we split the code into two modules. In the first one I placed the following:

#include <stdio.h>

extern int dosomething(int,int);

int main(int argc, char ** argv){

int a,b,r;

a=1;
b=4;

r=dosomething(a,b);
printf("\ndosomething(%d,%d)=%d\n",a,b,r);
return 0;

}

In the second module I placed

int dosomething(int x, int y){
int z;

z=x+y;
return z;
}

Here is the command to build the application

${CC} dosomething.c test.c -o test02.ppc

Running the application, the result doesn't change. What's worth pointing out is that there is now a piece of code that can be written in assembly while leaving the main module in plain C.

STEP 3: A Mixed C-Assembly Example in two modules

I'm leaving the main module of the previous example in plain C. For the dosomething() function I create a dosomething.s (or .asm) file with the following code

.text ;code section
.global dosomething
dosomething:
or r7, r3, r3
or r8, r4, r4
add r9, r7, r8
mr r3, r9
#exit
bclr 20, 0 ;( exit )

I will return to the meaning of the code in a while. For now, it is important to understand that this code can be built and that it works in the same way.

${CC} dosomething.s test.c -o test03.ppc

Now it's time to understand what we wrote, and what the difference is between the executable of the second step and this last one.

STEP 4: Understanding the differences

Let me start with test02.ppc, which is the test made of two plain C modules. I'm going to use a debugger, let's call it ${GDB}, which can be any gdb-like tool.

The command is something like:

${GDB} test02.ppc

After some version and copyright information we are ready to disassemble our function via the following command:

disas dosomething
Dump of assembler code for function stext:
0x40000000 <stext+ 0>: or r11, r3, r3
0x40000004 <stext+ 4>: or r12, r4, r4
0x40000008 <stext+ 8>: add r12,r11,r12
0x4000000c <stext+12>: or r3,r12,r12
0x40000010 <stext+16>: bclr 20, 0
0x40000014 <stext+20>: .long 0
0x40000018 <stext+24>: .long 0
0x4000001c <stext+28>: .long 0
End of assembler dump.

Then we do the same for the one written in assembly.

${GDB} test03.ppc

...
disas dosomething
Dump of assembler code for function stext:
0x40000000 <stext+ 0>: or r7, r3, r3
0x40000004 <stext+ 4>: or r8, r4, r4
0x40000008 <stext+ 8>: add r9, r7, r8
0x4000000c <stext+12>: or r3, r9, r9
0x40000010 <stext+16>: bclr 20, 0
0x40000014 <stext+20>: .long 0
0x40000018 <stext+24>: .long 0
0x4000001c <stext+28>: .long 0
End of assembler dump.

It's just a matter of which registers are used; the code is the same. It's interesting to note, however, that the executable sizes are different. I will come back to this fact later. For the moment, we need to understand the code.
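To make the two listings easier to read, here is a minimal sketch in C that mimics the hand-written dosomething() one instruction at a time. The Regs structure and the op_or(), op_add() and model_dosomething() helpers are names I made up for the illustration; they are not part of any toolchain. The point it shows is that "or rA, rS, rS" is simply a register copy (which is what the mr mnemonic stands for).

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of the 32 general purpose registers (illustration only). */
typedef struct { uint32_t gpr[32]; } Regs;

/* or rA,rS,rB : rA = rS | rB. When rS and rB are the same register
   the result is a plain copy of rS into rA. */
void op_or(Regs *r, int ra, int rs, int rb) {
    r->gpr[ra] = r->gpr[rs] | r->gpr[rb];
}

/* add rD,rA,rB : rD = rA + rB (32-bit wrap-around) */
void op_add(Regs *r, int rd, int ra, int rb) {
    r->gpr[rd] = r->gpr[ra] + r->gpr[rb];
}

/* The assembly of STEP 3, one helper call per instruction. */
int32_t model_dosomething(int32_t x, int32_t y) {
    Regs r = {{0}};
    r.gpr[3] = (uint32_t)x;     /* first argument arrives in r3  */
    r.gpr[4] = (uint32_t)y;     /* second argument arrives in r4 */
    op_or(&r, 7, 3, 3);         /* or  r7, r3, r3  (copy r3 to r7) */
    op_or(&r, 8, 4, 4);         /* or  r8, r4, r4  (copy r4 to r8) */
    op_add(&r, 9, 7, 8);        /* add r9, r7, r8                  */
    op_or(&r, 3, 9, 9);         /* mr  r3, r9      (result in r3)  */
    return (int32_t)r.gpr[3];   /* bclr 20, 0 returns to the caller */
}
```

Calling model_dosomething(1, 4) follows the same path as the call made by main() in the examples above.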

EABI

To understand how to use the registers we need to keep in mind that we are dealing with a RISC architecture, hence we have a lot of registers available. To be able to write portable code, we need a convention for register usage, parameter passing, stack organization, small data areas, and other things. This set of conventions is known as the Embedded Application Binary Interface (EABI).

First of all, let's see which data types are available.

Data Types

Byte 1 byte
HalfWord 2 bytes
Word 4 bytes
DoubleWord 8 bytes
QuadWord 16 bytes
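
As a quick cross-check of the table above, the sizes can be written down with the fixed-width types of C99. This is only a sketch: the ppc_* type names and the ppc_sizes_match_table() helper are my own, and since standard C has no 128-bit integer type the quad word is modelled here as a structure of two 64-bit parts.

```c
#include <assert.h>
#include <stdint.h>

typedef uint8_t  ppc_byte;         /* Byte        1 byte  */
typedef uint16_t ppc_halfword;     /* HalfWord    2 bytes */
typedef uint32_t ppc_word;         /* Word        4 bytes */
typedef uint64_t ppc_doubleword;   /* DoubleWord  8 bytes */

/* No standard 128-bit type in C: two 64-bit parts stand in for it. */
typedef struct { uint64_t part[2]; } ppc_quadword;

/* Returns 1 when every type has the size listed in the table. */
int ppc_sizes_match_table(void) {
    return sizeof(ppc_byte) == 1
        && sizeof(ppc_halfword) == 2
        && sizeof(ppc_word) == 4
        && sizeof(ppc_doubleword) == 8
        && sizeof(ppc_quadword) == 16;
}
```

Note that a C int is not guaranteed to be a Word on every compiler, which is why the sketch uses the stdint.h types.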

Register Usage
There are mainly two classes of registers: volatile registers and nonvolatile ones. Volatile registers don't have to be preserved across function calls, while nonvolatile registers must be preserved. In addition, some registers are dedicated to a specific purpose.

The above classes apply to the following kinds of registers: there are 32 general purpose registers (GPRs) and 32 floating point registers (FPRs). Moreover, there are special purpose registers (LR, CTR, XER), the condition register fields (CR0 to CR7), and the floating point status and control register (FPSCR). All of them are 32 bits wide, with the exception of the floating point registers (64 bits), each CR field (4 bits), and some of the special purpose registers (32 or 64 bits, depending on the implementation).

The following table refers to the EABI, but care must be taken because other ABIs exist as well. For instance, IBM has defined three ABIs for the PowerPC architecture: the AIX ABI for big-endian 32-bit PowerPC processors (nearly the same as the PowerOpen ABI), and the Windows NT and Workplace ABIs for little-endian 32-bit PowerPC processors. Other ABIs have been defined for other operating systems.

GPR0 Volatile Depends on the context
GPR1 Volatile Dedicated Stack pointer (SP)
GPR2 Volatile Dedicated Read-only small data area anchor
GPR3 Volatile Argument passed and/or returned value
GPR4 Volatile Argument passed and/or returned value
GPR5 Volatile Argument passed
...
GPR10 Volatile Argument passed
GPR11 Volatile
GPR12 Volatile
GPR13 Nonvolatile Dedicated Read-write small data area anchor
GPR14 Nonvolatile
...
GPR31 Nonvolatile

FPR0 Volatile Depends on the context
FPR1 Volatile Argument passed and/or returned value
FPR2 Volatile Argument passed
...
FPR8 Volatile Argument passed
FPR9 Volatile
...
FPR13 Volatile
FPR14 Nonvolatile
...
FPR31 Nonvolatile

CR0 Volatile
CR1 Volatile
CR2 Nonvolatile
CR3 Nonvolatile
CR4 Nonvolatile
CR5 Volatile
...
CR7 Volatile

All the others are volatile registers.
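
The tables above can be condensed into a few predicates. The following is a minimal sketch in C; the function names are hypothetical, and the rule in gpr_for_int_arg() that integer arguments beyond the eighth do not travel in a GPR is my reading of the convention (they are normally passed on the stack), so check it against the ABI document of your toolchain.

```c
#include <assert.h>
#include <stdbool.h>

/* GPR13..GPR31 are nonvolatile; GPR0..GPR12 are volatile. */
bool gpr_is_nonvolatile(int n) {
    assert(n >= 0 && n < 32);
    return n >= 13;
}

/* GPR3..GPR10 carry arguments; GPR3 and GPR4 also carry return values. */
bool gpr_is_argument(int n) {
    assert(n >= 0 && n < 32);
    return n >= 3 && n <= 10;
}

/* FPR14..FPR31 are nonvolatile; FPR1..FPR8 carry arguments. */
bool fpr_is_nonvolatile(int n) {
    assert(n >= 0 && n < 32);
    return n >= 14;
}

bool fpr_is_argument(int n) {
    assert(n >= 0 && n < 32);
    return n >= 1 && n <= 8;
}

/* Only the CR2, CR3 and CR4 fields are nonvolatile. */
bool cr_is_nonvolatile(int n) {
    assert(n >= 0 && n < 8);
    return n >= 2 && n <= 4;
}

/* GPR that holds the i-th (0-based) integer argument,
   or -1 when the argument does not fit in GPR3..GPR10. */
int gpr_for_int_arg(int i) {
    return (i >= 0 && i < 8) ? 3 + i : -1;
}
```

For dosomething(int x, int y) this gives gpr_for_int_arg(0) == 3 and gpr_for_int_arg(1) == 4, which matches the r3 and r4 seen in the disassembly, while the scratch registers r7, r8, r9, r11 and r12 used there are all volatile.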

Stack Frame
There is no push/pop instruction for the stack. Each function that calls another function (i.e. is not a leaf function) or that is going to modify a nonvolatile register should create a stack frame in memory. The stack frame is created by the function's prologue code and destroyed by its epilogue code. An example of a function's prologue could be the following one

dosomething: mflr r0 ; Get Link register
stwu r1,-88(r1) ; Save Back chain and move SP
stw r0,+92(r1) ; Save Link register
stmw r28,+72(r1) ; Save 4 non-volatiles r28-r31
...

And here is its epilogue

...
lwz r0,+92(r1) ; Get saved Link register
mtlr r0 ; Restore Link register
lmw r28,+72(r1) ; Restore non-volatiles
addi r1, r1,88 ; Remove frame from stack
bclr 20,0
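
To see where each value lands in memory, the prologue and the epilogue above can be replayed on a toy machine. This is a sketch under my own assumptions: the Cpu structure, its 256-byte memory and the helper names are invented for the illustration, and only the instructions of the listing are modelled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MEM_WORDS 64                 /* toy memory: 256 bytes */

typedef struct {
    uint32_t gpr[32];
    uint32_t lr;
    uint32_t mem[MEM_WORDS];         /* indexed by byte address / 4 */
} Cpu;

static uint32_t *word_at(Cpu *c, uint32_t addr) {
    assert(addr % 4 == 0 && addr / 4 < MEM_WORDS);
    return &c->mem[addr / 4];
}

void prologue(Cpu *c) {
    c->gpr[0] = c->lr;                          /* mflr r0           */
    *word_at(c, c->gpr[1] - 88) = c->gpr[1];    /* stwu r1,-88(r1):  */
    c->gpr[1] -= 88;                            /*  store, then move SP */
    *word_at(c, c->gpr[1] + 92) = c->gpr[0];    /* stw  r0,+92(r1)   */
    for (int i = 28; i < 32; i++)               /* stmw r28,+72(r1)  */
        *word_at(c, c->gpr[1] + 72 + 4 * (uint32_t)(i - 28)) = c->gpr[i];
}

void epilogue(Cpu *c) {
    c->gpr[0] = *word_at(c, c->gpr[1] + 92);    /* lwz  r0,+92(r1)   */
    c->lr = c->gpr[0];                          /* mtlr r0           */
    for (int i = 28; i < 32; i++)               /* lmw  r28,+72(r1)  */
        c->gpr[i] = *word_at(c, c->gpr[1] + 72 + 4 * (uint32_t)(i - 28));
    c->gpr[1] += 88;                            /* addi r1,r1,88     */
}

/* Returns 1 if SP, LR and r28..r31 survive a body that clobbers them. */
int frame_roundtrip_ok(void) {
    Cpu c;
    memset(&c, 0, sizeof c);
    c.gpr[1] = 200;                             /* caller's stack pointer */
    c.lr = 0x1234;                              /* caller's return address */
    for (int i = 28; i < 32; i++) c.gpr[i] = 100u + (uint32_t)i;

    prologue(&c);
    int back_chain_ok = (c.gpr[1] == 112) && (*word_at(&c, 112) == 200);

    c.lr = 0;                                   /* the body calls something */
    for (int i = 28; i < 32; i++) c.gpr[i] = 0; /* and uses r28..r31        */

    epilogue(&c);
    return back_chain_ok && c.gpr[1] == 200 && c.lr == 0x1234
        && c.gpr[28] == 128 && c.gpr[29] == 129
        && c.gpr[30] == 130 && c.gpr[31] == 131;
}
```

With a frame of 88 bytes, the offset +92 points 4 bytes above the old stack pointer, so the link register is saved in the caller's frame, while r28-r31 occupy the top 16 bytes (offsets 72 to 87) of the new frame.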

Another Example

Let's see how the next C function, which swaps two floating point values passed through pointers to float, is translated into assembly.

void floatSwap(float* f1, float* f2){
float tmp;

tmp=*f1;
*f1=*f2;
*f2=tmp;
}

Looking at the assembly code generated by the compiler for this plain C function we have

0x00000140 <floatSwap+0>: or r11, r3, r3
0x00000144 <floatSwap+4>: or r12, r4, r4
0x00000148 <floatSwap+8>: lfs fr0, 0(r11)
0x0000014c <floatSwap+12>: lfs fr13, 0(r12)
0x00000150 <floatSwap+16>: stfs fr13, 0(r11)
0x00000154 <floatSwap+20>: stfs fr0, 0(r12)
0x00000158 <floatSwap+24>: bclr 20, 0
0x0000015c <floatSwap+28>: .long 0

Now I can build a two-module application to run the above function. All that is needed is a main that builds two arrays of floating point values to be swapped. The main should measure the time spent swapping a fixed amount of floating point data. In this way I can then compare the timing with that of the same swapping function written in assembly.

The main will have an extern declaration for the swapping function and a prototype for a function needed to show the time.

...

extern void floatSwap(float *, float *);
static void show_times(int *,int *,char *,int);

Then a main body to create two arrays to be swapped.

...

int main(int argc, char ** argv){

float a[1024], b[1024];
...

for(i=0;i<1024;i++){
a[i]=(i+1)/(float)1000;
b[i]=-a[i];
}

for(i=0;i<10;i++)
printf("\n a[%d]=%f b[%d]=%f",i,a[i],i,b[i]);

GET_TIME(time_start[0]);

for(i=0;i<1024;i++)
floatSwap(&a[i],&b[i]);

GET_TIME(time_end[0]);

printf("\n ");

for(i=0;i<10;i++)
printf("\n a[%d]=%f b[%d]=%f",i,a[i],i,b[i]);

show_times(time_start, time_end," ",1);

return 0;
}
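
The listing above leaves out GET_TIME and show_times, so here is a minimal sketch of how such a measurement could be done on a POSIX system. Everything timing-related in it is an assumption of mine: now_ns() is a hypothetical stand-in for GET_TIME built on clock_gettime(), time_swaps() is a made-up helper, and floatSwap() is repeated in plain C only to keep the sketch self-contained.

```c
#define _POSIX_C_SOURCE 199309L      /* for clock_gettime() */
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Plain C version of the function under test. */
void floatSwap(float *f1, float *f2) {
    float tmp = *f1;
    *f1 = *f2;
    *f2 = tmp;
}

/* Hypothetical replacement for GET_TIME: monotonic clock in nanoseconds. */
int64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + (int64_t)ts.tv_nsec;
}

/* Fills a[] and b[] as in the main above, swaps them element by
   element and returns the time spent in the swapping loop only. */
int64_t time_swaps(float *a, float *b, int n) {
    assert(n > 0);
    for (int i = 0; i < n; i++) {
        a[i] = (i + 1) / (float)1000;
        b[i] = -a[i];
    }
    int64_t start = now_ns();
    for (int i = 0; i < n; i++)
        floatSwap(&a[i], &b[i]);
    return now_ns() - start;
}
```

Calling time_swaps(a, b, 1024) on two arrays of 1024 floats reproduces the measured loop. The value it returns includes the loop and call overhead, and it varies from run to run, so it is best taken over several passes.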

Running the application the output is

a[0]=0.001000 b[0]=-0.001000
a[1]=0.002000 b[1]=-0.002000
a[2]=0.003000 b[2]=-0.003000
a[3]=0.004000 b[3]=-0.004000
a[4]=0.005000 b[4]=-0.005000
a[5]=0.006000 b[5]=-0.006000
a[6]=0.007000 b[6]=-0.007000
a[7]=0.008000 b[7]=-0.008000
a[8]=0.009000 b[8]=-0.009000
a[9]=0.010000 b[9]=-0.010000

a[0]=-0.001000 b[0]=0.001000
a[1]=-0.002000 b[1]=0.002000
a[2]=-0.003000 b[2]=0.003000
a[3]=-0.004000 b[3]=0.004000
a[4]=-0.005000 b[4]=0.005000
a[5]=-0.006000 b[5]=0.006000
a[6]=-0.007000 b[6]=0.007000
a[7]=-0.008000 b[7]=0.008000
a[8]=-0.009000 b[8]=0.009000
a[9]=-0.010000 b[9]=0.010000 in 1 st pass : 25714 nanoseconds

Now I just replace the function performing the swap with the following one, coded in assembly

.text
.global floatSwap
floatSwap:
or r8, r3,r3
or r9, r4,r4
lfs fr0,0(r8)
lfs fr13,0(r9)
stfs fr13,0(r8)
stfs fr0,0(r9)
bclr 20,0

Running this latter application, the output is

a[0]=0.001000 b[0]=-0.001000
a[1]=0.002000 b[1]=-0.002000
a[2]=0.003000 b[2]=-0.003000
a[3]=0.004000 b[3]=-0.004000
a[4]=0.005000 b[4]=-0.005000
a[5]=0.006000 b[5]=-0.006000
a[6]=0.007000 b[6]=-0.007000
a[7]=0.008000 b[7]=-0.008000
a[8]=0.009000 b[8]=-0.009000
a[9]=0.010000 b[9]=-0.010000

a[0]=-0.001000 b[0]=0.001000
a[1]=-0.002000 b[1]=0.002000
a[2]=-0.003000 b[2]=0.003000
a[3]=-0.004000 b[3]=0.004000
a[4]=-0.005000 b[4]=0.005000
a[5]=-0.006000 b[5]=0.006000
a[6]=-0.007000 b[6]=0.007000
a[7]=-0.008000 b[7]=0.008000
a[8]=-0.009000 b[8]=0.009000
a[9]=-0.010000 b[9]=0.010000 in 1 th pass : 25714 nanoseconds

Again the same results, and with the same timing. Now I try to reduce the code, just to see if I can speed up its execution.

Assembly optimisation

The following assembly code is a quick and dirty version for the same function.

.text
.global floatSwap
floatSwap:
lfs fr0,0(r3)
lfs fr13,0(r4)
stfs fr13,0(r3)
stfs fr0,0(r4)
bclr 20,0

Actually, the only thing I did was to remove the copy of the arguments, i.e. the pointers to float, into the registers r8 and r9. I did it just to see some better performance. Here are the results.

a[0]=0.001000 b[0]=-0.001000
a[1]=0.002000 b[1]=-0.002000
a[2]=0.003000 b[2]=-0.003000
a[3]=0.004000 b[3]=-0.004000
a[4]=0.005000 b[4]=-0.005000
a[5]=0.006000 b[5]=-0.006000
a[6]=0.007000 b[6]=-0.007000
a[7]=0.008000 b[7]=-0.008000
a[8]=0.009000 b[8]=-0.009000
a[9]=0.010000 b[9]=-0.010000

a[0]=-0.001000 b[0]=0.001000
a[1]=-0.002000 b[1]=0.002000
a[2]=-0.003000 b[2]=0.003000
a[3]=-0.004000 b[3]=0.004000
a[4]=-0.005000 b[4]=0.005000
a[5]=-0.006000 b[5]=0.006000
a[6]=-0.007000 b[6]=0.007000
a[7]=-0.008000 b[7]=0.008000
a[8]=-0.009000 b[8]=0.009000
a[9]=-0.010000 b[9]=0.010000 in 1 th pass : 23308 nanoseconds
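
To put the two timings side by side, the arithmetic can be sketched in a couple of C helpers (percent_saved() and ns_per_call() are names I made up). With the figures printed above, 25714 ns before and 23308 ns after for 1024 calls, the saving is a bit more than 9%, or roughly 25.1 against 22.8 nanoseconds per call. Keep in mind that these are single measurements that also include the loop and call overhead.

```c
#include <assert.h>

/* Relative saving, in percent, going from before_ns to after_ns. */
double percent_saved(double before_ns, double after_ns) {
    assert(before_ns > 0.0);
    return 100.0 * (before_ns - after_ns) / before_ns;
}

/* Average time of one call when total_ns covers `calls` calls. */
double ns_per_call(double total_ns, int calls) {
    assert(calls > 0);
    return total_ns / calls;
}
```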

Notes

There are some notes worth mentioning here. First, the names of the registers can differ depending on the assembler used. For instance, you may find fp or fr as the prefix for the floating point registers. To find out how your compiler and assembler work, a quick way is to build a simple function in C and then look at the disassembled code using gdb or any other debugger tool. By the way, remember that assembly is a language, while the assembler is the tool that translates assembly code into machine code.

A note for those who were used to working with the Motorola 68K: here it's more like the Intel style, where the destination comes first and then the source.

Assembly Code Parallelisation

A note about optimisation. The processor contains independent execution units; for instance, the integer units and the floating point unit are independent. Suppose you have some code such as

lfsx fr0, r7,r20 ; load 1st FP32 value
lfsx fr12, r7,r21 ; load 2nd one
stfsx fr12, r7,r20 ; store latter value on 1st addr
stfsx fr0, r7,r21 ; store former value on 2nd addr

mullw r15, r9, r3
divw r16,r15, r4
mullw r16,r16, r4

you can do some optimisation by considering that one part of the code (the first 4 lines) handles floating point values, while the last three lines work on general purpose registers. Moreover, the results of the first 4 lines are not needed by the other lines. Since the floating point loads and stores are executed independently from the integer arithmetic, I can interleave the lines so that the units work in parallel. In this way I speed up the code.

lfsx fr0, r7,r20 ; load 1st FP32 value
mullw r15, r9, r3
lfsx fr12, r7,r21 ; load 2nd one
divw r16,r15, r4
stfsx fr12, r7,r20 ; store latter value on 1st addr
mullw r16,r16, r4
stfsx fr0, r7,r21 ; store former value on 2nd addr
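
The reordering is safe only because the two groups of instructions share no results. The following sketch in C replays both orderings with plain variables named after the registers and shows that they give the same answer; it says nothing about speed, since a C compiler schedules the statements as it likes. The Result structure and the two function names are hypothetical, and the inputs are assumed small enough that the multiplication does not overflow.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { float x, y; int32_t r16; } Result;

/* Original order: the four float moves first, then the integer chain. */
Result sequential(float x, float y, int32_t r9, int32_t r3, int32_t r4) {
    assert(r4 != 0);
    float fr0  = x;             /* lfsx  fr0  : load 1st value   */
    float fr12 = y;             /* lfsx  fr12 : load 2nd value   */
    x = fr12;                   /* stfsx fr12 : store on 1st addr */
    y = fr0;                    /* stfsx fr0  : store on 2nd addr */
    int32_t r15 = r9 * r3;      /* mullw r15, r9, r3  */
    int32_t r16 = r15 / r4;     /* divw  r16, r15, r4 */
    r16 = r16 * r4;             /* mullw r16, r16, r4 */
    return (Result){ x, y, r16 };
}

/* Interleaved order: same statements, alternated as in the listing above. */
Result interleaved(float x, float y, int32_t r9, int32_t r3, int32_t r4) {
    assert(r4 != 0);
    float fr0  = x;             /* lfsx  fr0  */
    int32_t r15 = r9 * r3;      /* mullw r15  */
    float fr12 = y;             /* lfsx  fr12 */
    int32_t r16 = r15 / r4;     /* divw  r16  */
    x = fr12;                   /* stfsx fr12 */
    r16 = r16 * r4;             /* mullw r16  */
    y = fr0;                    /* stfsx fr0  */
    return (Result){ x, y, r16 };
}
```

Note that the three integer instructions depend on each other (each one needs the result of the previous one), so they cannot overlap among themselves; what the interleaving exposes is the overlap between that chain and the independent loads and stores.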

...TO BE FINISHED: WORK IN PROGRESS...