5.2: The Optimal Solution
5.2: The Optimal Solution
The choice of the algorithm depends on the real time requirements and on the DSP features: when the DSP is able to perform both a multiplication or an addition within the same CPU time, as in our case, it is not necessary to adopt the Rader Brenner solution (see table 5.1.) In fact, in this latter case, the total number of operations is higher than the one for the optimised Radix-2 solution.
In addition, the solution proposed by Winograd [4] has been rejected mainly for two reasons: first because the idea behind the algorithm is to reduce the number of multiplications to the detriment of the number of additions, which grows. This could be useful in case of a typical General Purpose processor (CISC) where generally a multiplication requires more cycles than the sum. On the contrary, looking at the table 5.1, you can see that the total number of operations (additions and products together) obtained with Winegard, is higher than the optimised Radix-2 solution. Second, the Winograd solution uses a recursive structure. This leads to a different memory management not as suitable for our pipelined DSP as for a CISC machine: the frequently function calls could prevent the CPU from a proper and efficient use of the pipeline, and the resulting code could be reach of “not operation” (nop) instructions. Such situation results in a low DSP performance.
We have also rejected the Radix 4 and 8 algorithms because we chose to handle the DSP memory using blocks of 1024 word1. The use of Radix 4 and 8 algorithms, with N = 1024, let you compute FFTs only for sequences with a number of points that is only power of four or 8. In this case, you have to implement different algorithms depending on the window width of the data to be analysed.
An attractive solution is the algorithm that considers a generic number of data, and processes the data in a packed form. Unfortunately, also its structure is strictly dependent on the samples number, resulting in a less general and less flexible solution. It is different the case in which the number of input data is a power of 2; in fact, these algorithms let you to calculate the results on site, that is in the same memory location where the original input data are located without the need to move the data itself.
Somehow, also the optimised Radix 4 algorithm, as suggested by the Bellinger, is not recommended. In fact, even if it is very efficient (about 25% more versus the Radix 2,) nevertheless it requires a considerable space of memory to store the matrices obtained from the Kronekher factorisation. In this case, we would not be able to load these matrices in the ONCHIP memory of the DSP or even elsewhere on the DSP.
1A word in our DSP and in the Macintosh environment is 32 bits long.