Making Fast Fourier Transforms a lot faster

During my advanced algorithms course last semester, I learnt about fast Fourier transforms and how they can be implemented recursively. A fast Fourier transform computes the discrete Fourier transform (DFT) of a sequence by exploiting the special properties of the complex roots of unity, which brings the cost down from the O(n^2) of the straightforward method to O(n log n).
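
For context, here is a minimal sketch of the straightforward O(n^2) DFT that the FFT improves on. This is not the code from my repository, just an illustration, using C's double complex type directly:

#include <complex.h>
#include <stdlib.h>

#define PI 3.14159265358979323846

/* Naive O(n^2) DFT: every output element is a full sum over the input. */
double complex * naiveDFT(double complex * inputArray, int len)
{
    double complex * outputArray = malloc(len * sizeof(double complex));
    for (int k = 0; k < len; k++)
    {
        outputArray[k] = 0;
        for (int j = 0; j < len; j++)
        {
            /* omega_n^(j*k) = e^(2*pi*i*j*k/n): the root-of-unity weight */
            outputArray[k] += inputArray[j] * cexp(2 * PI * I * j * k / len);
        }
    }
    return outputArray;
}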

The approach is quite simple. I've followed the algorithm given in Introduction to Algorithms by CLRS (a must-read if you're an algorithms enthusiast like me!) and implemented it in C. The only new thing I had to learn about was the 'complex' data type that C offers. The code can be found here.
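
The snippets below use a dcom type for complex numbers. My full code defines it, but for the excerpts here I'm assuming a typedef along the following lines (the exact name of the alias is my guess), together with a tiny demo of the complex type itself:

#include <complex.h>   /* double complex, cexp, creal, cimag */
#include <stdio.h>

#define PI 3.14159265358979323846

typedef double complex dcom;   /* shorthand used in the snippets below */

int main(void)
{
    dcom omega = cexp(2 * PI * I / 8);   /* a primitive 8th root of unity */
    printf("omega = %f + %fi\n", creal(omega), cimag(omega));   /* ~0.707107 + 0.707107i */
    return 0;
}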

Coming to a parallel implementation of the same code: since I was recently introduced to OpenMP, I thought, why not experiment with it? The first step is to profile the code and find the most expensive (and potentially parallelizable) parts. GNU gprof is a very useful profiling tool that breaks down where the program spends its time and reports statistics for each function on the given input. After profiling my serial code, here's what I got:

[gprof profile output: roughly 60% of the run time is spent in fastFourierTrans]
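
For reference, the profile above was generated along these lines (the file names here are placeholders for my actual source file and binary): compile with gcc -pg -o fft fft_serial.c -lm, run ./fft once so that it writes gmon.out, and then run gprof ./fft gmon.out to see the flat profile and call graph.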

As we can see, the program spends most of its time in the fastFourierTrans function (almost 60%!). The time spent in the other parts of the code is negligible in comparison. So we look inside this function for potentially parallelizable snippets. This is the function:

dcom * fastFourierTrans(dcom * inputArray, int len)
{
    /* Base case: the DFT of a single element is the element itself. */
    if(len == 1){
        return inputArray;
    }

    dcom omega = 1;
    dcom omegaN = cexp(2*PI*I/len); //the principal len-th root of unity

    /* Recursively transform the even- and odd-indexed halves. */
    dcom * evenSubArray = fastFourierTrans(split_array(inputArray, len, 0), len/2);
    dcom * oddSubArray = fastFourierTrans(split_array(inputArray, len, 1), len/2);

    dcom * outputArray = malloc(len*sizeof(dcom));

    /* Butterfly step: combine the two half-size transforms. */
    for(int i=0; i<(len/2); i++)
    {
        outputArray[i] = evenSubArray[i] + omega * oddSubArray[i];
        outputArray[i+(len/2)] = evenSubArray[i] - omega * oddSubArray[i];
        omega = omegaN*omega;
    }

    free(evenSubArray);
    free(oddSubArray);
    return outputArray;
}

We notice that the two recursive calls are independent of each other and can be performed in parallel. For this we make use of the OpenMP sections construct, which distributes the enclosed blocks among the threads of the team. The requirement is that each block must be independent of the others; each block is then executed exactly once, by one thread. We specify this as follows (the two result pointers are declared before the region, so the assignments made inside the sections are visible afterwards):
dcom * evenSubArray;
dcom * oddSubArray;

#pragma omp parallel sections shared(inputArray) firstprivate(len)
{
    #pragma omp section
    {
        evenSubArray = fastFourierTrans(split_array(inputArray, len, 0), len/2);
    }
    #pragma omp section
    {
        oddSubArray = fastFourierTrans(split_array(inputArray, len, 1), len/2);
    }
}

The shared clause indicates that the variable is shared among the threads in the team. The firstprivate clause specifies that each thread gets its own copy of the variable, initialised with the value the variable had just before the parallel construct, which in this case is the value of len passed as an argument.
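
As a quick aside, here is a toy example (not part of the FFT code) that I find handy for remembering the difference between the two clauses:

#include <omp.h>
#include <stdio.h>

int main(void)
{
    int counter = 0;   /* one copy, visible to every thread */
    int offset = 10;   /* each thread gets its own copy, initialised to 10 */

    #pragma omp parallel shared(counter) firstprivate(offset) num_threads(4)
    {
        offset += omp_get_thread_num();   /* modifies only this thread's private copy */
        #pragma omp atomic
        counter++;                        /* every thread updates the same shared variable */
    }

    /* counter equals the number of threads in the team; offset is still 10,
       because the threads only ever touched their private copies */
    printf("counter = %d, offset = %d\n", counter, offset);
    return 0;
}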

Looking for other regions, we notice that the for loop used to stitch the odd and even subarrays back together is another good candidate. There is one catch: the running update omega = omegaN*omega carries a dependency from one iteration to the next, so before parallelizing we compute each twiddle factor directly from the loop index instead. With that change the iterations are completely independent of each other. Woohoo! Another bit of the code is parallelizable! We use the OpenMP parallel for construct here:

#pragma omp parallel for schedule(static)
for(int i=0; i<(len/2); i++)
{
    dcom omega = cexp(2*PI*I*i/len);   /* i-th twiddle factor, computed independently per iteration */
    outputArray[i] = evenSubArray[i] + omega * oddSubArray[i];
    outputArray[i+(len/2)] = evenSubArray[i] - omega * oddSubArray[i];
}

The schedule clause tells the OpenMP runtime how the loop iterations are to be split among the available threads. The three kinds I tried are static, dynamic and guided. The schedule clause also takes an optional chunk-size parameter; loop iterations are divided into chunks of that size (for dynamic and guided the default chunk size is 1). I'll briefly describe each kind, with a small sketch of the syntax after this list:

Static scheduling: The chunks are assigned to threads up front, before the loop runs. If no chunk size is given, the iterations are divided into roughly equal contiguous blocks, one per thread.

Dynamic scheduling: Chunks are handed out to threads at run time; as soon as a thread finishes its chunk, it is assigned the next one.

Guided scheduling: Like dynamic scheduling, chunks are assigned to threads at run time, but the chunk size starts large and shrinks as the loop progresses, down to the specified minimum.
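
The syntax for all three is the same; only the keyword (and the optional chunk size, 64 in these examples) changes:

/* Chunks of 64 iterations, dealt out to the threads round-robin before the loop starts. */
#pragma omp parallel for schedule(static, 64)
for(int i=0; i<(len/2); i++) { /* ... */ }

/* Chunks of 64 iterations, handed out on demand as threads become free. */
#pragma omp parallel for schedule(dynamic, 64)
for(int i=0; i<(len/2); i++) { /* ... */ }

/* Chunk sizes start large and shrink as the loop progresses, down to a minimum of 64. */
#pragma omp parallel for schedule(guided, 64)
for(int i=0; i<(len/2); i++) { /* ... */ }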

One might expect dynamic and guided scheduling to be faster than static scheduling. But when I ran the code on my system, they took slightly longer than the statically scheduled version. Here's the output from my terminal:

[terminal output comparing the static, dynamic and guided runs]

The reason static scheduling wins here is that every iteration of this loop does the same amount of work, so the load is already perfectly balanced across threads. Dynamic and guided scheduling shine when the per-iteration workload is irregular, and the bookkeeping they do to hand out chunks at run time adds overhead that only slows a regular loop down. Hence, for these inputs, I have stuck with static scheduling.
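
If you want to reproduce the comparison, a simple way is to wrap the loop with omp_get_wtime() and swap the schedule clause between runs. A minimal sketch, assuming the same dcom and PI definitions as in the earlier snippets (time_butterfly is just a hypothetical helper, not part of my code):

#include <complex.h>
#include <omp.h>

#define PI 3.14159265358979323846
typedef double complex dcom;

/* Times one run of the butterfly loop; swap the schedule clause between runs to compare. */
double time_butterfly(dcom * outputArray, dcom * evenSubArray, dcom * oddSubArray, int len)
{
    double start = omp_get_wtime();

    #pragma omp parallel for schedule(static)
    for(int i=0; i<(len/2); i++)
    {
        dcom omega = cexp(2*PI*I*i/len);
        outputArray[i] = evenSubArray[i] + omega * oddSubArray[i];
        outputArray[i+(len/2)] = evenSubArray[i] - omega * oddSubArray[i];
    }

    return omp_get_wtime() - start;   /* elapsed wall-clock seconds */
}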

Now that we have successfully parallelized the function, let’s look at the results. I compare the results of the serial code against the parallel one on inputs of varying sizes.

Experiment details:

Processor: Intel(R) Core(TM) i7-4710MQ CPU @ 2.50GHz
OS : Ubuntu 16.04
Cores and threads: 4 cores and 8 threads
Input size: Integer arrays of varying size, with elements of type int drawn from [-1000, 1000]
Time taken is the average of three executions
Input Size    Serial time (ms)    Serial real/user/sys                Parallel time (ms)    Parallel real/user/sys
1048576       692.405687          0m0.789s / 0m0.708s / 0m0.080s      1029.884810           0m1.244s / 0m1.944s / 0m0.408s
4194304       2891.290853         0m3.264s / 0m2.980s / 0m0.280s      3713.782196           0m4.101s / 0m6.496s / 0m1.600s
8388608       5937.679018         0m6.741s / 0m6.188s / 0m0.552s      7240.522231           0m8.013s / 0m12.808s / 0m3.012s
16777216      81532.671494        1m24.029s / 0m13.180s / 0m1.956s    21720.236927          0m29.189s / 0m27.300s / 0m6.604s

Results:

For the smaller inputs, the parallel code actually takes slightly longer than the serial code. The overhead of creating and coordinating threads at every level of the recursion is substantial at these sizes, which shows in the timings: the parallel runs burn considerably more CPU (user) time than the serial ones, yet their wall-clock (real) time barely improves or even gets worse. As the input grows, say to 2^24 elements as in my last run, the parallel version comes out almost 4 times faster than the serial one; this is where the power of the parallel code kicks in. I was limited to this input size, as any further increase led to my system behaving abnormally.

I've realised that OpenMP offers simple constructs that most beginners can get comfortable with fairly quickly. It also has a large community around it, which makes it an ideal choice for practising parallel computing.

The parallel code for the same can be found here. Until next time then.
