Loop carried dependencies: OpenMP’s Kryptonite?

Over the past few weeks I’ve been going through some parallel computing (or OpenMP) blogs and I’ve noticed that not a lot of them discuss how OpenMP can be used to tackle loop carried dependencies. Beginners (or at least that’s what I infer from the looks of their code) tend to either skip parallelizing such loops or avoid them altogether. Hence, I thought it would be a good idea to write an article describing how such loops can be tackled with OpenMP and show how easy it is to do so.

I’ve picked a simple for loop for this exercise. Let’s track an employee’s salary as it gets a raise every year (apologies for my lack of creativity 😛). The following code snippet depicts this:

double multiplier = 1.5;
double salary = 100.0;
double raise[N+1];
for (int i = 0; i <= N; ++i)
{
    raise[i] = salary;
    salary *= multiplier;
}

Notice that each iteration of the loop depends on the value of the salary variable from the previous iteration. So now comes the question, can OpenMP be used to parallelize this loop? Or is this where the power of OpenMP is no longer of any use to us? Let’s find out.

On closer examination of this code snippet, we see that the loop dependency is breakable, and the fix is straightforward: the salary in iteration n is just the original salary multiplied by multiplier raised to the power n, so each iteration can be computed independently. We can make use of OpenMP’s private and lastprivate clauses to handle the loop variables. Here’s the snippet after the modification, followed by an explanation:
double multiplier = 1.5;
double Salary, origSalary = 100.0;
double raise[N+1];
int n;

raise[0] = origSalary;
#pragma omp parallel for private(n) lastprivate(Salary)
for (n = 1; n <= N; ++n)
{
    Salary = origSalary * pow(multiplier, n);   /* closed form removes the dependency; needs <math.h> */
    raise[n] = Salary;
}
Salary *= multiplier;

Voila! The loop with a dependency has been parallelized. Well, sort of: I added a new variable and introduced quite an expensive operation (pow), but hey, I parallelized it. OpenMP’s private and lastprivate clauses specify how variables behave within each thread. Marking a variable private gives each thread its own storage for that variable while in the parallel region; when the parallel region ends, that storage is freed and the private copies no longer exist. The lastprivate clause additionally preserves the final value: the thread that executes the sequentially last iteration copies its value of the variable back to the shared variable seen by the master thread.
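
To make that concrete, here is a minimal, self-contained sketch of my own (a toy loop, not taken from the snippets above) showing that after the loop a lastprivate variable holds the value from the sequentially last iteration:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int i, x = -1;                 /* the starting value of x is never seen inside the loop */
    #pragma omp parallel for private(i) lastprivate(x)
    for (i = 0; i < 8; ++i)
    {
        x = i * i;                 /* each thread writes to its own private copy of x */
    }
    printf("x = %d\n", x);         /* prints 49: the value from the last iteration (i == 7) */
    return 0;
}

Compile with gcc -fopenmp; whichever thread happens to run i == 7 is the one whose copy of x survives the region.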

Now let us look at a more efficient implementation of the same code:

double multiplier = 1.5;
double Salary, origSalary = 100.0;
double raise[N+1];
int n;
int prevn = -2;                      /* sentinel: no previous iteration handled yet */

#pragma omp parallel for private(n) firstprivate(prevn) lastprivate(Salary) schedule(static)
for (n = 0; n <= N; ++n)
{
    if (prevn != n - 1)              /* first iteration of this thread's contiguous block */
    {
        Salary = origSalary * pow(multiplier, n);
    }
    else                             /* consecutive iteration: a single multiply suffices */
    {
        Salary *= multiplier;
    }
    raise[n] = Salary;
    prevn = n;
}
Salary *= multiplier;

By explicitly keeping track of the previous value of n in each thread, we avoid calling the expensive power function on every iteration, which saves a lot of time. The static schedule hands each thread one contiguous block of the iteration space, so only the first iteration of a thread’s block pays for a pow call; every subsequent iteration just multiplies the previous value. The firstprivate clause is similar to private: each thread gets its own copy of the variable, but the copy is initialised with the value the variable had before the parallel construct (here, -2). I have spoken about scheduling in my previous post.

Hence, the loop with a carried dependency has successfully been parallelized. Or, not? On running these three snippets along with some supporting code and with N values around 100000, we see the following output. Serial, para1 and para2 refer to the first, second and third code snippets, respectively.

[Screenshot: timing output for the serial, para1 and para2 versions]

Well, oops. It turns out that the parallel versions take a lot longer than the serial code. At least the optimization worked: para2 beats para1, but it is still well behind its serial counterpart. The extra time in the parallel versions comes from the overhead of creating and managing the threads, and in para1 this is compounded by the expensive power function. I suspect this picture would change as the input gets a lot larger.
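
For reference, the supporting code is essentially a timing wrapper. Here is a minimal, self-contained sketch (an illustrative harness of my own, not the exact code behind the numbers above) that times the serial snippet using OpenMP’s wall-clock timer omp_get_wtime(); the parallel versions can be timed the same way:

#include <stdio.h>
#include <omp.h>

#define N 100000

int main(void)
{
    static double raise[N + 1];
    double multiplier = 1.5, salary = 100.0;

    double start = omp_get_wtime();            /* wall-clock timestamp */
    for (int i = 0; i <= N; ++i)
    {
        raise[i] = salary;
        salary *= multiplier;
    }
    double elapsed_ms = (omp_get_wtime() - start) * 1000.0;

    /* Printing an element keeps the compiler from optimising the loop away. */
    printf("serial: %f ms (raise[100] = %e)\n", elapsed_ms, raise[100]);
    return 0;
}

Build with gcc -fopenmp so that omp_get_wtime() is available even for the serial version.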

Conclusion

Loop carried dependencies up against parallelism may look like an unstoppable force meeting an immovable object. But sometimes there are workarounds which will get the job done. However, a workaround alone will not guarantee good performance. Hence, what we learnt today is that sometimes the best optimization in parallel programming is … to not parallelize at all. I feel this is a pretty decent lesson to learn. Until next time then.


Making Fast Fourier Transforms a lot faster

During my advanced algorithms course last semester, I learnt about fast Fourier transforms and how they can be implemented recursively. A fast Fourier transform computes the discrete Fourier transform of a sequence by exploiting the special properties of the complex roots of unity, which makes it a lot faster than the straightforward method of computing the DFT.
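
Concretely, if E and O are the DFTs of the even- and odd-indexed halves of an n-element input (each of length n/2), the full transform y is stitched together as

y[k]       = E[k] + w^k * O[k]
y[k + n/2] = E[k] - w^k * O[k],    for k = 0, ..., n/2 - 1,

where w = e^(2*pi*i/n) is the principal n-th root of unity (CLRS’s convention, and the one the code below uses).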

The approach is quite simple. I’ve followed the algorithm provided in the book Introduction to Algorithms, CLRS (which is a must-read if you’re an algorithms enthusiast like me!), and implemented it in C. The only new thing I learnt about was the ‘complex’ data type that C offers. The code can be found here.
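
For anyone who hasn’t met it before, here is a tiny standalone sketch of my own showing C99’s complex support: complex.h provides the double complex type, the imaginary unit I, and functions such as cexp, creal, cimag and cabs:

#include <stdio.h>
#include <complex.h>

#define PI 3.14159265358979323846

int main(void)
{
    double complex w = cexp(2 * PI * I / 8);    /* a primitive 8th root of unity */
    printf("w = %f + %fi, |w| = %f\n", creal(w), cimag(w), cabs(w));
    return 0;
}

Compile with something like gcc example.c -lm.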

Coming to the parallel implementation of the same code: since I was recently introduced to OpenMP, I thought, why not try experimenting with it? The first step is to profile the code and look for the most troublesome (and potentially parallelizable) parts. GNU gprof is a very useful profiling tool that breaks down how much time each part of the program takes to execute for the input provided. After profiling my serial code, here’s what I got:
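
For completeness, the usual gprof workflow looks roughly like this (fft.c here is a placeholder name for my source file):

gcc -pg fft.c -o fft -lm      # -pg adds profiling instrumentation; -lm links the math library
./fft                         # run the program normally; this writes gmon.out in the current directory
gprof ./fft gmon.out          # print the flat profile and call graph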

[Screenshot: gprof flat profile of the serial FFT code]

As we can see, the code spends a lot of its time in the fastFourierTrans function (almost 60%!). The other regions of the code take a negligible amount of time in comparison. Hence, we look inside this function for potentially parallelizable snippets. This is the function:

dcom * fastFourierTrans(dcom * inputArray, int len)
{
    if (len == 1) {
        return inputArray;              /* a length-1 sequence is its own DFT */
    }

    dcom omega  = 1;
    dcom omegaN = cexp(2*PI*I/len);     /* the fourier coefficient (principal len-th root of unity) */

    /* recursively transform the even- and odd-indexed halves */
    dcom * evenSubArray = fastFourierTrans(split_array(inputArray, len, 0), len/2);
    dcom * oddSubArray  = fastFourierTrans(split_array(inputArray, len, 1), len/2);

    dcom * outputArray = malloc(len * sizeof(dcom));

    /* butterfly step: combine the two half-transforms */
    for (int i = 0; i < (len/2); i++)
    {
        outputArray[i]           = evenSubArray[i] + omega * oddSubArray[i];
        outputArray[i + (len/2)] = evenSubArray[i] - omega * oddSubArray[i];
        omega = omegaN * omega;
    }

    free(evenSubArray);
    free(oddSubArray);
    return outputArray;
}

We notice that the two recursive calls are independent of each other and can be performed in parallel. For this we make use of OpenMP’s sections construct, which distributes its blocks among the threads of the team. The requirement is that each block must be independent of the others; each block is then executed exactly once, by one thread. We specify this as:
dcom * evenSubArray, * oddSubArray;   /* declared before the construct so both sections (and the code after) can use them */
#pragma omp parallel sections shared(inputArray) firstprivate(len)
{
    #pragma omp section
    {
        evenSubArray = fastFourierTrans(split_array(inputArray, len, 0), len/2);
    }
    #pragma omp section
    {
        oddSubArray = fastFourierTrans(split_array(inputArray, len, 1), len/2);
    }
}

The shared clause of OpenMP indicates that the variable is shared among the threads created. The firstprivate clause specifies that each thread should have its own instance of a variable, initialised with the value it had before the parallel construct, which in this case is the value of len passed as an argument.

Looking for other regions, we notice that the for loop used to stitch the even and odd sub-arrays back together is nearly independent across iterations: the only loop-carried part is the running update of omega, and that dependency can be broken by letting each iteration compute its own twiddle factor (omega raised to the power i). Woohoo! Another bit of the code is parallelizable! We use the OMP PARALLEL FOR construct here to parallelize the loop as:

#pragma omp parallel for schedule(static)
for (int i = 0; i < (len/2); i++)
{
    dcom omega = cpow(omegaN, i);    /* each iteration computes its own twiddle factor, removing the dependency */
    outputArray[i]           = evenSubArray[i] + omega * oddSubArray[i];
    outputArray[i + (len/2)] = evenSubArray[i] - omega * oddSubArray[i];
}

The schedule clause tells the runtime how the loop iterations are to be split among the available threads. There are three main scheduling kinds: static, dynamic and guided. The schedule clause also takes an optional chunk size; loop iterations are divided into chunks of that size (for dynamic and guided the chunk size defaults to 1, while plain static divides the iterations into roughly equal contiguous blocks). I’ll briefly describe each kind, with a small example after the list:

Static scheduling: the chunks are assigned to threads in a fixed order, decided before the loop runs. If a chunk size isn’t given, the iterations are divided into contiguous blocks, roughly one per thread.

Dynamic scheduling: the chunks are handed to threads at run time, i.e., once a thread finishes its chunk, it is assigned another one.

Guided scheduling: like dynamic scheduling, blocks of iterations are assigned to threads until none are left, but the block size starts large and decreases with each block assigned.
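
As a quick illustration (a toy program of my own, not part of the FFT code), the three kinds are selected like this; running it and comparing the output shows how each kind distributes iterations over threads:

#include <stdio.h>
#include <omp.h>

#define N 16

int main(void)
{
    #pragma omp parallel for schedule(static, 4)     /* fixed round-robin chunks of 4 */
    for (int i = 0; i < N; i++)
        printf("static  : iteration %2d -> thread %d\n", i, omp_get_thread_num());

    #pragma omp parallel for schedule(dynamic, 4)    /* free threads grab the next chunk of 4 */
    for (int i = 0; i < N; i++)
        printf("dynamic : iteration %2d -> thread %d\n", i, omp_get_thread_num());

    #pragma omp parallel for schedule(guided, 4)     /* chunk sizes shrink towards 4 */
    for (int i = 0; i < N; i++)
        printf("guided  : iteration %2d -> thread %d\n", i, omp_get_thread_num());

    return 0;
}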

One might expect dynamic and guided scheduling to be faster than static scheduling here. But when I ran the code on my system, they took slightly longer than the statically scheduled version. Here’s the output from my terminal:

[Screenshot: timings for the static, dynamic and guided schedules]


The reason static scheduling works faster here is that the workload per iteration is the same across threads, so a static split is already balanced. Dynamic and guided scheduling work much better when the workload being distributed is irregular, and they add some run-time overhead of their own, which can slow things down. Hence, for these inputs, I have stuck with static scheduling.

Now that we have successfully parallelized the function, let’s look at the results. I compare the results of the serial code against the parallel one on inputs of varying sizes.

Experiment details:

Processor: Intel(R) Core(TM) i7-4710MQ CPU @ 2.50GHz
OS : Ubuntu 16.04
Cores and threads: 4 cores and 8 threads
Input size: Varying size integer arrays of elements of type int between [-1000, 1000]
Time taken is the average of three executions
Input Size   Version    Execution Time (ms)   Real        User        Sys
1048576      Serial     692.405687            0m0.789s    0m0.708s    0m0.080s
             Parallel   1029.884810           0m1.244s    0m1.944s    0m0.408s
4194304      Serial     2891.290853           0m3.264s    0m2.980s    0m0.280s
             Parallel   3713.782196           0m4.101s    0m6.496s    0m1.600s
8388608      Serial     5937.679018           0m6.741s    0m6.188s    0m0.552s
             Parallel   7240.522231           0m8.013s    0m12.808s   0m3.012s
16777216     Serial     81532.671494          1m24.029s   0m13.180s   0m1.956s
             Parallel   21720.236927          0m29.189s   0m27.300s   0m6.604s

Results:

Initially, we notice that the parallel code takes slightly longer than the serial code. This is because the overhead of creating and coordinating the threads is quite substantial for such inputs, which shows up in the parallel runs as user time well above the real (wall-clock) time. As the input grows to pow(2, 24) elements, as in my last input, the parallel code becomes almost 4 times faster than the serial one. It is at this point that the power of the parallel code kicks in; here the user time is no longer much larger than the real time. I was limited to this input size, as any further increase led to my system behaving abnormally.

I’ve realised that OpenMP has simple constructs that most beginners can comfortably get accustomed to. It also has a wide community, which makes it an ideal choice for beginners to practice parallel computing with.

The parallel code for the same can be found here. Until next time then.

Parallelized Strassen’s & Divide and Conquer for Matrix Multiplication

Problem Statement: To get the product of two matrices having elements of the data type float, using both Strassen’s algorithm as well as the Divide and Conquer algorithm.

Aim: To parallelize both the above mentioned algorithms using POSIX threads and check their improvements over the serial version. 

Details: The code has been written by me in C language and both the algorithms have been implemented by following the book, Introduction to Algorithms (CLRS).

The divide and conquer approach is a general algorithm design paradigm which can be applied to matrix multiplication by splitting each matrix into four quadrants and multiplying them recursively. The exact approach taken can be seen in my code.
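
For reference, the block decomposition (my summary of the standard scheme, as in CLRS) computes the quadrants of the product C = A x B from the quadrants of A and B with eight half-sized multiplications:

C11 = A11*B11 + A12*B21
C12 = A11*B12 + A12*B22
C21 = A21*B11 + A22*B21
C22 = A21*B12 + A22*B22

It is these eight recursive multiplications that the parallel version later hands out to separate threads.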

Strassen’s algorithm is a pretty smart algorithm which performs the same operation with only seven recursive multiplications instead of eight, making it asymptotically faster (roughly O(n^2.81) versus O(n^3)) than the straightforward ijk multiplication for large matrices.
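
For reference (again my summary; CLRS names the intermediate sums S1..S10 and products P1..P7, whereas the common M1..M7 naming is used here), the seven products and the combinations that rebuild C are:

M1 = (A11 + A22) * (B11 + B22)
M2 = (A21 + A22) * B11
M3 = A11 * (B12 - B22)
M4 = A22 * (B21 - B11)
M5 = (A11 + A12) * B22
M6 = (A21 - A11) * (B11 + B12)
M7 = (A12 - A22) * (B21 + B22)

C11 = M1 + M4 - M5 + M7
C12 = M3 + M5
C21 = M2 + M4
C22 = M1 - M2 + M3 + M6

These are the seven multiplications that the parallel version below assigns to seven threads.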

My code for the serial implementation of divide and conquer can be found here and the serial implementation of Strassen’s can be found here.

Hotspots in the code:

After analysing each of the serial codes, it is very apparent that the parallelizable hotspots are the functions which are recursively called. And hence, I have gone after these functions with the power of POSIX threads.

Approach: 

POSIX threads, or Pthreads, are created with the pthread_create() function. It takes a pointer to a pthread_t that receives the thread ID, the thread attributes (default attributes are used if this field is NULL), the routine to be executed by the thread, and a single argument for that routine.
The created threads are collected by the master program with pthread_join(), which takes the ID of the thread to be collected and a pointer to a void * in which the thread’s return value is stored.
The routine executed by a thread accepts a single void * parameter and returns a void *. Hence, in order to convert my serial recursive functions into such routines, I had to create explicit structures that can hold multiple matrices. These are defined as mat (holds two matrices and their dimension) and matS (holds a single matrix and its dimension). Pairs of matrices (the pairs are decided by how the algorithm groups them for specific operations) are stored in a mat structure and cast to a void pointer before being passed to the thread routine through the pthread_create call. The thread routines are recursive functions which perform the necessary division of the elements passed to them and call themselves until the base condition is hit. The resultant matrix returned by each thread routine is cast back to the appropriate structure after the call to pthread_join and used for further operations. A minimal, self-contained sketch of this create/join pattern is shown right after this list; any additional details can be found in the comments in the code.
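
Here is that sketch. The structures pair_t and result_t are hypothetical stand-ins for the mat and matS structures, and add_pair stands in for the recursive remul routine; the point is only to show the cast-to-void*, create, join, cast-back pattern:

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

/* Hypothetical argument/result types standing in for the post's mat/matS structures. */
typedef struct { int a, b; } pair_t;
typedef struct { int sum;  } result_t;

/* Thread routine: takes a void *, returns a void * (here, a heap-allocated result). */
static void * add_pair(void * arg)
{
    pair_t * p = (pair_t *) arg;
    result_t * r = malloc(sizeof *r);
    r->sum = p->a + p->b;
    return (void *) r;
}

int main(void)
{
    pthread_t tid;
    pair_t input = { 20, 22 };
    void * status;

    /* Fire the thread, passing the argument structure as a void pointer. */
    pthread_create(&tid, NULL, add_pair, (void *) &input);

    /* Collect the thread; its return value lands in 'status'. */
    pthread_join(tid, &status);

    result_t * out = (result_t *) status;
    printf("sum = %d\n", out->sum);
    free(out);
    return 0;
}

Build with gcc sketch.c -pthread.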

Strassen’s: The seven intermediate multiplications in Strassen’s algorithm are split up and performed by individual threads. The threads are collected in the same sequence in which they were created, and each returns its intermediate product. These products are then plugged into the algorithm’s formulae to produce the final product. To achieve this implementation:

The function void * remul(void * mat) is defined as the thread routine. The appropriate pairs of matrices are grouped together by following the algorithm and passed to the thread routine by a call to
pthread_create(&tid[0], NULL, remul, (void*) m1);

The function calls are replaced in the serial code accordingly. For example, the function call in the serial code
float ** P1 = multiply(A11, S1, N);
is converted into
matS * p1 = remul((void *) m1);

The remul function returns a void pointer to the resultant matrix, which is then cast back to the appropriate structure type (matS) using a call to
pthread_join(tid[0], &stat1);
where stat1 now points to the resultant matrix from the routine executed by the collected thread (the one stored in tid[0] in this case).

Divide & Conquer: As per the algorithm, the multiplication is split into 8 sub-multiplications. Each of these is treated as a separate task and handed to an individual thread. The intermediate products are then combined to produce the final product. To achieve this implementation:

The function void * remul(void * M) is defined as the thread routine. As in the case of Strassen’s, the appropriate matrices are paired together by following the algorithm and passed to the thread routine by a call to
pthread_create(&tid[0], NULL, remul, (void*) m1);

The function calls are replaced in the serial code accordingly. For example, the function call in the serial code
float ** t1 = multiply(A11, B11, (N));
is converted into
matS * t1 = remul((void *) m1);

As with Strassen’s, the remul function returns a pointer to the resultant matrix, which is then cast back to the appropriate structure type (matS) using a call to pthread_join(tid[0], &stat1);
where stat1 points to the resultant matrix from the routine executed by the collected thread.

The rest of the code follows the serial algorithm to get the job done. I have executed and compared the code with multiple input sizes. The experiment details and test results are specified below.

Experiment specifications: 

Processor: Intel(R) Core(TM) i7-4710MQ CPU @ 2.50GHz
OS : Ubuntu 16.04
Cores and threads: 4 cores and 8 threads
Input size: 256 x 256 square matrix of elements of type float
Time taken is the average of three executions

Results:

Algorithm            Version    Clock Time    Real       User        Sys
Divide and Conquer   Serial     2966267       0m2.995s   0m2.992s    0m0.000s
                     Parallel   14337583      0m2.016s   0m13.292s   0m0.184s
Strassen’s           Serial     2099798       0m2.175s   0m2.172s    0m0.000s
                     Parallel   8848305       0m1.622s   0m7.272s    0m1.660s

Observations:

  • Unfortunately, going by the clock (CPU) time above, the parallel versions of both approaches cost a lot more than their serial counterparts.
  • In the case of the divide and conquer approach, the clock time grows to about 5 times that of the serial implementation, and in the case of Strassen’s algorithm to about 4 times.

Why could this have happened? 

After reading through a couple of resources, I found out that creating and collecting threads sequentially may lead to particular threads having to wait too long before their collection or may even lead to deadlock conditions.

The pthread_join function is for threads what wait() is for processes. Since I call the pthread_join functions in sequence, specifying the thread I expect next, the program may end up waiting on a particular thread for too long, which keeps threads that have already finished from being joined. For example, say I create 8 threads, with their thread IDs going from 1 up to 8; on my quad-core machine (8 hardware threads), all 8 can run at once. Say all 8 threads are fired one after the other. Depending on how the scheduler treats them, some threads may finish a lot sooner than others. For instance, threads 1 through 4 and 6 through 8 may all return after some amount of time, while thread 5 takes twice as long. Since I call pthread_join sequentially, the call to pthread_join(tid[5], &stat6); (which collects thread 5’s return value in this numbering) halts the master program until thread 5 finishes. Hence, even though threads 6 through 8 have already returned, the master program cannot collect their return values until it has collected thread 5’s. This has a major negative impact on the performance of the program, and it is why we see the ugly numbers in the table above.

One way to avoid this problem would be to keep track of which threads have finished executing and collect only those; this can be implemented by making each thread explicitly signal the master thread when it is done, so that the master issues a join only for that particular thread. Another workaround is to have the threads write their results into a shared data structure (which in this case would be the product matrix). That way no return value has to be passed back through pthread_join, and the joins serve only as a completion barrier before the master reads the result. I feel this would be the fastest approach. Hope this helped.
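
A minimal, self-contained sketch of that second workaround (with hypothetical names: square_task stands in for the matrix-multiplying routine, and results for the shared product matrix) could look like this:

#include <stdio.h>
#include <pthread.h>

#define NTASKS 4

/* Shared output: each thread writes only to its own slot, so no locking is needed. */
static int results[NTASKS];

typedef struct { int index; int value; } task_t;

static void * square_task(void * arg)
{
    task_t * t = (task_t *) arg;
    results[t->index] = t->value * t->value;   /* write directly into the shared structure */
    return NULL;                               /* nothing to marshal back through pthread_join */
}

int main(void)
{
    pthread_t tid[NTASKS];
    task_t tasks[NTASKS];

    for (int i = 0; i < NTASKS; ++i) {
        tasks[i] = (task_t){ .index = i, .value = i + 1 };
        pthread_create(&tid[i], NULL, square_task, &tasks[i]);
    }

    /* The joins now act purely as a completion barrier; no return values are
     * collected, since each result is already sitting in the shared array. */
    for (int i = 0; i < NTASKS; ++i)
        pthread_join(tid[i], NULL);

    for (int i = 0; i < NTASKS; ++i)
        printf("results[%d] = %d\n", i, results[i]);
    return 0;
}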

Github code for parallel divide and conquer can be found here.

Github code for parallel strassen’s can be found here.

Until next time.

The who, the what and the why?

Hello there!

I’m Supreeth, currently pursuing a Computer Science major at PES University. This blog reflects my thoughts, interests and experiments as I proceed through the Parallel Computing course at my university. Through this blog, I hope to share and gain knowledge while having fun doing so. Let me share a quote with you, which we will soon prove wrong.

“We stand at the threshold of a many core world. The hardware community is ready to cross this threshold. The parallel software community is not.” (Tim Mattson, principal engineer at Intel)

Let’s begin, shall we?