Loop-carried dependencies: OpenMP’s Kryptonite?

Over the past few weeks I’ve been going through some parallel computing (or OpenMP) blogs, and I’ve noticed that not many of them discuss how OpenMP can be used to tackle loop-carried dependencies. Beginners (or at least that’s what I infer from the looks of their code) tend to either skip parallelizing such loops or avoid them altogether in their code. Hence, I thought it would be a good idea to write an article describing how such loops can be tackled with OpenMP and to show how easy it is to do so.

I’ve picked a simple for loop for this exercise. Let’s talk about a raise in an employee’s salary (apologies for my lack of creativity 😛 ). The following code snippet depicts this:

double multiplier = 1.5;
double salary = 100.0;
double raise[N+1];
for (int i = 0; i <= N; ++i)
{
    raise[i] = salary;      // record this year's salary
    salary *= multiplier;   // next year's salary depends on this year's
}

Notice that each iteration of the loop depends on the value of the salary variable from the previous iteration. So now comes the question: can OpenMP be used to parallelize this loop, or is this where the power of OpenMP is no longer of any use to us? Let’s find out.
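Just for contrast (this snippet isn’t from the original post), here’s what a naive attempt would look like: simply slapping a parallel for on the loop. Since salary is shared and every iteration both reads and writes it, the threads race on it and the results are wrong.

#pragma omp parallel for
for (int i = 0; i <= N; ++i)
{
    raise[i] = salary;      // which value of salary this reads depends on timing
    salary *= multiplier;   // data race: salary is shared across all threads
}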

On closer examination of the original loop, we see that the dependency is breakable, and the solution is pretty straightforward: each raise[i] is just the starting salary multiplied by multiplier i times, so every iteration can be computed independently. We can make use of OpenMP’s private and lastprivate clauses to express this. Here’s the snippet after the modification, followed by an explanation:
double multiplier = 1.5;
double Salary, origSalary = 100.0;
double raise[N+1];
int n;
raise[0] = origSalary;
#pragma omp parallel for private(n) lastprivate(Salary)
for (n = 1; n <= N; ++n)
{
    Salary = origSalary * pow(multiplier, n);   // closed form; needs <math.h>
    raise[n] = Salary;
}
Salary *= multiplier;   // reproduce the serial code's final value of salary

Voila! The loop with a dependency has been parallelized. Well, sorta. I added a new variable to the code and introduced quite an expensive operation (a pow call per iteration). But hey, I parallelized it. OpenMP’s private and lastprivate clauses specify how variables behave within each thread. Marking a variable private means each thread gets its own storage for that variable while inside the parallel region; when the region ends, that storage is freed and the private copies no longer exist. The lastprivate clause additionally preserves the final value: the thread that executes the sequentially last iteration copies its value of the variable back to the original variable, so Salary holds the value from iteration N after the loop.
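To make the semantics of these two clauses concrete, here’s a minimal standalone sketch of my own (not from the original post):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int scratch = 42;
    int last = -1;
    #pragma omp parallel for private(scratch) lastprivate(last)
    for (int i = 0; i < 8; ++i)
    {
        scratch = i * i;   // each thread writes only its own private copy
        last = i;          // lastprivate: the value from i == 7 is copied out
    }
    printf("last    = %d\n", last);     // prints 7
    printf("scratch = %d\n", scratch);  // private copies were thrown away; still 42
    return 0;
}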

Now let us look at a more efficient implementation of the same code:

double multiplier = 1.5;
double Salary, origSalary = 100.0;
double raise[N+1];
int n;
int prevn = -2;   // sentinel: no previous iteration handled yet
#pragma omp parallel for private(n) firstprivate(prevn) lastprivate(Salary) schedule(static)
for (n = 0; n <= N; ++n)
{
    if (prevn != n - 1)
    {
        // first iteration of this thread's chunk: compute from scratch
        Salary = origSalary * pow(multiplier, n);
    }
    else
    {
        // consecutive iteration on the same thread: reuse the previous value
        Salary *= multiplier;
    }
    raise[n] = Salary;
    prevn = n;
}
Salary *= multiplier;

By explicitly keeping track of the previous value of n, each thread computes the power function only for the first iteration of its chunk and simply multiplies for the rest, which saves a lot of time. The static schedule divides the iteration space into contiguous chunks, one per thread, so consecutive iterations within a chunk run on the same thread. The firstprivate clause is similar to private: it gives each thread its own instance of the variable, but initializes that instance with the value the variable had before the parallel construct (here, -2). I have spoken about scheduling in my previous post.
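Here is firstprivate in isolation, another quick sketch of my own (not from the post):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int seed = 10;
    // firstprivate: every thread gets its own copy of seed,
    // initialized to 10, the value it had before the construct.
    #pragma omp parallel firstprivate(seed) num_threads(4)
    {
        seed += omp_get_thread_num();   // modifies the private copy only
        printf("thread %d sees seed = %d\n", omp_get_thread_num(), seed);
    }
    printf("after the region, seed = %d\n", seed);   // still 10
    return 0;
}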

Hence, the loop with a carried dependency has successfully been parallelized. Or has it? Running these three snippets along with some supporting code, with N values around 100000, gives the following output. Serial, para1 and para2 refer to the first, second and third code snippets respectively.

[Timing results: serial vs. para1 vs. para2]

Well, oops. It turns out that both parallel versions take a lot longer than the serial code. At least the optimization worked fine, but even para2 is slower than its serial counterpart. The extra time in the parallel code comes from the overhead of creating and managing the threads, which, combined with the expensive power function, leads to the jump in the time taken. I suspect this would change as the input gets much larger.
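The supporting code isn’t shown here, but a timing harness along these lines (a sketch of my own, using omp_get_wtime) is roughly how such a comparison can be set up:

#include <stdio.h>
#include <omp.h>

#define N 100000

static double raise[N + 1];

int main(void)
{
    double multiplier = 1.5;
    double salary = 100.0;

    double t0 = omp_get_wtime();
    for (int i = 0; i <= N; ++i)   // serial version
    {
        raise[i] = salary;
        salary *= multiplier;
    }
    double t1 = omp_get_wtime();
    printf("serial: %f s\n", t1 - t0);

    // ...time para1 and para2 the same way and compare...
    return 0;
}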

Conclusion

Loop-carried dependencies up against parallelism can look like an unstoppable force meeting an immovable object. But sometimes there are workarounds that will get the job done. A workaround alone, however, will not guarantee good performance. So what we learnt today is that sometimes the best optimization when it comes to parallelism is … not to use it. I feel this is a pretty decent lesson to learn. Until next time then.
