Parallelized Strassen’s & Divide and Conquer for Matrix Multiplication

Problem Statement: To compute the product of two matrices with elements of type float, using both Strassen’s algorithm and the Divide and Conquer algorithm.

Aim: To parallelize both of the above algorithms using POSIX threads and measure their improvement over the serial versions.

Details: I have written the code in C, and both algorithms are implemented by following the book Introduction to Algorithms (CLRS).

The divide and conquer approach is an algorithm design paradigm that can be applied to matrix multiplication: each matrix is split into four quadrants, the quadrants are multiplied recursively, and the partial products are added to form the result. The exact approach taken can be seen in my code.
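
Concretely, with each matrix split into four quadrants, the product is built from eight quadrant multiplications and four additions (this is the standard decomposition from CLRS):

C11 = A11*B11 + A12*B21
C12 = A11*B12 + A12*B22
C21 = A21*B11 + A22*B21
C22 = A21*B12 + A22*B22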

Strassen’s algorithm performs the same operation more cleverly: it gets away with seven recursive sub-multiplications instead of eight, bringing the complexity down from O(n^3) to roughly O(n^2.81). This makes it a lot faster than the straightforward multiplication algorithm (ijk multiplication) for large matrices.
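
For reference, the straightforward ijk multiplication is just the triple loop below (a minimal sketch using the same float ** matrix layout as the rest of the code; the function name is only for illustration):

/* Naive O(n^3) ijk multiplication: C = A * B for n x n matrices. */
void ijk_multiply(float **A, float **B, float **C, int n)
{
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            float sum = 0.0f;
            for (int k = 0; k < n; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }
    }
}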

My code for the serial implementation of divide and conquer can be found here and the serial implementation of Strassen’s can be found here.

Hotspots in the code:

After analysing each of the serial codes, it is apparent that the parallelizable hotspots are the recursively called multiplication functions. These are therefore the functions I have gone after with the power of POSIX threads.

Approach: 

POSIX threads, or Pthreads, are created using the pthread_create() function. This function takes a pointer to the thread ID, the thread attributes (default attributes are used if this field is NULL), the routine the thread should execute, and a single argument for that routine.
The created threads are collected by the master program by calling pthread_join(), which takes the ID of the thread to be collected and a pointer through which the thread’s return value is handed back.
The routine executed by a thread accepts a single void * parameter and returns a void *. To fit my serial recursive functions into this signature, I created explicit structures that can hold multiple matrices: mat (holds two matrices and their dimension) and matS (holds a single matrix and its dimension). Pairs of matrices (grouped according to how the algorithm combines them for specific operations) are stored in a mat structure and type-cast to a void pointer before being passed to the thread routine through the pthread_create call.

The thread routines are recursive functions: they divide the matrices passed to them and call themselves until the base condition is hit. The resultant matrix returned by each thread routine is type-cast back to the appropriate structure after the call to pthread_join and used for further operations. Any additional details can be found in the comments in the code.
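
To give an idea of the shape of these structures and of the thread routine, here is a minimal sketch (the exact field names and layout in my actual code may differ):

#include <pthread.h>
#include <stdlib.h>

/* Pair of operand matrices plus their dimension, passed into a thread routine. */
typedef struct {
    float **a;   /* first operand                      */
    float **b;   /* second operand                     */
    int     n;   /* dimension of the (square) matrices */
} mat;

/* A single matrix and its dimension, returned by a thread routine. */
typedef struct {
    float **m;
    int     n;
} matS;

/* Thread routine skeleton: unpack the operand pair, recurse, return the product. */
void *remul(void *arg)
{
    mat  *in  = (mat *) arg;
    matS *out = malloc(sizeof(matS));
    out->n = in->n;
    /* ... split in->a and in->b into quadrants and recurse until the
     * base case is hit, storing the product in out->m ... */
    return (void *) out;
}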

Strassen’s: The seven intermediate multiplications in Strassen’s algorithm are split up and performed by individual threads. The threads are collected in the same sequence in which they were created, each returning its intermediate product. These products are then plugged into the formulae of the algorithm to generate the final product. To achieve this implementation:

The function void * remul(void * mat) is defined as the thread routine. The appropriate pairs of matrices are grouped together by following the algorithm and passed to the thread routine by a call to
pthread_create(&tid[0], NULL, remul, (void*) m1);

The function calls are replaced in the serial code accordingly. For example, the function call in the serial code
float ** P1 = multiply(A11, S1, N);
is converted into
matS * p1 = remul((void *) m1);

The remul function returns a void pointer to the resultant matrix, which is then type-cast back to the appropriate structure type (matS) after a call to
pthread_join(tid[0], &stat1);
where stat1 now points to the resultant matrix returned by the routine of the collected thread (the one with index 0 in this case).
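
Putting the pieces together, the seven products are launched and collected roughly as follows. This is only a sketch assuming the mat/matS structures above; the operand array m[] and the helper name strassen_products are illustrative, not the exact names used in my code:

/* Launch one thread per Strassen product, then join in creation order.
 * m[0..6] hold the operand pairs built per the CLRS formulas. */
void strassen_products(mat *m[7], matS *P[7])
{
    pthread_t tid[7];
    void     *stat;

    for (int i = 0; i < 7; i++)
        pthread_create(&tid[i], NULL, remul, (void *) m[i]);

    for (int i = 0; i < 7; i++) {
        pthread_join(tid[i], &stat);   /* blocks until thread i returns */
        P[i] = (matS *) stat;          /* cast the result back to matS  */
    }
    /* P[0]..P[6] correspond to P1..P7 in CLRS and feed the
     * C11, C12, C21, C22 combination formulae. */
}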

Divide & Conquer: As per the algorithm, the multiplication is split into 8 sub-multiplications. Each of these multiplications is treated as a separate task and handed to an individual thread. The intermediate products are then used to generate the final product. To achieve this implementation:

The function void * remul(void * M) is defined as the thread routine. As in the case of Strassen’s, the appropriate matrices are paired together by following the algorithm and passed to the thread routine by a call to
pthread_create(&tid[0], NULL, remul, (void*) m1);

The function calls are replaced in the serial code accordingly. For example, the function call in the serial code
float ** t1 = multiply(A11, B11, (N));
is converted into
matS * t1 = remul((void *) m1);

As with Strassen’s, the remul function returns a pointer to the resultant matrix, which is then type-cast back to the appropriate structure type (matS) after a call to pthread_join(tid[0], &stat1);
where stat1 points to the resultant matrix returned by the routine of the collected thread.

The rest of the code follows the serial algorithm to get the job done. I have executed and compared the code with multiple input sizes. The experiment details and test results are specified below.

Experiment specifications: 

  • Processor: Intel(R) Core(TM) i7-4710MQ CPU @ 2.50GHz
  • OS: Ubuntu 16.04
  • Cores and threads: 4 cores, 8 hardware threads
  • Input size: 256 x 256 square matrices with elements of type float
  • Reported times are the average of three executions

Results:

Divide and Conquer

            Clock Time (clock() ticks)   Real       User        Sys
  Serial    2966267                      0m2.995s   0m2.992s    0m0.000s
  Parallel  14337583                     0m2.016s   0m13.292s   0m0.184s

Strassen’s

            Clock Time (clock() ticks)   Real       User        Sys
  Serial    2099798                      0m2.175s   0m2.172s    0m0.000s
  Parallel  8848305                      0m1.622s   0m7.272s    0m1.660s

Observations:

  • Unfortunately, going by the clock time, the parallel versions of both approaches took a lot longer than their serial counterparts.
  • For the divide and conquer approach, the clock time grows to roughly 5 times that of the serial implementation; for Strassen’s algorithm it grows to roughly 4 times that of the serial implementation.

Why could this have happened? 

After reading through a couple of resources, I found out that creating threads and then collecting them strictly in sequence may leave particular threads waiting a long time before they are collected, or may even lead to deadlock conditions.

The pthread_join function is, for threads, the equivalent of wait() for processes. Since I call the pthread_join functions in sequence, specifying exactly which thread I expect next, the program may end up waiting on one particular thread for too long, which keeps threads that have already finished from being joined.

For example, say I create 8 threads, numbered 1 through 8. My quad-core machine can run at most 8 hardware threads concurrently, so suppose all 8 threads are fired one after the other. Due to scheduling and load imbalance, some threads may return a lot faster than others. For instance, threads 1 through 4 and 6 through 8 may all return after some amount of time, while thread 5 takes twice as long. Since I call pthread_join sequentially, the call to pthread_join(tid[5], &stat6); (which in this example collects thread 5’s return value) halts the execution of the master program until thread 5 finishes. So even though threads 6 through 8 have already returned, the master program cannot collect their return values until it has collected thread number five’s. This phenomenon has a major negative impact on the performance of the program, and hence the ugly numbers in the table above.

One way to avoid this problem would be to keep track of which threads have finished executing and collect only those. This can be implemented by making each thread explicitly signal the master thread when it is done, so that the master issues a join call only for that particular thread. Another workaround I found is to have the threads update a shared data structure (which in this case would be the product matrix). This way, the threads do not return any value, so pthread_join is needed only as a completion barrier rather than to pass results back. I feel that this would be the fastest approach. Hope this helped.
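
A minimal sketch of that second workaround, with illustrative names only: each thread receives its operands plus a pointer to a pre-allocated output matrix of its own (for example, its quadrant of the shared result or a slot in a shared array of intermediate products) and writes its product there directly.

/* Each thread writes its product straight into its own output matrix,
 * so no value is passed back through pthread_join. */
typedef struct {
    float **a, **b;   /* operands of this sub-multiplication            */
    float **c;        /* pre-allocated output matrix for this product   */
    int     n;        /* dimension of the sub-matrices                  */
} task;

void *remul_inplace(void *arg)
{
    task *t = (task *) arg;
    /* ... recursively multiply t->a and t->b, storing the product into
     * t->c; each thread owns its own output, so no locking is needed ... */
    return NULL;
}

The master still waits for the workers to finish (for example, pthread_join with a NULL second argument) before combining the outputs, but no return values are copied around.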

GitHub code for parallel divide and conquer can be found here.

GitHub code for parallel Strassen’s can be found here.

Until next time.
