Last Updated: 2024-12-02 Mon 21:37

CMSC216 Lab12: Matrices and Memory Optimization

CODE DISTRIBUTION: lab12-code.zip

CHANGELOG:

Mon Dec 2 09:32:44 PM EST 2024

More bugs with the testing code were identified which cause repeated test runs to spuriously fail with a "Cannot create directory: file exists" error. This has been corrected in the current code pack. Students can also adjust the test_lab12.org file manually by finding and editing the following line:

  40: >> mkdir -p test-results           ## LINE 40: ADD THE -p OPTION TO mkdir
Mon Dec 2 11:09:14 AM EST 2024

A minor bug was reported in discussion which causes the first run of make test-code to fail with an error like "no such file or directory". Running make test-code a second time completes the test normally. The bug can be corrected by editing test_lab12.org and adding the indicated line:

  ...
  * CODE: csums_threaded Performance
  
  ** Performance
  Runs the ~csums_thread_main~ once and checks that the speed of running
  with 2 threads has sufficient speedup (at least 1.5x). This test is
  flaky in that on any virtualized platform (GRACE, Gradescope) speedup
  will be somewhat unreliable so the bar is set low.
  
  #+TESTY: timeout=25
  #+BEGIN_SRC sh
  >> mkdir test-results                      #### ADD THIS LINE ####
  >> resultfile=test-results/out.tmp
  >> ./csums_thread_main 10000 10000 2 > $resultfile

The line has been added to the current codepack. Since make test is not affected, this bug will not affect results on Gradescope.

1 Rationale

Threads are frequently used to improve the performance of applications when a computation can be subdivided between the threads. This exercise explores doing so in the context of summing elements in a matrix. Completing it will give a basic overview of launching threads, divvying up work, and synchronizing the threads as they contribute to a shared result.

Associated Reading / Preparation

Bryant and O'Hallaron: Ch 12.3-6 covers threaded programming.

Grading Policy

Credit for this exercise is earned by completing the code/answers here and submitting a Zip of the work to Gradescope. Students are responsible for checking that the results produced locally via make test are reflected on Gradescope after submitting their completed Zip. Successful completion earns 1 Engagement Point.

Lab Exercises are open resource/open collaboration and students are encouraged to cooperate on labs. Students may submit work as groups of up to 5 to Gradescope: one person submits then adds the names of their group members to the submission.

See the full policies in the course syllabus.

2 Codepack

The codepack for the lab contains the following files:

File                  State     Description
---------------------------------------------------------------------------
QUESTIONS.txt         EDIT      Questions to answer: fill in the multiple choice selections in this file
csums_thread.h        Provided  Header file defining some types and functions
csums_thread_util.c   Provided  Utility functions to manipulate matrices/vectors
csums_thread_funcs.c  EDIT      Small "algorithms" to time arithmetic operations
csums_thread_main.c   EDIT      Main function that times different summing techniques
Makefile              Build     Enables make test and make zip
QUESTIONS.txt.bk      Backup    Backup copy of the original file to help revert if needed
QUESTIONS.md5         Testing   Checksum for answers in questions file
test_quiz_filter      Testing   Filter to extract answers from Questions file, used in testing
test_lab12.org        Testing   Tests for this lab
testy                 Testing   Test running scripts

3 Overview and Background

This lab again deals with matrices and vectors, similar to a previous lab/HW. Review that material if you need a refresher on data types such as matrix_t and vector_t.

The code focuses on completing two functions, col_sums_threaded() and col_sums_worker(), to produce a threaded column sums implementation.
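If that material is not handy, the sketch below shows the rough shape such row-major matrix/vector types often take. It is illustrative only: the field names and element types shown here are assumptions, and the authoritative definitions for this lab are in the provided csums_thread.h.

  ,----
  | // ILLUSTRATIVE SKETCH ONLY: the real definitions are in csums_thread.h
  | // and may use different field names and element types.
  | typedef struct {
  |   long rows;                  // number of rows
  |   long cols;                  // number of columns
  |   int *data;                  // rows*cols elements in row-major order
  | } matrix_t;
  | 
  | typedef struct {
  |   long len;                   // number of elements
  |   int *data;                  // len elements
  | } vector_t;
  | 
  | // Row-major access: element (i,j) of mat is at mat.data[i*mat.cols + j]
  `----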

4 Basic Overview of Threads

Threads allow the creation of multiple "threads of execution" within the same program. One primary reason to do this is to speed up the execution of a program by utilizing several threads, each of which can be independently scheduled to run. This allows multiple cores/CPUs to be used simultaneously, which may speed up a computation.

The general flow of thread usage is as follows.

  1. A "main" thread will create several worker threads
  2. Worker threads usually begin execution in a function of some sort. While they share most memory with the main thread and each other, each thread often needs to be pointed at some contextual information on what part of the computation it should perform
  3. Worker threads perform their own computations as independently as possible. When they need to coordinate, they will often use a mutex, an operating system supported data structure that enables mutual exclusion of threads: only one thread at a time will perform certain operations to avoid potential problems.
  4. The main thread will wait for all worker threads to complete their tasks then perform any final wrap-up before moving to the next task or reporting answers

This flow is present in the provided code csums_thread_main.c and csums_thread_funcs.c. Carefully examine this flow and the specific function calls used to facilitate it. A small standalone sketch of the same pattern appears below.
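The following sketch is NOT the lab's code: names such as sum_worker and sum_args_t are made up here for illustration. It shows the same create / compute / join pattern: the main thread launches workers with pthread_create(), each worker receives a small context struct telling it which slice of an array to sum, a mutex guards the shared total, and pthread_join() blocks the main thread until every worker finishes.

  ,----
  | // Minimal sketch of the create / compute / join flow (steps 1-4 above).
  | // NOT the lab's code: sum_args_t, sum_worker, etc. are invented names.
  | #include <pthread.h>
  | #include <stdio.h>
  | 
  | #define NTHREADS 4
  | #define LEN 1000000
  | 
  | int data[LEN];                            // shared input array
  | long total = 0;                           // shared result
  | pthread_mutex_t total_lock = PTHREAD_MUTEX_INITIALIZER;
  | 
  | typedef struct {                          // per-thread context: which slice to sum
  |   int begin, end;
  | } sum_args_t;
  | 
  | void *sum_worker(void *arg) {             // 2. each worker starts in this function
  |   sum_args_t *args = (sum_args_t *) arg;  //    cast context back from void*
  |   long my_sum = 0;
  |   for (int i = args->begin; i < args->end; i++) {
  |     my_sum += data[i];                    // 3. independent work, no locking needed
  |   }
  |   pthread_mutex_lock(&total_lock);        //    lock only when touching shared data
  |   total += my_sum;
  |   pthread_mutex_unlock(&total_lock);
  |   return NULL;
  | }
  | 
  | int main(void) {
  |   for (int i = 0; i < LEN; i++) { data[i] = 1; }
  | 
  |   pthread_t threads[NTHREADS];
  |   sum_args_t args[NTHREADS];
  |   int chunk = LEN / NTHREADS;
  |   for (int t = 0; t < NTHREADS; t++) {    // 1. main creates worker threads
  |     args[t].begin = t * chunk;
  |     args[t].end   = (t == NTHREADS-1) ? LEN : (t+1) * chunk;
  |     pthread_create(&threads[t], NULL, sum_worker, &args[t]);
  |   }
  |   for (int t = 0; t < NTHREADS; t++) {    // 4. main blocks until workers finish
  |     pthread_join(threads[t], NULL);
  |   }
  |   printf("total: %ld\n", total);          // expect 1000000
  |   return 0;
  | }
  `----

A standalone program like this would be compiled with something like gcc -pthread -o sum_sketch sum_sketch.c; the numbered comments correspond to steps 1-4 above.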

5 QUESTIONS.txt File Contents

Below are the contents of the QUESTIONS.txt file for the lab. Follow the instructions in it to complete the QUIZ and CODE questions for the lab.

                           _________________

                            LAB12 QUESTIONS
                           _________________





Exercise Instructions
=====================

  Follow the instructions below to experiment with topics related to
  this exercise.
  - For sections marked QUIZ, fill in an (X) for the appropriate
    response in this file. Use the command `make test-quiz' to see if
    all of your answers are correct.
  - For sections marked CODE, complete the code indicated. Use the
    command `make test-code' to check if your code is complete.
  - DO NOT CHANGE any parts of this file except the QUIZ sections as it
    may interfere with the tests otherwise.
  - If your `QUESTIONS.txt' file seems corrupted, restore it by copying
    over the `QUESTIONS.txt.bk' backup file.
  - When you complete the exercises, check your answers with `make test'
    and if all is well, create a zip file with `make zip' and upload it
    to Gradescope. Ensure that the Autograder there reflects your local
    results.
  - IF YOU WORK IN A GROUP only one member needs to submit and then add
    the names of their group members.


OVERVIEW
========

  This lab emphasizes optimizing a matrix computation using two
  techniques.
  1. Ensuring the matrix is accessed in a pattern that is favorable to
     cache.
  2. Employing multiple threads to subdivide the computation for
     parallel execution.

  UNLESS BOTH THESE TECHNIQUES ARE USED, PERFORMANCE WILL BE
  SUBOPTIMAL. Both techniques are discussed as they are both needed for
  an upcoming project. Technique (1) has been previously discussed in an
  assignment but is worth reviewing.


QUIZ: col_sums_A() vs col_sums_B()
==================================

  In the file `example_matsums_funcs.c' which contains the solution to a
  previous lab/HW, what is the primary difference between the two
  functions `col_sums_A() / col_sums_B()'?


(A) Code Differences
~~~~~~~~~~~~~~~~~~~~

  - ( ) One version allocates and uses more memory than the other which
    will likely affect their relative speeds
  - ( ) One version uses a row-major matrix while the other uses a
    column-major matrix meaning that the underlying memory layout of the
    matrices used is different between the functions
  - ( ) They both compute the sums of all columns but traverse the
    matrix in a different order
  - ( ) They compute two different properties of the columns of the
    matrix and therefore necessarily need to traverse the matrix in a
    different order


(B) Speed Differences
~~~~~~~~~~~~~~~~~~~~~

  Between the two versions, which is faster and why?
  - ( ) col_sums_A() is faster as it traverses the elements of the
    matrix with no stride, sequentially visiting memory addresses in a
    pattern that is greatly sped up by CPU caches.
  - ( ) col_sums_B() is faster as it traverses the elements of the
    matrix with no stride, sequentially visiting memory addresses in a
    pattern that is greatly sped up by CPU caches.
  - ( ) col_sums_A() is faster as it allocates less memory overall so
    that there is less "memory cache pressure" and the computation can
    proceed faster
  - ( ) col_sums_B() is faster as it allocates less memory overall so
    that there is less "memory cache pressure" and the computation can
    proceed faster


QUIZ: Thread Startup and Shutdown
=================================

(A) Thread Creation and Completion
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  Examine `csums_thread_main.c' and `csums_thread_funcs.c'. Determine where
  worker threads are created and the function used to do so.
  - ( ) Worker threads are created in `main()' using a version of
    `fork()' with optional arguments that were not previously used to
    create processes.
  - ( ) Worker threads are created in `main()' using the
    `pthread_create()' function which has as one of its arguments a
    function that threads start running
  - ( ) Worker threads are created in `col_sums_threaded()' using a
    version of `fork()' with optional arguments that were not previously
    used to create processes.
  - ( ) Worker threads are created in `col_sums_threaded()' using the
    `pthread_create()' function which has as one of its arguments a
    function that threads start running


(B) Thread Completion
~~~~~~~~~~~~~~~~~~~~~

  Research the function that is used to block a thread until another
  thread completes.  This is used in `csums_thread_funcs.c' to block the
  "main" thread until all worker threads have completed.  Which of the
  following is the most appropriate function to block until a thread has
  completed?

  - ( ) wait_pid(thread_id, NULL, 0)
  - ( ) pthread_join(pthread_struct, NULL)
  - ( ) pthread_wait(pthread_struct, NULL)
  - ( ) kill(thread_id, 9)


QUIZ: col_sums_single() vs col_sums_worker()
============================================

  In the file csums_thread_funcs.c, compare the two functions
  col_sums_single() and col_sums_worker().

  Which of the following denote similarities between these two
  functions. Mark all that apply.

  - ( ) Both have the same types of arguments
  - ( ) Both have the same basic computational loop to sum some/all of
    matrix columns
  - ( ) Both make use of a local array allocated in the function to
    store partial results
  - ( ) Both use mutexes for controlled access to shared data

  Examine the differences between these two functions and mark ALL that
  apply below.
  - ( ) col_sums_single() sums all columns while col_sums_worker() only
    computes partial sums of columns
  - ( ) col_sums_worker() accepts its argument through a struct that
    must be cast to access its fields
  - ( ) col_sums_single() accesses the mat/vec through locals while
    col_sums_threaded() accesses the mat/vec through a global variable

  A comment in the `col_sums_worker()' function indicates that, before
  adding on to `vec.data[]', a worker thread should lock a mutex.  What
  reason is given for this requirement?
  - ( ) Locking the mutex converts a pointer from NULL to a valid
    location so that the thread does not cause a segmentation fault.
  - ( ) Locking the mutex improves the performance of adding onto
    `vec.data[]' and speedup is the goal
  - ( ) Locking prevents multiple threads from simultaneously modifying
    the results which may corrupt them.
  - ( ) Trick question: this is actually a lock-free solution which does
    not require any locking.


CODE: Complete col_sums_thread()
================================

  Complete the TODO items in csums_thread_funcs.c to finish a threaded
  implementation of the column summing.

  Build it and test it on the command line to ensure that you see some
  speedup from using threads.
  ,----
  | >> make
  | gcc -Wall -Werror -g -Og -c csums_thread_main.c		
  | gcc -Wall -Werror -g -Og -c csums_thread_funcs.c		
  | gcc -Wall -Werror -g -Og -c csums_thread_util.c
  | gcc -Wall -Werror -g -Og -o csums_thread_main csums_thread_main.o csums_thread_funcs.o csums_thread_util.o
  | 
  | >> ./csums_thread_main 
  | usage: ./csums_thread_main <rows> <cols> <thread_count>
  | 
  | >> ./csums_thread_main 10000 10000 2
  | col_sums_single wall time: 9.7008e-02 sec
  | col_sum_threaded wall time: 5.8766e-02 sec
  | 2 threads speedup: 1.65
  | 
  | >> ./csums_thread_main 10000 10000 4
  | col_sums_single wall time: 9.6832e-02 sec
  | col_sum_threaded wall time: 4.8283e-02 sec
  | 4 threads speedup: 2.01
  `----

6 Submission

Follow the instructions at the end of Lab01 if you need a refresher on how to upload your completed lab zip to Gradescope.


Author: Chris Kauffman (profk@umd.edu)
Date: 2024-12-02 Mon 21:37