CMSC216 Lab12: Matrices and Memory Optimization
- Due: 11:59pm Mon 09-Dec-2024 on Gradescope
- Approximately 1.00% of total grade
CODE DISTRIBUTION: lab12-code.zip
CHANGELOG:
- Mon Dec 2 09:32:44 PM EST 2024
  More bugs with the testing code were identified which cause running
  tests repeatedly to spuriously fail with a "Cannot create directory:
  file exists" error. This has been corrected in the current code
  pack. Students can adjust the test_lab12.org file manually by
  finding and editing the following line (line 40):

    >> mkdir -p test-results    ## LINE 40: ADD THE -p OPTION TO mkdir
- Mon Dec 2 11:09:14 AM EST 2024
  A minor bug was reported in discussion which causes the first run of
  make test-code to fail with an error like "no such file or
  directory". Running make test-code a second time completes the test
  normally. This test can be corrected by editing test_lab12.org and
  adding the indicated line:

    ...
    * CODE: csums_threaded Performance
    ** Performance
    Runs the ~csums_thread_main~ once and checks that the speed of
    running with 2 threads has sufficient speedup (at least 1.5x).
    This test is flaky in that on any virtualized platform (GRACE,
    Gradescope) speedup will be somewhat unreliable so the bar is set
    low.
    #+TESTY: timeout=25
    #+BEGIN_SRC sh
    >> mkdir test-results                    #### ADD THIS LINE ####
    >> resultfile=test-results/out.tmp
    >> ./csums_thread_main 10000 10000 2 > $resultfile

  The line has been added into the current codepack. Since make test
  is not affected, this bug will not affect results on Gradescope.
1 Rationale
Threads are frequently used to improve the performance of applications when a computation can be subdivided between the threads. This exercise explores doing so in the context of summing elements in a matrix. Completing it will give a basic overview of launching threads, divvying up work, and synchronizing the threads as they contribute to a shared result.
Associated Reading / Preparation
Bryant and O'Hallaron: Ch 12.3-6 covers threaded programming.
Grading Policy
Credit for this exercise is earned by completing the code/answers here
and submitting a Zip of the work to Gradescope. Students are
responsible for checking that the results produced locally via make test
are reflected on Gradescope after submitting their completed
Zip. Successful completion earns 1 Engagement Point.
Lab Exercises are open resource/open collaboration and students are encouraged to cooperate on labs. Students may submit work as groups of up to 5 to Gradescope: one person submits then adds the names of their group members to the submission.
See the full policies in the course syllabus.
2 Codepack
The codepack for the lab contains the following files:
| File                 | State    | Description                                                              |
|----------------------|----------|--------------------------------------------------------------------------|
| QUESTIONS.txt        | EDIT     | Questions to answer: fill in the multiple choice selections in this file |
| csums_thread.h       | Provided | Header file defining some types and functions for the lab                |
| csums_thread_util.c  | Provided | Utility functions to manipulate matrices/vectors                         |
| csums_thread_funcs.c | EDIT     | Functions for column summing, including the threaded version to complete |
| csums_thread_main.c  | EDIT     | Main function that times different summing techniques                    |
| Makefile             | Build    | Enables make test and make zip                                           |
| QUESTIONS.txt.bk     | Backup   | Backup copy of the original file to help revert if needed                |
| QUESTIONS.md5        | Testing  | Checksum for answers in questions file                                   |
| test_quiz_filter     | Testing  | Filter to extract answers from Questions file, used in testing           |
| test_lab12.org       | Testing  | Tests for this lab                                                       |
| testy                | Testing  | Test running scripts                                                     |
3 Overview and Background
This lab deals with matrices and vectors again, similar to a previous
lab/HW. Review that material if you need a refresher on data types
such as matrix_t and vector_t.

The code focuses on completing two functions, col_sums_threaded() and
col_sums_worker(), to produce a threaded column sums implementation.
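
If the earlier lab is not at hand, the sketch below suggests the
rough shape of these types. It is an illustrative assumption only:
the actual matrix_t and vector_t are defined in csums_thread.h and
may differ in field names and element types.

,----
| // Illustrative sketch ONLY: the real definitions live in
| // csums_thread.h and may use different fields or element types.
| typedef struct {
|   long rows, cols;   // matrix dimensions
|   int *data;         // rows*cols elements in row-major order
| } matrix_t;
| 
| typedef struct {
|   long len;          // number of elements
|   int *data;         // contiguous storage, e.g. one sum per column
| } vector_t;
| 
| // Row-major indexing: element (i,j) lives at offset i*cols + j, e.g.
| // int elem = mat.data[i * mat.cols + j];
`----

The key property to keep in mind is the row-major layout: elements of
a row are adjacent in memory, which matters for the cache behavior
discussed later in this lab.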
4 Basic Overview of Threads
Threads allow the creation of multiple "threads of execution" within the same program. One primary reason to do this is to attempt to speed up the execution of a program by utilizing several threads, each of which can be independently scheduled to run. This allows multiple cores/CPUs to be used simultaneously, which may speed up a computation.
The general flow of thread usage is as follows (a minimal code sketch appears after the list).
- A "main" thread will create several worker threads.
- Worker threads usually begin execution in a function of some sort. While they share most memory with the main thread and each other, each thread often needs to be pointed at some contextual information on what part of the computation it should perform.
- Worker threads perform their own computations as independently as possible. When they need to coordinate, they will often use a mutex, an operating system supported data structure that enables mutual exclusion of threads: only one thread at a time will perform certain operations to avoid potential problems.
- The main thread will wait for all worker threads to complete their tasks, then perform any final wrap-up before moving to the next task or reporting answers.
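
As a concrete illustration of this flow, here is a minimal,
self-contained sketch (not part of the codepack): a main thread
creates workers with pthread_create(), each worker adds its
contribution to a shared total under a mutex, and the main thread
uses pthread_join() to block until each worker finishes. The
context_t struct and the stand-in computation are illustrative
inventions, not code from the lab.

,----
| #include <stdio.h>
| #include <pthread.h>
| 
| #define NTHREADS 4
| 
| long total = 0;                                    // shared result
| pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;  // protects 'total'
| 
| typedef struct {          // per-thread contextual information
|   int id;                 // which worker this is
|   long partial;           // this worker's contribution
| } context_t;
| 
| void *worker(void *arg) {
|   context_t *ctx = arg;                 // each worker has its own context
|   ctx->partial = (ctx->id + 1) * 100;   // stand-in for a real computation
|   pthread_mutex_lock(&lock);            // only one thread updates at a time
|   total += ctx->partial;
|   pthread_mutex_unlock(&lock);
|   return NULL;
| }
| 
| int main(void) {
|   pthread_t threads[NTHREADS];
|   context_t ctx[NTHREADS];
|   for (int i = 0; i < NTHREADS; i++) {  // main thread creates workers
|     ctx[i].id = i;
|     pthread_create(&threads[i], NULL, worker, &ctx[i]);
|   }
|   for (int i = 0; i < NTHREADS; i++) {  // block until workers complete
|     pthread_join(threads[i], NULL);
|   }
|   printf("total: %ld\n", total);        // final wrap-up: report answer
|   return 0;
| }
`----

Compile a program like this with gcc's -pthread option so the
pthreads library is linked in.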
This flow is present in the provided code in csums_thread_main.c and
csums_thread_funcs.c. Examine the flow carefully along with the
specific function calls used to facilitate it.
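
To connect the general flow to this lab's task, the sketch below
shows one plausible shape for a column-sum worker: each worker
handles a slice of the matrix rows, accumulates private partial sums,
then adds them onto the shared result vector while holding a
mutex. The worker_ctx_t struct, its field names, and the row-slicing
scheme are assumptions for illustration; the actual names and
division of labor are dictated by csums_thread.h and the TODO
comments in csums_thread_funcs.c. The matrix_t/vector_t types are
from the sketch in Section 3.

,----
| #include <stdlib.h>
| #include <pthread.h>
| 
| // Hypothetical worker context; the real one is in the codepack.
| typedef struct {
|   matrix_t *mat;            // shared input matrix
|   vector_t *vec;            // shared output: one sum per column
|   pthread_mutex_t *lock;    // guards vec->data[]
|   long begin_row, end_row;  // this worker's slice of rows [begin,end)
| } worker_ctx_t;
| 
| void *col_sums_worker(void *arg) {
|   worker_ctx_t *ctx = arg;                   // cast from void*
|   long cols = ctx->mat->cols;
|   int *partial = calloc(cols, sizeof(int));  // private partial sums
|   for (long i = ctx->begin_row; i < ctx->end_row; i++) {
|     for (long j = 0; j < cols; j++) {        // row-major: stride-1 access
|       partial[j] += ctx->mat->data[i * cols + j];
|     }
|   }
|   pthread_mutex_lock(ctx->lock);             // one thread at a time adds
|   for (long j = 0; j < cols; j++) {          // onto the shared results
|     ctx->vec->data[j] += partial[j];
|   }
|   pthread_mutex_unlock(ctx->lock);
|   free(partial);
|   return NULL;
| }
`----

Accumulating into a private partial[] array first means the mutex is
locked once per worker rather than once per element, keeping the
serialized portion of the computation small.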
5 QUESTIONS.txt File Contents
Below are the contents of the QUESTIONS.txt file for the lab. Follow
the instructions in it to complete the QUIZ and CODE questions for
the lab.
_________________
LAB12 QUESTIONS
_________________

Exercise Instructions
=====================

Follow the instructions below to experiment with topics related to
this exercise.

- For sections marked QUIZ, fill in an (X) for the appropriate
  response in this file. Use the command `make test-quiz' to see if
  all of your answers are correct.
- For sections marked CODE, complete the code indicated. Use the
  command `make test-code' to check if your code is complete.
- DO NOT CHANGE any parts of this file except the QUIZ sections as it
  may interfere with the tests otherwise.
- If your `QUESTIONS.txt' file seems corrupted, restore it by copying
  over the `QUESTIONS.txt.bk' backup file.
- When you complete the exercises, check your answers with `make test'
  and if all is well, create a zip file with `make zip' and upload it
  to Gradescope. Ensure that the Autograder there reflects your local
  results.
- IF YOU WORK IN A GROUP only one member needs to submit and then add
  the names of their group.

OVERVIEW
========

This lab emphasizes optimizing a matrix computation using two
techniques.
1. Ensuring the matrix is accessed in a pattern that is favorable to
   cache.
2. Employing multiple threads to subdivide the computation for
   parallel execution.

UNLESS BOTH THESE TECHNIQUES ARE USED, PERFORMANCE WILL BE SUBOPTIMAL.

Both techniques are discussed as they are both needed for an upcoming
project. Technique (1) has been previously discussed in an assignment
but is worth reviewing.

QUIZ: col_sums_A() vs col_sums_B()
==================================

In the file `example_matsums_funcs.c' which contains the solution to
a previous lab/HW, what is the primary difference between the two
functions `col_sums_A() / col_sums_B()'?

(A) Code Differences
~~~~~~~~~~~~~~~~~~~~

- ( ) One version allocates and uses more memory than the other which
  will likely affect their relative speeds
- ( ) One version uses a row-major matrix while the other uses a
  column-major matrix meaning that the underlying memory layout of
  the matrices used is different between the functions
- ( ) They both compute the sums of all columns but traverse the
  matrix in a different order
- ( ) They compute two different properties of the columns of the
  matrix and therefore necessarily need to traverse the matrix in a
  different order

(B) Speed Differences
~~~~~~~~~~~~~~~~~~~~~

Between the two versions, which is faster and why?

- ( ) col_sums_A() is faster as it traverses the elements of the
  matrix with no stride, sequentially visiting memory addresses in a
  pattern that is greatly sped up by CPU caches.
- ( ) col_sums_B() is faster as it traverses the elements of the
  matrix with no stride, sequentially visiting memory addresses in a
  pattern that is greatly sped up by CPU caches.
- ( ) col_sums_A() is faster as it allocates less memory overall so
  that there is less "memory cache pressure" and the computation can
  proceed faster
- ( ) col_sums_B() is faster as it allocates less memory overall so
  that there is less "memory cache pressure" and the computation can
  proceed faster

QUIZ: Thread Startup and Shutdown
=================================

(A) Thread Creation and Completion
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Examine `csums_thread_main.c' and `csums_thread_funcs.c'. Determine
where worker threads are created and the function used to do so.

- ( ) Worker threads are created in `main()' using a version of
  `fork()' with optional arguments that were not previously used to
  create processes.
- ( ) Worker threads are created in `main()' using the
  `pthread_create()' function which has as one of its arguments a
  function that threads start running
- ( ) Worker threads are created in `col_sums_threaded()' using a
  version of `fork()' with optional arguments that were not
  previously used to create processes.
- ( ) Worker threads are created in `col_sums_threaded()' using the
  `pthread_create()' function which has as one of its arguments a
  function that threads start running

(B) Thread Completion
~~~~~~~~~~~~~~~~~~~~~

Research the function that is used to block a thread until another
thread completes. This is used in `csums_thread_funcs.c' to block the
"main" thread until all worker threads have completed. Which of the
following is the most appropriate function to block until a thread
has completed?

- ( ) wait_pid(thread_id, NULL, 0)
- ( ) pthread_join(pthread_struct, NULL)
- ( ) pthread_wait(pthread_struct, NULL)
- ( ) kill(thread_id, 9)

QUIZ: col_sums_single() vs col_sums_worker()
============================================

In the file csums_thread_funcs.c, compare the two functions
col_sums_single() and col_sums_worker(). Which of the following
denote similarities between these two functions? Mark all that apply.

- ( ) Both have the same types of arguments
- ( ) Both have the same basic computational loop to sum some/all of
  matrix columns
- ( ) Both make use of a local array allocated in the function to
  store partial results
- ( ) Both use mutexes for controlled access to shared data

Examine the differences between these two functions and mark ALL that
apply below.

- ( ) col_sums_single() sums all columns while col_sums_worker() only
  computes partial sums of columns
- ( ) col_sums_worker() accepts its argument through a struct that
  must be cast to access its fields
- ( ) col_sums_single() accesses the mat/vec through locals while
  col_sums_threaded() accesses the mat/vec through a global variable

A comment in the `col_sums_worker()' function indicates that, before
adding on to `vec.data[]', a worker thread should lock a mutex. What
reason is given for this requirement?

- ( ) Locking the mutex converts a pointer from NULL to a valid
  location so that the thread does not cause a segmentation fault.
- ( ) Locking the mutex improves the performance of adding onto
  `vec.data[]' and speedup is the goal
- ( ) Locking prevents multiple threads from simultaneously modifying
  the results which may corrupt them.
- ( ) Trick question: this is actually a lock-free solution which
  does not require any locking.

CODE: Complete col_sums_threaded()
==================================

Complete the TODO items in csums_thread_funcs.c to finish a threaded
implementation of the column summing. Build it and test it on the
command line to ensure that you see some speedup from using threads.

,----
| >> make
| gcc -Wall -Werror -g -Og -c csums_thread_main.c
| gcc -Wall -Werror -g -Og -c csums_thread_funcs.c
| gcc -Wall -Werror -g -Og -c csums_thread_util.c
| gcc -Wall -Werror -g -Og -o csums_thread_main csums_thread_main.o csums_thread_funcs.o csums_thread_util.o
| 
| >> ./csums_thread_main
| usage: ./csums_thread_main <rows> <cols> <thread_count>
| 
| >> ./csums_thread_main 10000 10000 2
| col_sums_single wall time: 9.7008e-02 sec
| col_sum_threaded wall time: 5.8766e-02 sec
| 2 threads speedup: 1.65
| 
| >> ./csums_thread_main 10000 10000 4
| col_sums_single wall time: 9.6832e-02 sec
| col_sum_threaded wall time: 4.8283e-02 sec
| 4 threads speedup: 2.01
`----
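
The first QUIZ above contrasts two traversal orders for the same
column-sum computation on a row-major matrix. In miniature, the two
loop orders look like the sketch below (variable names are
illustrative, not the ones in example_matsums_funcs.c); which of
col_sums_A() / col_sums_B() uses which order is what the quiz asks
you to determine.

,----
| // Both loops compute identical column sums, but at different speeds.
| 
| // Stride-1 traversal: the inner loop walks consecutive memory
| // addresses, so the CPU cache serves most accesses.
| for (long i = 0; i < rows; i++) {
|   for (long j = 0; j < cols; j++) {
|     sums[j] += data[i * cols + j];
|   }
| }
| 
| // Strided traversal: consecutive accesses are 'cols' elements
| // apart, so most accesses miss the cache.
| for (long j = 0; j < cols; j++) {
|   for (long i = 0; i < rows; i++) {
|     sums[j] += data[i * cols + j];
|   }
| }
`----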
6 Submission
Follow the instructions at the end of Lab01 if you need a refresher on how to upload your completed lab zip to Gradescope.