CMSC216 Lab10: Cache Optimization and Matrices
- Due: 11:59pm Sun 26-Apr-2026 on Gradescope
- Approximately 1.00% of total grade
CODE DISTRIBUTION: lab10-code.zip
CHANGELOG:
- Tue Apr 21 03:48:01 PM EDT 2026
- Fixed several references to the "Grace" server, which was used in past semesters, and replaced them with Zaratan, the server used for the current semester. This should not affect Quiz or Code answers.
1 Rationale
Basic understanding of how the memory system works to fulfill load/store requests is essential when one wishes to optimize program performance to reduce runtime. Having an intuition on which styles of coding exploit cache memories will offer great performance benefits on most modern computing architectures.
The first problem times several code variations using the clock()
function. The variations involve several types of structs that are
arranged differently in memory, showing how data layout affects memory
cache performance. It illustrates how sequential access patterns tend
to perform better than strided and random access patterns.
The second problem asks students to study a matrix problem which has poor cache performance and adjust the algorithm to use the cache more effectively. This leads to an increase in performance.
Associated Reading / Preparation
Bryant and O'Hallaron: Ch 6 on The Memory Hierarchy. Sections 6.5 and 6.6 may be particularly helpful.
Grading Policy
Credit for this lab is earned by completing the code/answers in the
lab codepack and submitting a Zip of the work to Gradescope, preferably
via running make submit. Students are responsible for checking that the
results produced locally are reflected on Gradescope after submitting
their completed Zip.
Lab Exercises are Free Collaboration and students are encouraged to cooperate on labs. Students may submit work as groups of up to 5 to Gradescope: one person submits then adds the names of their group members to the submission.
No late submissions are accepted for Lab work but the lowest two lab scores for the semester will be dropped including zeros due to missing submissions. See the full policies in the course syllabus.
2 Codepack
The codepack for the lab contains the following files:
| File | State | Description |
|---|---|---|
| clock_demo.c | Provided | Problem 1 demo of how the clock() function is used |
| struct_stride.c | EDIT | Problem 1 code used to observe CPU timing differences |
| test_struct_stride.org | Testing | Problem 1 code tests |
| matvec.h | Provided | Problem 2 Header file defining some types and functions for matrices/vectors |
| matvec_util.c | Provided | Problem 2 Utility functions to manipulate matrices/vectors |
| matsums_funcs.c | EDIT | Problem 2 Small "algorithms" to time arithmetic operations |
| matsums_main.c | EDIT | Problem 2 Main function that times different summing techniques |
| test_matsums.org | Testing | Problem 2 code tests |
| QUESTIONS.txt | EDIT | Questions to answer |
| Makefile | Build | Enables make test and make zip |
| test_lab10.org | Testing | Tests for this lab |
| testy | Testing | Test running scripts |
| gradescope-submit | Provided | Allows submission from the command line |
3 Timing Notes
This lab involves optimizing the performance of a function, and the behavior of doing so will vary from one machine to the next. If run on an arbitrary machine, the intended optimization techniques may or may not provide the expected speedup.
For evaluation purposes, log into the machine login.zaratan.umd.edu
to test the timing. This machine has known characteristics (processor
speed and cache behavior) and should provide reasonably consistent
results. Other machines like laptops might work but are not
guaranteed.
4 Background: The clock() Function
The code block below illustrates the basic usage pattern for the
clock() function.
#include <stdio.h> // for printf()
#include <time.h>  // for clock() and clock_t
{
clock_t begin = clock(); // current cpu moment
// Perform computation that takes a while;
clock_t end = clock(); // later cpu moment
double cpu_time = // convert into seconds
((double) (end-begin)) / CLOCKS_PER_SEC;
printf("Elapsed CPU Time: %f seconds\n", cpu_time);
}
A few caveats are worth pointing out.
- The clock_t type is often a large integer type like unsigned long,
  which is why one can perform subtraction on it. Don't rely on this
  being the case; just use the type as shown. clock() itself returns a
  number corresponding to the number of CPU "ticks" which have
  occurred while the program runs. This requires conversion into
  seconds as shown above, making use of the CLOCKS_PER_SEC constant
  which is included via time.h.
- The time computed by this method is equivalent to the user time
  reported by the time utility: it is how much CPU time the user
  program has used. It is relevant to timing computational loops but
  is complemented by "wall time", which requires use of different
  timing functions like gettimeofday() to compute.
- WARNING: Timing code runs is inherently noisy and will vary from one
  run to the next. clock() is reliable for timing computations that
  take around 0.001 seconds (a thousandth of a second) or longer. For
  times shorter than that, the variations in timing will likely be
  nearly as large as the total time, which makes timing shorter
  activities unreliable. Adjust program parameters like the number of
  loop iterations so reported times are at least 1e-03 seconds.
  Ideally times should be larger, in the 1e-01 second range, to be
  trustworthy.
5 QUESTIONS.txt File Contents
Below are the contents of the QUESTIONS.txt file for the lab.
Follow the instructions in it to complete the QUIZ and CODE questions
for the lab.
_________________
LAB10 QUESTIONS
_________________
Exercise Instructions
=====================
Follow the instructions below to experiment with topics related to
this exercise.
- For sections marked QUIZ, fill in an (X) for the appropriate
response in this file. Use the command `make test-quiz' to see if
all of your answers are correct.
- For sections marked CODE, complete the code indicated. Use the
command `make test-code' to check if your code is complete.
- DO NOT CHANGE any parts of this file except the QUIZ sections as
  doing so may interfere with the tests.
- If your `QUESTIONS.txt' file seems corrupted, restore it by copying
over the `QUESTIONS.txt.bk' backup file.
- When you complete the exercises, check your answers with `make test'
  and, if all is well, create a zip file and submit it to Gradescope
  with `make submit'. Ensure that the Autograder there reflects your
  local results.
- IF YOU WORK IN A GROUP only one member needs to submit and then add
the names of their group on Gradescope.
Timing Note
===========
This code has been timed on Zaratan to verify it behaves as expected
in most cases but as with all timing code, system load and other
circumstances can affect the actual timing results. If you see strange
times that seem to conflict with the expected answers, report it to
staff to get help.
PROBLEM 1 Overview: clock_demo.c Program
========================================
Lab demonstrators will walk through the `clock_demo.c' program to show how the
`clock()' function is used in practice. Students should look
carefully at the techniques used to time the two different sections of
code and print that timing info. These will be needed to fill in the
subsequent programs.
Running the `clock_demo' program on the command line will produce
results that look like the following:
,----
| >> make clock_demo
| gcc -Wall -Werror -g -Og -Wall -Werror -g -Og clock_demo.c -o clock_demo
|
| >> ./clock_demo
| usage: ./clock_demo <arrlen> <repeats>
|
| >> ./clock_demo 1000 1000
| Summing array length 1000 with 1000 repeats, ascending
| Summing array length 1000 with 1000 repeats, descending
| method: sum ascending CPU time: 2.3750e-03 sec sum: 499500
| method: sum descending CPU time: 1.5760e-03 sec sum: 499500
|
| >> ./clock_demo 100000 1000
| Summing array length 100000 with 1000 repeats, ascending
| Summing array length 100000 with 1000 repeats, descending
| method: sum ascending CPU time: 6.6969e-02 sec sum: 704982704
| method: sum descending CPU time: 6.2286e-02 sec sum: 704982704
|
| >> ./clock_demo 100000 10000
| Summing array length 100000 with 10000 repeats, ascending
| Summing array length 100000 with 10000 repeats, descending
| method: sum ascending CPU time: 6.2730e-01 sec sum: 704982704
| method: sum descending CPU time: 6.1995e-01 sec sum: 704982704
`----
PROBLEM 1 CODE: `struct_stride.c' Program
=========================================
The provided `struct_stride.c' program has a number of TODO items in
it related to timing several computations and reporting their results.
It is best to complete these items first and then attempt to answer
the quiz questions as some questions require running the program and
observing timing results.
Use the lab's description of how the `clock()' function works to
complete TODO items and print the results.
When completed, the program can be run as shown below:
,----
| >> ./struct_stride
| usage: ./struct_stride <arr_length> <num_iters>
|
| >> ./struct_stride 10000000 100
| method: int_field_base CPU time: 1.2345e-01 sec sum: 0
| method: arr_field_base CPU time: 1.2345e-01 sec sum: 0
| method: int_field_optm CPU time: 1.2345e-01 sec sum: 0
| method: arr_field_optm CPU time: 1.2345e-01 sec sum: 0
`----
NOTE: the timing information has intentionally been changed to read
1.2345e-01 as calculating this timing information is part of the lab.
PROBLEM 1 QUIZ: Timing `struct_stride.c' Runs
=============================================
NOTE: timing code varies from one machine to the next. The answers
below were tested on Zaratan and appear to be stable but system load
may affect the outcome.
Relative Speed of Struct Layouts
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
After adding in calls to `clock()' and code to print the times, run
the `struct_stride' program.
Run the program with a large array and a modest amount of array
iterations such as using the following parameters:
,----
| ./struct_stride 6000000 100
`----
Examine the times reported.
Which option below reflects the relative speeds of the
layout/algorithm combinations?
,----
| ------SLOWEST--------------------------------------------FASTEST-----
| - ( ) arr_field_base > arr_field_optm > int_field_base > int_field_optm
| - ( ) int_field_base > int_field_optm > arr_field_base > arr_field_optm
| - ( ) arr_field_base > int_field_base > arr_field_optm > int_field_optm
| - ( ) int_field_base > arr_field_base > int_field_optm > arr_field_optm
`----
Order of Memory Access
~~~~~~~~~~~~~~~~~~~~~~
Below are several memory layouts of A/B elements to consider.
-------------------------------------------------------------------------
Byte Offset +00 +04 +08 +12 +16 ... +80 +84 +88 +92 +96
LAYOUT 1     A0  A1  A2  A3  A4 ...  B0  B1  B2  B3  B4 ...
-------------------------------------------------------------------------
-------------------------------------------------------------------
Byte Offset +00 +04 +08 +12 +16 +20 +24 +28 +32 +36
LAYOUT 2     A0  B0  A1  B1  A2  B2  A3  B3  A4  B4 ...
-------------------------------------------------------------------
For each of the following, indicate the best suited option.
The `int_field_base' approach code that is timed..
- ( ) Uses memory LAYOUT 1 and visits elements in the order A0, B0, A1,
  B1, A2, B2, etc.
- ( ) Uses memory LAYOUT 1 and visits elements in the order A0, A1, A2,
  ... B0, B1, B2, etc.
- ( ) Uses memory LAYOUT 2 and visits elements in the order A0, B0, A1,
  B1, A2, B2, etc.
- ( ) Uses memory LAYOUT 2 and visits elements in the order A0, A1, A2,
  ... B0, B1, B2, etc.
The `arr_field_base' approach code that is timed..
- ( ) Uses memory LAYOUT 1 and visits elements in the order A0, B0, A1,
  B1, A2, B2, etc.
- ( ) Uses memory LAYOUT 1 and visits elements in the order A0, A1, A2,
  ... B0, B1, B2, etc.
- ( ) Uses memory LAYOUT 2 and visits elements in the order A0, B0, A1,
  B1, A2, B2, etc.
- ( ) Uses memory LAYOUT 2 and visits elements in the order A0, A1, A2,
  ... B0, B1, B2, etc.
The `int_field_optm' approach code that is timed..
- ( ) Uses memory LAYOUT 1 and visits elements in the order A0, B0, A1,
  B1, A2, B2, etc.
- ( ) Uses memory LAYOUT 1 and visits elements in the order A0, A1, A2,
  ... B0, B1, B2, etc.
- ( ) Uses memory LAYOUT 2 and visits elements in the order A0, B0, A1,
  B1, A2, B2, etc.
- ( ) Uses memory LAYOUT 2 and visits elements in the order A0, A1, A2,
  ... B0, B1, B2, etc.
The `arr_field_optm' approach code that is timed..
- ( ) Uses memory LAYOUT 1 and visits elements in the order A0, B0, A1,
  B1, A2, B2, etc.
- ( ) Uses memory LAYOUT 1 and visits elements in the order A0, A1, A2,
  ... B0, B1, B2, etc.
- ( ) Uses memory LAYOUT 2 and visits elements in the order A0, B0, A1,
  B1, A2, B2, etc.
- ( ) Uses memory LAYOUT 2 and visits elements in the order A0, A1, A2,
  ... B0, B1, B2, etc.
int_field_base VS arr_field_base
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Examine the differences between the two types of structs that are used
in `struct_stride.c' called `int_field_t' and `arr_field_t'.
Now examine the first 2 code blocks that use these structs, called
`int_field_base' and `arr_field_base'. Both involve arrays and structs
which store an equal number of positive and negative
integers. However, they differ in the overall layout of those
integers. Both use loops to sum the "a" numbers first, then sum the
"b" numbers, then combine them for the total sum.
Which of the following are possible explanations for the timing
difference between `int_field_base' and `arr_field_base'?
- ( ) `int_field_base' must perform more loop iterations than
  `arr_field_base' which makes it slower.
- ( ) `arr_field_base' uses more memory to store the numbers than
  `int_field_base' and this additional memory increases speed.
- ( ) `int_field_base' has a memory layout that is ABABABAB so when
adding A elements, there is a "stride" through
memory. `arr_field_base' has a layout like AAAAABBBBB so adding
elements has no stride.
- ( ) `int_field_base' uses struct field access. The assembly
  instructions to access struct fields are slower than the assembly
  instructions that access array elements. This makes `arr_field_base'
  faster due to its use of plain integer arrays.
BASE vs OPTM versions
~~~~~~~~~~~~~~~~~~~~~
The last two layout/algorithm sections are labeled "optm" as they
perform a simple code transformation from their "base" version.
Select ALL of the items below that are accomplished with this
transformation.
- ( ) Fewer loop checks/increments are needed as there is only one
loop instead of 2.
- ( ) The number of loop iterations is lowered for all loops in the
optm version.
- ( ) The memory stride is eliminated for the int_field_optm as both
a/b elements are added each iteration.
- ( ) The algorithmic complexity is reduced from O(N^2) in the "base"
to O(N) in the "optm" version.
PROBLEM 2 QUIZ: Questions matsums
=================================
(A) Vector and Matrix
~~~~~~~~~~~~~~~~~~~~~
Examine the header file `matvec.h' which gives type definitions and
declares functions associated with a simple matrix and vector type.
Which of the following best describes how the `matrix_t` type in the
lab stores numerical matrices?
- ( ) The 2D matrix is stored as a 2D array in the field data; this
allows direct access via 2D indexing via `mat.data[i][j]`
- ( ) The type `matrix_t` is defined as the same as a `int[][]` so the
2D data can be directly accessed via `mat[i][j]`
- ( ) The 2D matrix is stored as a 1D array in the mat.data field. 2D
  accesses like `(i,j)` must be translated to 1D locations via the
  formula `mat.data[(i*mat.cols) + j]`
- ( ) The 2D matrix is stored in a hash table with the `mat.data`
field storing data and a hashing function used to compute the 1D
location of a 2D element.
(B) Getting and Setting Elements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
For the `matrix_t' and `vector_t' types, convenient means to get and
set elements is provided. This mechanism is used in the utility
functions in `matvec_util.c' and defined in `matvec.h'.
Which of these best describes the means to access matrix elements
provided and used in the lab?
- ( ) The functions `MGET(mat,i,j)` and `MSET(mat,i,j,x)` are
implemented in the header to get and set elements of a `matrix_t`
- ( ) The macros `MGET(mat,i,j)` and `MSET(mat,i,j,x)` are `#define`'d
in the header to get and set elements of a `matrix_t`
- ( ) Instructions are given about what formula and fields to use when
accessing and changing elements of the matrix.
- ( ) No easy way is given to work with the `matrix_t` type or its
  elements.
(C) row_sums vs col_sums Timing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
LOG INTO THE MACHINE Zaratan for timing experiments
- This is the machine for which the times of this code have been tuned
- Using a different machine for your timing may lead to problems
Compile and run the `matsums_main' executable using the provided
`Makefile' (typing `make' should suffice). Note how to run this
program: pass the row and column sizes as command line parameters.
,----
| >> ./matsums_main
| usage: ./matsums_main <rows> <cols>
`----
Run this program with differing parameters which control the number of
rows and columns in the matrix which is summed. Study the results,
including on a large matrix with 18000 rows and 8000 columns. Note
any speed differences between the Row and Column sum routines on large
matrices.
The results of this timing indicate the following relative speeds of
these two provided algorithms.
- ( ) Summing ROWS was SLOWER than summing COLUMNS
- ( ) Summing ROWS was the SAME speed as summing COLUMNS
- ( ) Summing ROWS was FASTER than summing COLUMNS
- ( ) The speed varies wildly from run to run so it is harder to say
which of ROWS or COLUMNS is faster
(D) row_sums vs col_sums speed
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Examine the source code for the functions `row_sums()' and
`col_sums()' in the file `matsums_funcs.c'. Describe why the
timing/speed differences you observed previously are occurring based
on features of the source code you see and the layout of the
`matrix_t' that is being summed. Consider especially the memory layout
and how the order of elements visited in each approach maps to the
actual layout of the matrix data.
Which of the following best describes a reason for the relative speeds
of the algorithms?
- ( ) The two algorithms had equal speed as they perform the same
number of operations and have the same `O(R*C)` complexity.
- ( ) Row Summing was Faster because it accessed memory sequentially
while Column summing has a large stride in memory leading to poor
cache performance.
- ( ) Column Summing was Faster because it accessed memory
sequentially while Row has a stride in memory leading to poor cache
performance.
- ( ) Column summing was slower because the Row summing algorithm
exhibited better processor pipelining features allowing more
instructions per cycle to be executed.
PROBLEM 2 CODE: opt_col_sums()
==============================
Complete the function `opt_col_sums()' in file `matsums_funcs.c'. This
function should have identical behavior to `col_sums()' (it sums the
columns of the matrix into a provided array). However, it should be
*optimized* so that it achieves performance near to that of the
`row_sums()' function. To achieve this, re-arrange the loops to
iterate as efficiently as possible with respect to the memory
system. Feel free to ask course staff for hints on how to do this or
do some online research.
To time `opt_col_sums()', uncomment relevant blocks in the
`matsums_main.c' program that are provided for it.
Run your code on some of the large matrix examples previously used
such as an 18,000 by 8,000 matrix. If the speed for the optimized
column version is close to that of the row sums, try running the test
cases associated with the code via `make test-code'.
Submission
Follow the instructions at the end of Lab01 if you need a refresher on how to upload your completed exercise zip to Gradescope.