Last Updated: 2026-04-21 Tue 15:48

CMSC216 Lab10: Cache Optimization and Matrices

CODE DISTRIBUTION: lab10-code.zip

CHANGELOG:

Tue Apr 21 03:48:01 PM EDT 2026
Fixed several references to the "Grace" server which was used in past semesters and replaced them with Zaratan, the server used for the current semester. This should not affect Quiz or Code answers.

1 Rationale

Basic understanding of how the memory system works to fulfill load/store requests is essential when one wishes to optimize program performance to reduce runtime. Having an intuition on which styles of coding exploit cache memories will offer great performance benefits on most modern computing architectures.

The first problem times several code variations using the clock() function. The variations involve several types of structs that are arranged differently in memory, showing how data layout affects memory cache performance. It illustrates how sequential access patterns tend to outperform strided and random access patterns.

The second problem asks students to study a matrix problem which has poor cache performance and adjust the algorithm to use the cache more effectively. This leads to an increase in performance.

Associated Reading / Preparation

Bryant and O'Hallaron: Ch 6 on The Memory Hierarchy. Ch 6.5 and 6.6 may be particularly helpful

Grading Policy

Credit for this lab is earned by completing the code/answers in the Lab codepack and submitting a Zip of the work to Gradescope, preferably via running make submit. Students are responsible for checking that the results produced locally are reflected on Gradescope after submitting their completed Zip.

Lab Exercises are Free Collaboration and students are encouraged to cooperate on labs. Students may submit work as groups of up to 5 to Gradescope: one person submits then adds the names of their group members to the submission.

No late submissions are accepted for Lab work but the lowest two lab scores for the semester will be dropped including zeros due to missing submissions. See the full policies in the course syllabus.

2 Codepack

The codepack for the lab contains the following files:

File State Description
clock_demo.c Provided Problem 1 demo on how the clock() function is used
struct_stride.c EDIT Problem 1 code used to observe CPU timing differences
test_struct_stride.org Testing Problem 1 code tests
matvec.h Provided Problem 2 Header file defining types and functions for matrices/vectors
matvec_util.c Provided Problem 2 Utility functions to manipulate matrices/vectors
matsums_funcs.c EDIT Problem 2 Small "algorithms" to time arithmetic operations
matsums_main.c EDIT Problem 2 Main function that times different summing techniques
test_matsums.org Testing Problem 2 code tests
QUESTIONS.txt EDIT Questions to answer
Makefile Build Enables make test and make zip
test_lab10.org Testing Tests for this lab
testy Testing Test running scripts
gradescope-submit Provided Allows submission from the command line

3 Timing Notes

This lab involves optimizing the performance of a function, and the behavior of the optimization will vary from one machine to the next. If run on an arbitrary machine, the intended optimization techniques may or may not provide the expected speedup.

For evaluation purposes, log into the machine login.zaratan.umd.edu to test the timing. This machine has known characteristics (processor speed and cache behavior) and should provide reasonably consistent results. Other machines like laptops might work but are not guaranteed.

4 Background: The clock() Function

The code block below illustrates the basic usage pattern for the clock() function.

#include <time.h>               // for clock() and clock_t

{
  clock_t begin = clock();      // current cpu moment

  // Perform computation that takes a while

  clock_t end = clock();        // later cpu moment

  double cpu_time =             // convert into seconds
    ((double) (end-begin)) / CLOCKS_PER_SEC;

  printf("Elapsed CPU Time: %f seconds\n", cpu_time);
}

A few caveats are worth pointing out.

  • The clock_t type is often a large integer type like unsigned long which is why one can perform subtraction using it. Don't rely on this being the case and just use the type indicated as shown.
  • clock() itself returns a number corresponding to the number of CPU "ticks" which have occurred while the program runs. This requires conversion into the number of seconds as shown above. It makes use of the CLOCKS_PER_SEC constant which is included via time.h.
  • The time computed by this method is equivalent to the user time reported by the time utility: it is how much CPU time the user program has used. It is relevant to timing computational loops but is complemented by "wall time" which requires use of different timing functions like gettimeofday() to compute.
  • WARNING: Timing code runs is inherently noisy and will vary from one run to the next. clock() is reliable for timing computations that take around 0.001 seconds (a thousandth of a second). For times shorter than that, the variations in timing will likely be nearly as large as the total time which makes timing shorter activities unreliable.

    Adjust program parameters like the number of loop iterations so reported times are at least 1e-03 seconds. Ideally times should be larger, in the 1e-01 second range to be trustworthy.

5 QUESTIONS.txt File Contents

Below are the contents of the QUESTIONS.txt file for the lab. Follow the instructions in it to complete the QUIZ and CODE questions for the lab.

                           _________________

                            LAB10 QUESTIONS
                           _________________


Exercise Instructions
=====================

  Follow the instructions below to experiment with topics related to
  this exercise.
  - For sections marked QUIZ, fill in an (X) for the appropriate
    response in this file. Use the command `make test-quiz' to see if
    all of your answers are correct.
  - For sections marked CODE, complete the code indicated. Use the
    command `make test-code' to check if your code is complete.
  - DO NOT CHANGE any parts of this file except the QUIZ sections as it
    may interfere with the tests otherwise.
  - If your `QUESTIONS.txt' file seems corrupted, restore it by copying
    over the `QUESTIONS.txt.bk' backup file.
  - When you complete the exercises, check your answers with `make test'
    and if all is well. Create a zip file and submit it to Gradescope
    with `make submit'. Ensure that the Autograder there reflects your
    local results.
  - IF YOU WORK IN A GROUP only one member needs to submit and then add
    the names of their group on Gradescope.


Timing Note
===========

  This code has been timed on Zaratan to verify it behaves as expected
  in most cases but as with all timing code, system load and other
  circumstances can affect the actual timing results. If you see
  strange times that seem to conflict with the expected answers, report
  them to staff to get help.


PROBLEM 1 Overview: clock_demo.c Program
========================================

  The lab demo will walk through the `clock_demo.c' program to show how
  the `clock()' function is used in practice.  Students should look
  carefully at the techniques used to time the two different sections of
  code and print that timing info. These will be needed to fill in the
  subsequent programs.

  Running the `clock_demo' program on the command line will produce
  results that look like the following:
  ,----
  | >> make clock_demo
  | gcc -Wall -Werror -g -Og -Wall -Werror -g -Og    clock_demo.c   -o clock_demo
  | 
  | >> ./clock_demo 
  | usage: ./clock_demo <arrlen> <repeats>
  | 
  | >> ./clock_demo 1000 1000
  | Summing array length 1000 with 1000 repeats, ascending
  | Summing array length 1000 with 1000 repeats, descending
  | method:  sum ascending CPU time: 2.3750e-03 sec   sum: 499500
  | method: sum descending CPU time: 1.5760e-03 sec   sum: 499500
  | 
  | >> ./clock_demo 100000 1000
  | Summing array length 100000 with 1000 repeats, ascending
  | Summing array length 100000 with 1000 repeats, descending
  | method:  sum ascending CPU time: 6.6969e-02 sec   sum: 704982704
  | method: sum descending CPU time: 6.2286e-02 sec   sum: 704982704
  | 
  | >> ./clock_demo 100000 10000
  | Summing array length 100000 with 10000 repeats, ascending
  | Summing array length 100000 with 10000 repeats, descending
  | method:  sum ascending CPU time: 6.2730e-01 sec   sum: 704982704
  | method: sum descending CPU time: 6.1995e-01 sec   sum: 704982704
  `----


PROBLEM 1 CODE: `struct_stride.c' Program
=========================================

  The provided `struct_stride.c' program has a number of TODO items in
  it related to timing several computations and reporting their results.
  It is best to complete these items first and then attempt to answer
  the quiz questions as some questions require running the program and
  observing timing results.

  Use the lab's description of how the `clock()' function works to
  complete TODO items and print the results.

  When completed, the program can be run as shown below:
  ,----
  | >> ./struct_stride 
  | usage: ./struct_stride <arr_length> <num_iters>
  | 
  | >> ./struct_stride 10000000 100
  | method: int_field_base CPU time: 1.2345e-01 sec   sum: 0
  | method: arr_field_base CPU time: 1.2345e-01 sec   sum: 0
  | method: int_field_optm CPU time: 1.2345e-01 sec   sum: 0
  | method: arr_field_optm CPU time: 1.2345e-01 sec   sum: 0
  `----

  NOTE: the timing information has intentionally been changed to read
  1.2345e-01 as calculating this timing information is part of the lab.
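
  Times in the transcript above appear in scientific notation like
  `1.2345e-01'. The `%e' conversion of `printf()' produces this form;
  `%.4e' gives four digits after the decimal point. The exact format
  string used by the lab's print statements may differ; this is only a
  sketch of the conversion itself.
  ,----
  | #include <stdio.h>
  | 
  | int main(void){
  |     double cpu_time = 0.12345;           // a made-up example time
  |     printf("method: int_field_base CPU time: %.4e sec   sum: %d\n",
  |            cpu_time, 0);
  |     return 0;
  | }
  `----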


PROBLEM 1 QUIZ: Timing `struct_stride.c' Runs
=============================================

  NOTE: timing code varies from one machine to the next. The answers
  below were tested on Zaratan and appear to be stable but system load
  may affect the outcome.


Relative Speed of Struct Layouts
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  After adding in calls to `clock()' and code to print the times, run
  the `struct_stride' program.

  Run the program with a large array and a modest amount of array
  iterations such as using the following parameters:
  ,----
  | ./struct_stride 6000000 100
  `----

  Examine the times reported.

  Which option below reflects the relative speeds of the
  layout/algorithm combinations?
  ,----
  |   ------SLOWEST--------------------------------------------FASTEST-----
  | - ( ) arr_field_base > arr_field_optm > int_field_base > int_field_optm 
  | - ( ) int_field_base > int_field_optm > arr_field_base > arr_field_optm
  | - ( ) arr_field_base > int_field_base > arr_field_optm > int_field_optm 
  | - ( ) int_field_base > arr_field_base > int_field_optm > arr_field_optm
  `----


Order of Memory Access
~~~~~~~~~~~~~~~~~~~~~~

  Below are several memory layouts of A/B elements to consider.
  -------------------------------------------------------------------------
   Byte Offset  +00  +04  +08  +12  +16       +80  +84  +88  +92  +96      
   LAYOUT 1      A0   A1   A2   A3   A4  ...   B0   B1   B2   B3   B4  ... 
  -------------------------------------------------------------------------

  --------------------------------------------------------------------
   Byte Offset  +00  +04  +08  +12  +16  +20  +24  +28  +32  +36      
   LAYOUT 2      A0   B0   A1   B1   A2   B2   A3   B3   A4   B4  ... 
  --------------------------------------------------------------------

  For each of the following, indicate the best suited option.

  The `int_field_base' approach code that is timed..
  - ( ) Uses memory LAYOUT 1 and visits elements in the order A0, B0,
    A1, B1, A2, B2, etc.
  - ( ) Uses memory LAYOUT 1 and visits elements in the order A0, A1,
    A2, ... B0, B1, B2, etc.
  - ( ) Uses memory LAYOUT 2 and visits elements in the order A0, B0,
    A1, B1, A2, B2, etc.
  - ( ) Uses memory LAYOUT 2 and visits elements in the order A0, A1,
    A2, ... B0, B1, B2, etc.

  The `arr_field_base' approach code that is timed..
  - ( ) Uses memory LAYOUT 1 and visits elements in the order A0, B0,
    A1, B1, A2, B2, etc.
  - ( ) Uses memory LAYOUT 1 and visits elements in the order A0, A1,
    A2, ... B0, B1, B2, etc.
  - ( ) Uses memory LAYOUT 2 and visits elements in the order A0, B0,
    A1, B1, A2, B2, etc.
  - ( ) Uses memory LAYOUT 2 and visits elements in the order A0, A1,
    A2, ... B0, B1, B2, etc.

  The `int_field_optm' approach code that is timed..
  - ( ) Uses memory LAYOUT 1 and visits elements in the order A0, B0,
    A1, B1, A2, B2, etc.
  - ( ) Uses memory LAYOUT 1 and visits elements in the order A0, A1,
    A2, ... B0, B1, B2, etc.
  - ( ) Uses memory LAYOUT 2 and visits elements in the order A0, B0,
    A1, B1, A2, B2, etc.
  - ( ) Uses memory LAYOUT 2 and visits elements in the order A0, A1,
    A2, ... B0, B1, B2, etc.

  The `arr_field_optm' approach code that is timed..
  - ( ) Uses memory LAYOUT 1 and visits elements in the order A0, B0,
    A1, B1, A2, B2, etc.
  - ( ) Uses memory LAYOUT 1 and visits elements in the order A0, A1,
    A2, ... B0, B1, B2, etc.
  - ( ) Uses memory LAYOUT 2 and visits elements in the order A0, B0,
    A1, B1, A2, B2, etc.
  - ( ) Uses memory LAYOUT 2 and visits elements in the order A0, A1,
    A2, ... B0, B1, B2, etc.


int_field_base VS arr_field_base
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  Examine the differences between the two types of structs that are used
  in `struct_stride.c' called `int_field_t' and `arr_field_t'.

  Now examine the first 2 code blocks that use these structs, called
  `int_field_base' and `arr_field_base'. Both involve arrays and
  structs which store an equal number of positive and negative
  integers. However, they differ in the overall layout of those
  integers. Both use loops to sum the "a" numbers first, then sum the
  "b" numbers, then combine them for the total sum.

  Which of the following are possible explanations for the timing
  difference between `int_field_base' and `arr_field_base'?
  - ( ) `int_field_base' must perform more loop iterations than
    `arr_field_base' which makes it slower.
  - ( ) `arr_field_base' uses more memory to store the numbers than
    `int_field_base' and this additional memory increases speed.
  - ( ) `int_field_base' has a memory layout that is ABABABAB so when
    adding A elements, there is a "stride" through
    memory. `arr_field_base' has a layout like AAAAABBBBB so adding
    elements has no stride.
  - ( ) `int_field_base' uses struct field access. The assembly
    instructions that access struct fields are slower than the assembly
    instructions that access array elements. This makes `arr_field_base'
    faster due to its use of plain integer arrays.


BASE vs OPTM versions
~~~~~~~~~~~~~~~~~~~~~

  The last two layout/algorithm sections are labeled "optm" as they
  perform a simple code transformation from their "base" version.

  Select ALL of the items below that are accomplished with this
  transformation.

  - ( ) Fewer loop checks/increments are needed as there is only one
    loop instead of 2.
  - ( ) The number of loop iterations is lowered for all loops in the
    optm version.
  - ( ) The memory stride is eliminated for the int_field_optm as both
    a/b elements are added each iteration.
  - ( ) The algorithmic complexity is reduced from O(N^2) in the "base"
    to O(N) in the "optm" version.


PROBLEM 2 QUIZ: Questions matsums
=================================

(A) Vector and Matrix
~~~~~~~~~~~~~~~~~~~~~

  Examine the header file `matvec.h' which gives type definitions and
  declares functions associated with a simple matrix and vector type.

  Which of the following best describes how the `matrix_t` type in the
  lab stores numerical matrices?

  - ( ) The 2D matrix is stored as a 2D array in the field data; this
    allows direct access via 2D indexing via `mat.data[i][j]`
  - ( ) The type `matrix_t` is defined as the same as a `int[][]` so the
    2D data can be directly accessed via `mat[i][j]`
  - ( ) The 2D matrix is stored as a 1D array in the mat.data field. 2D
    accesses like `(i,j)` must be translated to 1D locations via the
    formula `mat.data[(i*mat.cols) + j]`
  - ( ) The 2D matrix is stored in a hash table with the `mat.data`
    field storing data and a hashing function used to compute the 1D
    location of a 2D element.


(B) Getting and Setting Elements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  For the `matrix_t' and `vector_t' types, a convenient means to get
  and set elements is provided. This mechanism is used in the utility
  functions in `matvec_util.c' and defined in `matvec.h'.

  Which of these best describes the means to access matrix elements
  provided and used in the lab?
  - ( ) The functions `MGET(mat,i,j)` and `MSET(mat,i,j,x)` are
    implemented in the header to get and set elements of a `matrix_t`
  - ( ) The macros `MGET(mat,i,j)` and `MSET(mat,i,j,x)` are `#define`'d
    in the header to get and set elements of a `matrix_t`
  - ( ) Instructions are given about what formula and fields to use when
    accessing and changing elements of the matrix.
  - ( ) No easy way is given to work with the `matrix_t` type or its
    elements.


(C) row_sums vs col_sums Timing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  LOG INTO THE MACHINE Zaratan for timing experiments
  - This is the machine for which the times of this code have been tuned
  - Using a different machine for your timing may lead to problems

  Compile and run the `matsums_main' executable using the provided
  `Makefile' (typing `make' should suffice). Note how to run this
  program: pass the row and column sizes as command line parameters.
  ,----
  | >> ./matsums_main
  | usage: ./matsums_main <rows> <cols>
  `----

  Run this program with differing parameters which control the number of
  rows and columns in the matrix which is summed. Study the results,
  including on a large matrix with 18000 rows and 8000 columns.  Note
  any speed differences between the Row and Column sum routines on large
  matrices.

  Based on this timing, which of the following indicates the relative
  speeds of these two provided algorithms?

  - ( ) Summing ROWS was SLOWER than summing COLUMNS
  - ( ) Summing ROWS was the SAME speed as summing COLUMNS
  - ( ) Summing ROWS was FASTER than summing COLUMNS
  - ( ) The speed varies wildly from run to run so it is harder to say
    which of ROWS or COLUMNS is faster


(D) row_sums vs col_sums speed
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  Examine the source code for the functions `row_sums()' and
  `col_sums()' in the file `matsums_funcs.c'.  Describe why the
  timing/speed differences you observed previously are occurring based
  on features of the source code you see and the layout of the
  `matrix_t' that is being summed. Consider especially how the order in
  which elements are visited in each approach maps onto the actual
  memory layout of the matrix data.

  Which of the following best describes a reason for the relative speeds
  of the algorithms?

  - ( ) The two algorithms had equal speed as they perform the same
    number of operations and have the same `O(R*C)` complexity.
  - ( ) Row Summing was Faster because it accessed memory sequentially
    while Column summing has a large stride in memory leading to poor
    cache performance.
  - ( ) Column Summing was Faster because it accessed memory
    sequentially while Row has a stride in memory leading to poor cache
    performance.
  - ( ) Column summing was slower because the Row summing algorithm
    exhibited better processor pipelining features allowing more
    instructions per cycle to be executed.


PROBLEM 2 CODE: opt_col_sums()
==============================

  Complete the function `opt_col_sums()' in file `matsums_funcs.c'. This
  function should have identical behavior to `col_sums()' (it sums the
  columns of the matrix into a provided array). However, it should be
  *optimized* so that it achieves performance near to that of the
  `row_sums()' function. To achieve this, re-arrange the loops to
  iterate as efficiently as possible with respect to the memory
  system. Feel free to ask course staff for hints on how to do this or
  do some online research.
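
  The general technique here is often called loop interchange. The
  sketch below illustrates it on a plain row-major C array; it
  deliberately does NOT use the lab's `matrix_t' type or its functions,
  so it is a picture of the technique rather than a drop-in answer.
  ,----
  | #include <stdio.h>
  | 
  | #define ROWS 4
  | #define COLS 5
  | 
  | int main(void){
  |     int mat[ROWS][COLS];
  |     for(int i = 0; i < ROWS; i++)
  |         for(int j = 0; j < COLS; j++)
  |             mat[i][j] = i + j;
  | 
  |     int sums_a[COLS] = {0};
  |     for(int j = 0; j < COLS; j++)     // column-by-column traversal:
  |         for(int i = 0; i < ROWS; i++) // each access strides COLS ints
  |             sums_a[j] += mat[i][j];
  | 
  |     int sums_b[COLS] = {0};
  |     for(int i = 0; i < ROWS; i++)     // interchanged loops: memory is
  |         for(int j = 0; j < COLS; j++) // visited sequentially, row by row
  |             sums_b[j] += mat[i][j];
  | 
  |     for(int j = 0; j < COLS; j++)     // both produce identical sums
  |         printf("col %d: %d %d\n", j, sums_a[j], sums_b[j]);
  |     return 0;
  | }
  `----
  Both loop nests compute the same per-column sums; only the order of
  memory accesses differs, which is what matters to the cache on large
  matrices.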

  To time `opt_col_sums()', uncomment relevant blocks in the
  `matsums_main.c' program that are provided for it.

  Run your code on some of the large matrix examples previously used
  such as an 18,000 by 8,000 matrix.  If the speed for the optimized
  column version is close to that of the row sums, try running the test
  cases associated with the code via `make test-code'.

Submission

Follow the instructions at the end of Lab01 if you need a refresher on how to upload your completed exercise zip to Gradescope.


Author: Chris Kauffman (profk@umd.edu)
Date: 2026-04-21 Tue 15:48