CSCI 2021 HW10: Timing Code and Superscalar Processors
- Due: 11:59pm Tue 04-Apr-2023
- Approximately 0.83% of total grade
- Homework and Quizzes are open resource/open collaboration. You must submit your own work but you may freely discuss HW topics with other members of the class.
CODE DISTRIBUTION: hw10-code.zip
- Download the code distribution
- See further setup instructions below
CHANGELOG: Empty
Table of Contents
1 Rationale
Measuring and understanding program performance is a vital aspect to improving efficiency. The notion of time itself can be tricky to understand in computing. Modern processors often exhibit unintuitive behavior for seemingly simple code which can only be accounted for by their pipelined and superscalar nature: multiple functional units exist in the CPU which can at times be utilized in parallel.
This HW introduces basic timing techniques and analyzes some simple loop variations to illustrate peculiarities of modern processors.
Associated Reading / Preparation
- Bryant and O'Hallaron: Ch 4, particularly the "Aside: State-of-the-art microprocessor design" on page 471 (3rd ed) which overviews pipelined and superscalar processors
- The manual page for the time command, which discusses how it is used to measure the amount of time taken by programs
Grading Policy
Credit for this HW is earned by taking the associated HW Quiz which is
linked under Gradescope. The quiz will ask questions similar to those
present in the QUESTIONS.txt file; those who complete all answers in
QUESTIONS.txt should have no trouble with the quiz.
Homework and Quizzes are open resource/open collaboration. You must submit your own work but you may freely discuss HW topics with other members of the class.
See the full policies in the course syllabus.
2 Codepack
The codepack for the lab contains the following files:
| File | Description |
|---|---|
| QUESTIONS.txt | Questions to answer |
| Makefile | Makefile to build the superscalar_main program |
| superscalar_funcs.c | Small "algorithms" to time arithmetic operations |
| superscalar_main.c | Main function that allows different algs to be specified on command line |
| runall.sh | Optional script that runs all functions in superscalar_main |
3 The time Command
The Unix time command is used to measure the execution time of a
program. It is simple to use: to measure how long prog takes to
run, write
> time prog arg1 arg2 ...
For example, to see how long GCC takes to complete a compilation of a
program, use time as follows.
> time gcc -c superscalar_main.c

real 0m0.052s
user 0m0.035s
sys 0m0.017s
The times reported are as follows
- real: "wall clock" time for the program to complete; how long one must wait around for it to finish
- user: the amount of time the CPU was busy executing user code
- sys: the amount of time the CPU was executing operating system code on behalf of the user program such as reading from files or communicating over the network.
Sample time Runs
The following are sample program runs with times reported on them. A commentary is given explaining some of the reasoning behind the times shown.
################################################################################
# EXAMPLE 1: make (lab10 build)
# Compile the lab10 code. This involves mostly user operations as gcc
# does the brunt of the work. However, some files need to be created so
# there is some interaction with the OS kernel.
>> cd lab10-code
>> time make
gcc -Wall -Werror -g -Og -o superscalar_main superscalar_main.c superscalar_funcs.c

real 0m0.139s
user 0m0.101s
sys 0m0.037s

################################################################################
# EXAMPLE 2: ls -lR /sys > /dev/null
# List all files under the /sys directory; don't print them to the
# screen (output goes to /dev/null). Some directories can't be accessed
# which is expected. This involves a lot of interaction with the OS
# kernel to get directory listings so the sys time will be higher.
>> time ls -lR /sys > /dev/null
ls: cannot open directory '/sys/fs/bpf': Permission denied
ls: cannot open directory '/sys/fs/pstore': Permission denied
ls: cannot open directory '/sys/kernel/debug': Permission denied
ls: cannot open directory '/sys/kernel/tracing': Permission denied

real 0m0.298s
user 0m0.102s
sys 0m0.195s

################################################################################
# EXAMPLE 3: ping -c 3 google.com
# Contact google's servers 3 times to see how fast they respond. Most
# of the time this program is waiting on a response so the wall time
# is high and greatly exceeds the sum of the sys and user time.
>> time ping -c 3 google.com
PING google.com(ord38s32-in-x0e.1e100.net (2607:f8b0:4009:81b::200e)) 56 data bytes
64 bytes from ord38s32.1e100.net: icmp_seq=1 ttl=118 time=10.7 ms
64 bytes from ord38s32.1e100.net: icmp_seq=2 ttl=118 time=12.5 ms
64 bytes from ord38s32.1e100.net: icmp_seq=3 ttl=118 time=12.3 ms

--- google.com ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2002ms
rtt min/avg/max/mdev = 10.705/11.833/12.511/0.803 ms

real 0m2.039s
user 0m0.004s
sys 0m0.005s
Timing superscalar_main
For this lab, we will mainly be timing runs of the superscalar_main
program which can be obtained using the following invocation.
csel-kh1260-01> make # BUILD PROGRAM using provided Makefile
gcc -Wall -g -Og -c superscalar_main.c
gcc -Wall -g -Og -c superscalar_funcs.c
gcc -Wall -g -Og -o superscalar_main superscalar_main.o superscalar_funcs.o
csel-kh1260-01> time ./superscalar_main 1 32 add1_diff # TIME RUN OF PROGRAM with options specified
add1_diff for 1 * 2^{32} = 4294967296 iterations... complete
real 0m1.071s
user 0m1.040s
sys 0m0.009s
There should mostly be a direct correspondence between real and
user time, so either works, since little sys computation is required.
4 Important: Run on csel-kh1260-NN
The provided code that you will analyze is meant to demonstrate some
interesting facets of hardware. Different hardware will give different
results. To make sure your results are consistent with the
expectations run your code on csel-kh1260-NN.cselabs.umn.edu
machines. These machines are numbered csel-kh1260-01 to csel-kh1260-20.
You can log in using
> ssh kauf0095@csel-kh1260-07.cselabs.umn.edu
or something similar, adjusting the final number from 07 to any number 01 to 20.
Use your X.500 username/password to get access. All CSE labs machines share the same home directory so any code you download to Vole or a physical lab machine will be available on all machines.
Importantly, expect INCONSISTENT RESULTS in the following environments
- VOLE: this is a virtualized environment, not real hardware. Results are likely to vary from one run to the next
- apollo.cselabs.umn.edu: This is also a virtualized environment and will not produce consistent results.
- Personal machines: hardware will vary for you on your own machine. Results may be faster, slower, consistent with those above or very different. It may be interesting to compare the results on your own machine to those on the test machines.
5 What to Understand
Ensure that you understand
- How to time the run of a program using the time command and what measurements it reports
- Some specifics about how processor pipelining and superscalar operations can lead to unintuitive run times for seemingly similar codes
6 Questions
Analyze the files in the provided codepack and answer the questions
given in QUESTIONS.txt.
_________________
HW 10 QUESTIONS
_________________
Write your answers to the questions below directly in this text file to
prepare for the associated quiz. Credit for the HW is earned by
completing the associated online quiz on Gradescope.
Important: Run on csel-kh1260-NN
================================
The provided code that you will analyze is meant to demonstrate some
interesting facets of hardware. Different hardware will give different
results. To make sure your results are consistent with the
expectations *run your code on csel-kh1260-NN.cselabs.umn.edu*
machines. These machines are numbered `csel-kh1260-01' to
`csel-kh1260-37'. You can log in using
,----
| > ssh kauf0095@csel-kh1260-07.cselabs.umn.edu
`----
or something similar, adjusting the final number from 07 to any number
01 to 37.
Use your X.500 username/password to get access. All CSE labs machines
share the same home directory so any code you download to Vole or a
physical lab machine will be available on all machines.
Importantly, expect INCONSISTENT RESULTS in the following environments
- VOLE: this is a virtualized environment, not real hardware. Results
are likely to vary from one run to the next
- apollo.cselabs.umn.edu: This is also a virtualized environment and
will not produce consistent results.
- Personal machines: hardware will vary for you on your own
machine. Results may be faster, slower, consistent with those above
  or very different. It may be interesting to compare the results
on your own machine to those on the test machines.
Compiling and Timing
====================
Use the provided Makefile to compile as in
,----
| > make
| gcc -Wall -g -Og -c superscalar_main.c
| gcc -Wall -g -Og -c superscalar_funcs.c
| gcc -Wall -g -Og -o superscalar_main superscalar_main.o superscalar_funcs.o
`----
The compilation uses `-Og' (debug level optimization) which is
necessary to stop the compiler from modifying some loops.
This will produce the `superscalar_main' program which has the
following usage:
,----
| > ./superscalar_main
| usage: ./superscalar_main <MULT> <EXP> <ALG>
| <MULT> and <EXP> are integers, iterates for MULT * 2^{EXP} iterations
| <ALG> is one of
| add1_diff : add 1 times in loop
| add2_diff : add 2 times in same loop; different destinations
| add3_diff : add 3 times in same loop; different destinations
| add2_same : add 2 times in same loop; same destinations
| mul1_diff : multiply 1 times in loop
| mul2_diff : multiply 2 times in same loop; different destinations
| mul3_diff : multiply 3 times in same loop; different destinations
| mul4_diff : multiply 4 times in same loop; different destinations
| mul2_same : multiply 2 times in same loop; same destinations
| add1_then_mul_diff : add and multiply in different loops; different destinations
| add1_then_mul_same : add and multiply in different loops; same destinations
| add2_and_mul_diff : add twice and multiply in the same loop; different destinations
| add2_and_mul_same : add twice and multiply in the same loop; same destination
`----
Experiment with algorithm `add1_diff' and parameters `MULT' and `EXP'
which control the number of iterations run. Experiment until you get a
run time of about 1 second such as MULT=1 and EXP=32 on kh-1260-01.
,----
| kh-1260-01> time ./superscalar_main 1 32 add1_diff
| add1_diff for 1 * 2^{32} = 4294967296 iterations... complete
|
| real 0m1.010s
| user 0m1.009s
| sys 0m0.000s
`----
PROBLEM 1: Addition
===================
(A) add1_diff / add2_diff / add3_diff
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Examine the source code in `superscalar_funcs.c' for the following
algorithms.
- add1_diff : add 1 times in loop
- add2_diff : add 2 times in same loop; different destinations
- add3_diff : add 3 times in same loop; different destinations
Note that each does some additions in a loop. Time each of these WITH
THE SAME MULT/EXP parameters and show your timings. Describe how the
timings compare and speculate on why.
Note that all of these involve incrementing a counter (`i++') so the
number of additions is one greater than the algorithm name suggests.
(B) add2_diff / add2_same
~~~~~~~~~~~~~~~~~~~~~~~~~
Compare the source code of the algorithms below.
- add1_diff : add 1 times in loop
- add2_diff : add 2 times in same loop; different destinations
- add2_same : add 2 times in same loop; same destinations
Note that the only difference between the add2_X algorithms is
the destination for the results.
Compare timings for these and speculate on the reasons for any
unexpected results.
PROBLEM 2: Multiplication
=========================
(A) add1_diff / mul1_diff
~~~~~~~~~~~~~~~~~~~~~~~~~
Compare the following two algorithms.
- add1_diff : add 1 times in loop
- mul1_diff : multiply 1 times in loop
Time them on the same parameters and speculate on the timing
differences.
(B) mul1_diff / mul2_diff / mul3_diff / mul4_diff
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Compare the following algorithms.
- mul1_diff : multiply 1 times in loop
- mul2_diff : multiply 2 times in same loop; different destinations
- mul3_diff : multiply 3 times in same loop; different destinations
- mul4_diff : multiply 4 times in same loop; different destinations
Time them on the same parameters and speculate on the timing
differences.
(C) mul1_diff / mul2_diff / mul2_same
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Compare the following algorithms.
- mul1_diff : multiply 1 times in loop
- mul2_diff : multiply 2 times in same loop; different destinations
- mul2_same : multiply 2 times in same loop; same destinations
Time them on the same parameters and speculate on the timing
differences.
PROBLEM 3: Combined Addition/Multiplication
===========================================
(A) add1_then_mul_diff / add2_and_mul_diff
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Compare the following two algorithms.
- add1_then_mul_diff : add and multiply in different loops; different
destinations
- add2_and_mul_diff : add twice and multiply in the same loop;
different destinations
Note that each loop involves incrementing a counter so both of the
above algorithms should do the same number of operations for N
iterations:
------------------------------------------------
loop total overall
Algorithm incr adds adds total
------------------------------------------------
add1_then_mul_diff 2*N 1*N 3*N 4*N
add2_and_mul_diff 1*N 2*N 3*N 4*N
------------------------------------------------
Time these algorithms on the same parameters and speculate on the
timing differences.
Compare the timings to your earlier results for add1_diff and
mul1_diff.
(B) Table of Timings
~~~~~~~~~~~~~~~~~~~~
Consider the table of algorithm variants given below. Part A of this
question compared two of these algorithms. Add their times into the
table below under the 'User Time' column then time the other two
algorithms to complete the table of timings. The table will be
analyzed in the next part.
---------------------------------------------
Memory User
#Loops Locations Algorithm Time
---------------------------------------------
2 Different add1_then_mul_diff
1 Different add2_and_mul_diff
2 Same add1_then_mul_same
1 Same add2_and_mul_same
---------------------------------------------
(C) Table Analysis
~~~~~~~~~~~~~~~~~~
The table below gives some speed comparisons between the different
algorithms. See if you can construct some reasons for these
differences. Note: there isn't a strictly correct answer and in fact
the relative speeds of these routines may change from one processor to
the next.
Alg1 vs Alg2 Reason
----------------------------------------------------
add1_then_mul_diff < add1_then_mul_same ??
add2_and_mul_diff < add2_and_mul_same ??
add1_then_mul_diff > add2_and_mul_diff ??
add1_then_mul_same = add2_and_mul_same ??
7 Optional Information
Additional Information on time Results
In the examples provided earlier for time, it happens that the real time is the sum of
user and sys times.
> time gcc -c superscalar_main.c

real 0m0.052s
user 0m0.035s
sys 0m0.017s
This is not always the case: consider timing a run of ping which
contacts a remote machine to determine how fast messages can be
exchanged.
> time ping -c 3 google.com
PING google.com(ord38s09-in-x0e.1e100.net (2607:f8b0:4009:816::200e)) 56 data bytes
64 bytes from ord38s09-in-x0e.1e100.net (2607:f8b0:4009:816::200e): icmp_seq=1 ttl=54 time=21.1 ms
64 bytes from ord38s09-in-x0e.1e100.net (2607:f8b0:4009:816::200e): icmp_seq=2 ttl=54 time=29.1 ms
64 bytes from ord38s09-in-x0e.1e100.net (2607:f8b0:4009:816::200e): icmp_seq=3 ttl=55 time=18.3 ms

--- google.com ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2000ms
rtt min/avg/max/mdev = 18.386/22.898/29.192/4.589 ms

real 0m2.095s
user 0m0.004s
sys 0m0.006s
In this case real time to execute the program was about 2
seconds. During that period though, very little user or sys
computation was done. Most of the time was spent waiting for the
remote machine to respond so that the sum of user and sys is far
less than real.
Conversely, real may be less than the sum of user and
sys. Consider this timing of make
> time make -j 4
gcc -Wall -g -Og -c superscalar_main.c
gcc -Wall -g -Og -c superscalar_funcs.c
gcc -Wall -g -Og -o superscalar_main superscalar_main.o superscalar_funcs.o

real 0m0.095s
user 0m0.111s
sys 0m0.036s
Note the use of -j 4 which instructs the make program to use up to
4 CPUs to complete jobs that can be done in parallel. This allows
several files to be compiled simultaneously leading to a shorter
real time than the sum of user and sys.
Use of Function Pointers
The superscalar_main.c and superscalar_funcs.c files use an
interesting technique: function pointers. Examining
superscalar_funcs.c one will see a long series of functions.
void add1_diff(unsigned long iters, unsigned long *start, unsigned long *delta)
{
...
}
void mul1_diff(unsigned long iters, unsigned long *start, unsigned long *delta)
{
...
}
...
Each of these has the same prototype: the same return type and parameter types.
The functions are followed by a "table" (array) of structs which includes a pointer to the function (first entry), a string ID associated with the function which is identical to its name (second entry), and a string giving some information on the algorithm.
alg_t algs[] = {
// func name string id description
{add1_diff, "add1_diff", "add 1 times in loop"},
{add2_diff, "add2_diff", "add 2 times in same loop; different destinations"},
...
{add2_and_mul_diff, "add2_and_mul_diff", "add twice and multiply in the same loop; different destinations"},
{add2_and_mul_same, "add2_and_mul_same", "add twice and multiply in the same loop; same destination "},
{NULL, NULL, NULL}
};
The array is "null" terminated in that its last entry is a struct with all NULL fields, making it possible to identify the end of the array without knowing its size.
This creates a convenient way for a main() program to select one of
the functions via a string name. This can be done in a loop over the
array as is seen in superscalar_main.c.
char *alg_name = argv[3];  // <ALG> is the third argument per the usage
alg_func = NULL;
// select algorithm to use
for(int i=0; algs[i].alg_func != NULL; i++){
if(strcmp(alg_name, algs[i].name) == 0){
alg_func = algs[i].alg_func;
}
}
After selecting the function to run in the above loop, it is called with parameters later as in the following line.
alg_func(iters, &start, &delta); // run the specified algorithm
In this way, one can select from a variety of "algorithms" to run by naming the function to run via command line arguments as in
> ./superscalar_main 1 32 add1_diff          # function add1_diff
> ./superscalar_main 1 32 mul1_diff          # function mul1_diff
> ./superscalar_main 1 32 add2_and_mul_diff  # function add2_and_mul_diff