CSCI 2021 HW10: Timing Code and Superscalar Processors
- Due: 11:59pm Tue 04-Apr-2023
- Approximately 0.83% of total grade
- Homework and Quizzes are open resource/open collaboration. You must submit your own work but you may freely discuss HW topics with other members of the class.
CODE DISTRIBUTION: hw10-code.zip
- Download the code distribution
- See further setup instructions below
CHANGELOG: Empty
Table of Contents
1 Rationale
Measuring and understanding program performance is a vital aspect to improving efficiency. The notion of time itself can be tricky to understand in computing. Modern processors often exhibit unintuitive behavior for seemingly simple code which can only be accounted for by their pipelined and superscalar nature: multiple functional units exist in the CPU which can at times be utilized in parallel.
This HW introduces basic timing techniques and analyzes some simple loop variations to illustrate peculiarities of modern processors.
Associated Reading / Preparation
- Bryant and O'Hallaron: Ch 4, particularly the "Aside: State-of-the-art microprocessor design" on page 471 (3rd ed) which overviews pipelined and superscalar processors
- The manual page for the time command, which discusses how it is used to measure the amount of time taken by programs
Grading Policy
Credit for this HW is earned by taking the associated HW Quiz which is
linked under Gradescope. The quiz will ask questions similar to those
present in the QUESTIONS.txt file; those who complete all answers in
QUESTIONS.txt should have no trouble with the quiz.
Homework and Quizzes are open resource/open collaboration. You must submit your own work but you may freely discuss HW topics with other members of the class.
See the full policies in the course syllabus.
2 Codepack
The codepack for the lab contains the following files:
| File | Description |
|---|---|
| QUESTIONS.txt | Questions to answer |
| Makefile | Makefile to build the superscalar_main program |
| superscalar_funcs.c | Small "algorithms" to time arithmetic operations |
| superscalar_main.c | Main function that allows different algs to be specified on command line |
| runall.sh | Optional script that runs all functions in superscalar_main |
3 The time Command
The Unix time command is used to measure the execution time of a
program. It is simple to use: to measure how long prog takes to
run, write
> time prog arg1 arg2 ...
For example, to see how long GCC takes to complete a compilation of a
program, use time as follows.
> time gcc -c superscalar_main.c

real 0m0.052s
user 0m0.035s
sys 0m0.017s
The times reported are as follows
- real: "wall clock" time for the program to complete; how long one must wait around for it to finish
- user: the amount of time the CPU was busy executing user code
- sys: the amount of time the CPU was executing operating system code on behalf of the user program such as reading from files or communicating over the network.
Sample time Runs
The following are sample program runs with times reported on them. A commentary is given explaining some of the reasoning behind the times shown.
################################################################################
# EXAMPLE 1: make (lab10 build)
# Compile the lab10 code. This involves mostly user operations as gcc
# does the brunt of the work. However, some files need to be created so
# there is some interaction with the OS kernel.
>> cd lab10-code
>> time make
gcc -Wall -Werror -g -Og -o superscalar_main superscalar_main.c superscalar_funcs.c

real 0m0.139s
user 0m0.101s
sys 0m0.037s

################################################################################
# EXAMPLE 2: ls -lR /sys > /dev/null
# List all files under the /sys directory; don't print them to the
# screen (output goes to /dev/null). Some directories can't be accessed
# which is expected. This involves a lot of interaction with the OS
# kernel to get directory listings so the sys time will be higher.
>> time ls -lR /sys > /dev/null
ls: cannot open directory '/sys/fs/bpf': Permission denied
ls: cannot open directory '/sys/fs/pstore': Permission denied
ls: cannot open directory '/sys/kernel/debug': Permission denied
ls: cannot open directory '/sys/kernel/tracing': Permission denied

real 0m0.298s
user 0m0.102s
sys 0m0.195s

################################################################################
# EXAMPLE 3: ping -c 3 google.com
# Contact google's servers 3 times to see how fast they respond. Most
# of the time this program is waiting on a response so the wall time
# is high and greatly exceeds the sum of the sys and user time.
>> time ping -c 3 google.com
PING google.com(ord38s32-in-x0e.1e100.net (2607:f8b0:4009:81b::200e)) 56 data bytes
64 bytes from ord38s32.1e100.net: icmp_seq=1 ttl=118 time=10.7 ms
64 bytes from ord38s32.1e100.net: icmp_seq=2 ttl=118 time=12.5 ms
64 bytes from ord38s32.1e100.net: icmp_seq=3 ttl=118 time=12.3 ms

--- google.com ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2002ms
rtt min/avg/max/mdev = 10.705/11.833/12.511/0.803 ms

real 0m2.039s
user 0m0.004s
sys 0m0.005s
Timing superscalar_main
For this lab, we will mainly be timing runs of the superscalar_main
program which can be obtained using the following invocation.
csel-kh1260-01> make # BUILD PROGRAM using provided Makefile
gcc -Wall -g -Og -c superscalar_main.c
gcc -Wall -g -Og -c superscalar_funcs.c
gcc -Wall -g -Og -o superscalar_main superscalar_main.o superscalar_funcs.o
csel-kh1260-01> time ./superscalar_main 1 32 add1_diff # TIME RUN OF PROGRAM with options specified
add1_diff for 1 * 2^{32} = 4294967296 iterations... complete
real 0m1.071s
user 0m1.040s
sys 0m0.009s
There should mostly be a direct correspondence between real and
user time, so either works, since little sys computation is required.
4 Important: Run on csel-kh1260-NN
The provided code that you will analyze is meant to demonstrate some
interesting facets of hardware. Different hardware will give different
results. To make sure your results are consistent with the
expectations run your code on csel-kh1260-NN.cselabs.umn.edu
machines. These machines are numbered csel-kh1260-01 to csel-kh1260-20.
You can log in using
> ssh kauf0095@csel-kh1260-07.cselabs.umn.edu
or something similar, adjusting the final number from 07 to any number 01 to 20.
Use your X.500 username/password to get access. All CSE labs machines share the same home directory so any code you download to Vole or a physical lab machine will be available on all machines.
Importantly, expect INCONSISTENT RESULTS in the following environments
- VOLE: this is a virtualized environment, not real hardware. Results are likely to vary from one run to the next
- apollo.cselabs.umn.edu: This is also a virtualized environment and will not produce consistent results.
- Personal machines: hardware will vary for you on your own machine. Results may be faster, slower, consistent with those above or very different. It may be interesting to compare the results on your own machine to those on the test machines.
5 What to Understand
Ensure that you understand
- How to time the run of a program using the time command and what measurements it reports
- Some specifics about how processor pipelining and superscalar operations can lead to unintuitive run times for seemingly similar codes
6 Questions
Analyze the files in the provided codepack and answer the questions
given in QUESTIONS.txt.
_________________
HW 10 QUESTIONS
_________________
Write your answers to the questions below directly in this text file to
prepare for the associated quiz. Credit for the HW is earned by
completing the associated online quiz on Gradescope.
Important: Run on csel-kh1260-NN
================================
The provided code that you will analyze is meant to demonstrate some
interesting facets of hardware. Different hardware will give different
results. To make sure your results are consistent with the
expectations *run your code on csel-kh1260-NN.cselabs.umn.edu*
machines. These machines are numbered `csel-kh1260-01' to
`csel-kh1260-37'. You can log in using
,----
| > ssh kauf0095@csel-kh1260-07.cselabs.umn.edu
`----
or something similar, adjusting the final number from 07 to any number
01 to 37.
Use your X.500 username/password to get access. All CSE labs machines
share the same home directory so any code you download to Vole or a
physical lab machine will be available on all machines.
Importantly, expect INCONSISTENT RESULTS in the following environments
- VOLE: this is a virtualized environment, not real hardware. Results
are likely to vary from one run to the next
- apollo.cselabs.umn.edu: This is also a virtualized environment and
will not produce consistent results.
- Personal machines: hardware will vary for you on your own
machine. Results may be faster, slower, consistent with those above
  or very different. It may be interesting to compare the results
on your own machine to those on the test machines.
Compiling and Timing
====================
Use the provided Makefile to compile as in
,----
| > make
| gcc -Wall -g -Og -c superscalar_main.c
| gcc -Wall -g -Og -c superscalar_funcs.c
| gcc -Wall -g -Og -o superscalar_main superscalar_main.o superscalar_funcs.o
`----
The compilation uses `-Og' (debug level optimization) which is
necessary to stop the compiler from modifying some loops.
This will produce the `superscalar_main' program which has the
following usage:
,----
| > ./superscalar_main
| usage: ./superscalar_main <MULT> <EXP> <ALG>
| <MULT> and <EXP> are integers, iterates for MULT * 2^{EXP} iterations
| <ALG> is one of
| add1_diff : add 1 times in loop
| add2_diff : add 2 times in same loop; different destinations
| add3_diff : add 3 times in same loop; different destinations
| add2_same : add 2 times in same loop; same destinations
| mul1_diff : multiply 1 times in loop
| mul2_diff : multiply 2 times in same loop; different destinations
| mul3_diff : multiply 3 times in same loop; different destinations
| mul4_diff : multiply 4 times in same loop; different destinations
| mul2_same : multiply 2 times in same loop; same destinations
| add1_then_mul_diff : add and multiply in different loops; different destinations
| add1_then_mul_same : add and multiply in different loops; same destinations
| add2_and_mul_diff : add twice and multiply in the same loop; different destinations
| add2_and_mul_same : add twice and multiply in the same loop; same destination
`----
Experiment with algorithm `add1_diff' and parameters `MULT' and `EXP'
which control the number of iterations run. Experiment until you get a
run time of about 1 second such as MULT=1 and EXP=32 on kh-1260-01.
,----
| kh-1260-01> time ./superscalar_main 1 32 add1_diff
| add1_diff for 1 * 2^{32} = 4294967296 iterations... complete
|
| real 0m1.010s
| user 0m1.009s
| sys 0m0.000s
`----
PROBLEM 1: Addition
===================
(A) add1_diff / add2_diff / add3_diff
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Examine the source code in `superscalar_funcs.c' for the following
algorithms.
- add1_diff : add 1 times in loop
- add2_diff : add 2 times in same loop; different destinations
- add3_diff : add 3 times in same loop; different destinations
Note that each does some additions in a loop. Time each of these WITH
THE SAME MULT/EXP parameters and show your timings. Describe how the
timings compare and speculate on why.
Note that all of these involve incrementing a counter (`i++') so the
number of additions is one greater than the algorithm name suggests.
(B) add2_diff / add2_same
~~~~~~~~~~~~~~~~~~~~~~~~~
Compare the source code of the algorithms below.
- add1_diff : add 1 times in loop
- add2_diff : add 2 times in same loop; different destinations
- add2_same : add 2 times in same loop; same destinations
Note that the only difference between the add2_X algorithms is
the destination for the results.
Compare timings for these and speculate on the reasons for any
unexpected results.
PROBLEM 2: Multiplication
=========================
(A) add1_diff / mul1_diff
~~~~~~~~~~~~~~~~~~~~~~~~~
Compare the following two algorithms.
- add1_diff : add 1 times in loop
- mul1_diff : multiply 1 times in loop
Time them on the same parameters and speculate on the timing
differences.
(B) mul1_diff / mul2_diff / mul3_diff / mul4_diff
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Compare the following algorithms.
- mul1_diff : multiply 1 times in loop
- mul2_diff : multiply 2 times in same loop; different destinations
- mul3_diff : multiply 3 times in same loop; different destinations
- mul4_diff : multiply 4 times in same loop; different destinations
Time them on the same parameters and speculate on the timing
differences.
(C) mul1_diff / mul2_diff / mul2_same
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Compare the following algorithms.
- mul1_diff : multiply 1 times in loop
- mul2_diff : multiply 2 times in same loop; different destinations
- mul2_same : multiply 2 times in same loop; same destinations
Time them on the same parameters and speculate on the timing
differences.
PROBLEM 3: Combined Addition/Multiplication
===========================================
(A) add1_then_mul_diff / add2_and_mul_diff
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Compare the following two algorithms.
- add1_then_mul_diff : add and multiply in different loops; different
destinations
- add2_and_mul_diff : add twice and multiply in the same loop;
different destinations
Note that each loop involves incrementing a counter so both of the
above algorithms should do the same number of operations for N
iterations:
------------------------------------------------
loop total overall
Algorithm incr adds adds total
------------------------------------------------
add1_then_mul_diff 2*N 1*N 3*N 4*N
add2_and_mul_diff 1*N 2*N 3*N 4*N
------------------------------------------------
Time these algorithms on the same parameters and speculate on the
timing differences.
Compare the timings to your earlier results for add1_diff and
mul1_diff.
(B) Table of Timings
~~~~~~~~~~~~~~~~~~~~
Consider the table of algorithm variants given below. Part A of this
question compared two of these algorithms. Add their times into the
table below under the 'User Time' column then time the other two
algorithms to complete the table of timings. The table will be
analyzed in the next part.
---------------------------------------------
Memory User
#Loops Locations Algorithm Time
---------------------------------------------
2 Different add1_then_mul_diff
1 Different add2_and_mul_diff
2 Same add1_then_mul_same
1 Same add2_and_mul_same
---------------------------------------------
(C) Table Analysis
~~~~~~~~~~~~~~~~~~
The table below gives some speed comparisons between the different
algorithms. See if you can construct some reasons for these
differences. Note: there isn't a strictly correct answer and in fact
the relative speeds of these routines may change from one processor to
the next.
Alg1 vs Alg2 Reason
----------------------------------------------------
add1_then_mul_diff < add1_then_mul_same ??
add2_and_mul_diff < add2_and_mul_same ??
add1_then_mul_diff > add2_and_mul_diff ??
add1_then_mul_same = add2_and_mul_same ??
7 Optional Information
Additional Information on time Results
In the examples provided earlier for time, it happens that the real time is the sum of
user and sys times.
> time gcc -c superscalar_main.c

real 0m0.052s
user 0m0.035s
sys 0m0.017s
This is not always the case: consider timing a run of ping which
contacts a remote machine to determine how fast messages can be
exchanged.
> time ping -c 3 google.com
PING google.com(ord38s09-in-x0e.1e100.net (2607:f8b0:4009:816::200e)) 56 data bytes
64 bytes from ord38s09-in-x0e.1e100.net (2607:f8b0:4009:816::200e): icmp_seq=1 ttl=54 time=21.1 ms
64 bytes from ord38s09-in-x0e.1e100.net (2607:f8b0:4009:816::200e): icmp_seq=2 ttl=54 time=29.1 ms
64 bytes from ord38s09-in-x0e.1e100.net (2607:f8b0:4009:816::200e): icmp_seq=3 ttl=55 time=18.3 ms

--- google.com ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2000ms
rtt min/avg/max/mdev = 18.386/22.898/29.192/4.589 ms

real 0m2.095s
user 0m0.004s
sys 0m0.006s
In this case real time to execute the program was about 2
seconds. During that period though, very little user or sys
computation was done. Most of the time was spent waiting for the
remote machine to respond so that the sum of user and sys is far
less than real.
Conversely, real may be less than the sum of user and
sys. Consider this timing of make
> time make -j 4
gcc -Wall -g -Og -c superscalar_main.c
gcc -Wall -g -Og -c superscalar_funcs.c
gcc -Wall -g -Og -o superscalar_main superscalar_main.o superscalar_funcs.o

real 0m0.095s
user 0m0.111s
sys 0m0.036s
Note the use of -j 4 which instructs the make program to use up to
4 CPUs to complete jobs that can be done in parallel. This allows
several files to be compiled simultaneously leading to a shorter
real time than the sum of user and sys.
Use of Function Pointers
The superscalar_main.c and superscalar_funcs.c files use an
interesting technique: function pointers. Examining
superscalar_funcs.c one will see a long series of functions.
void add1_diff(unsigned long iters, unsigned long *start, unsigned long *delta)
{
...
}
void mul1_diff(unsigned long iters, unsigned long *start, unsigned long *delta)
{
...
}
...
Each of these has the same prototype: the same return type and parameter types.
The functions are followed by a "table" (array) of structs which includes a pointer to the function (first entry), a string ID associated with the function which is identical to its name (second entry), and a string giving some information on the algorithm.
alg_t algs[] = {
// func name string id description
{add1_diff, "add1_diff", "add 1 times in loop"},
{add2_diff, "add2_diff", "add 2 times in same loop; different destinations"},
...
{add2_and_mul_diff, "add2_and_mul_diff", "add twice and multiply in the same loop; different destinations"},
{add2_and_mul_same, "add2_and_mul_same", "add twice and multiply in the same loop; same destination "},
{NULL, NULL, NULL}
};
The array is "null" terminated in that its last entry is a struct with all NULL fields, making it possible to identify the end of the array without knowing its size.
This creates a convenient way for a main() program to select one of
the functions via a string name. This can be done in a loop over the
array as is seen in superscalar_main.c.
char *alg_name = argv[3];  // <ALG> is the third argument per the usage
alg_func = NULL;
// select algorithm to use
for(int i=0; algs[i].alg_func != NULL; i++){
if(strcmp(alg_name, algs[i].name) == 0){
alg_func = algs[i].alg_func;
}
}
After selecting the function to run in the above loop, it is called with parameters later as in the following line.
alg_func(iters, &start, &delta); // run the specified algorithm
In this way, one can select from a variety of "algorithms" to run by naming the function to run via command line arguments as in
> ./superscalar_main 1 32 add1_diff          # function add1_diff
> ./superscalar_main 1 32 mul1_diff          # function mul1_diff
> ./superscalar_main 1 32 add2_and_mul_diff  # function add2_and_mul_diff