CSCI 2021 HW10: Timing Code and Superscalar Processors
- Due: 11:59pm Tue 04-Apr-2023
- Approximately 0.83% of total grade
- Homework and Quizzes are open resource/open collaboration. You must submit your own work but you may freely discuss HW topics with other members of the class.
CODE DISTRIBUTION: hw10-code.zip
- Download the code distribution
- See further setup instructions below
CHANGELOG: Empty
1 Rationale
Measuring and understanding program performance is a vital aspect to improving efficiency. The notion of time itself can be tricky to understand in computing. Modern processors often exhibit unintuitive behavior for seemingly simple code which can only be accounted for by their pipelined and superscalar nature: multiple functional units exist in the CPU which can at times be utilized in parallel.
This HW introduces basic timing techniques and analyzes some simple loop variations to illustrate peculiarities of modern processors.
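As a concrete (and purely hypothetical) illustration of why this matters, the sketch below shows two loop variants in the spirit of those timed by this HW; the actual functions live in superscalar_funcs.c and may differ in detail. In the first loop the two additions write to different variables, so a superscalar CPU can often execute them in parallel on separate functional units; in the second loop each addition depends on the result of the one before it, forming a serial dependency chain.

/* Hypothetical sketches -- NOT the actual superscalar_funcs.c code */
void add2_diff_sketch(unsigned long iters, unsigned long *a, unsigned long *b) {
  unsigned long x = *a, y = *b;
  for(unsigned long i = 0; i < iters; i++){
    x += 1;                      // independent additions: the CPU may
    y += 1;                      // issue these to separate adders
  }
  *a = x; *b = y;
}

void add2_same_sketch(unsigned long iters, unsigned long *a) {
  unsigned long x = *a;
  for(unsigned long i = 0; i < iters; i++){
    x += 1;                      // the second addition must wait for the
    x += 1;                      // first: a serial dependency chain
  }
  *a = x;
}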
Associated Reading / Preparation
- Bryant and O'Hallaron: Ch 4, particularly the "Aside: State-of-the-art microprocessor design" on page 471 (3rd ed) which overviews pipelined and superscalar processors
- The manual page for the time command, which discusses how it is used to measure the amount of time taken by programs
Grading Policy
Credit for this HW is earned by taking the associated HW Quiz which is linked under Gradescope. The quiz will ask questions similar to those present in the QUESTIONS.txt file, and those that complete all answers in QUESTIONS.txt should have no trouble with the quiz.
Homework and Quizzes are open resource/open collaboration. You must submit your own work but you may freely discuss HW topics with other members of the class.
See the full policies in the course syllabus.
2 Codepack
The codepack for the lab contains the following files:
File                | Description
--------------------|--------------------------------------------------------------------------
QUESTIONS.txt       | Questions to answer
Makefile            | Makefile to build the superscalar_main program
superscalar_funcs.c | Small "algorithms" to time arithmetic operations
superscalar_main.c  | Main function that allows different algs to be specified on command line
runall.sh           | Optional script that runs all functions in superscalar_main
3 The time Command
The Unix time command is used to measure the execution time of a program. It is simple to use: to measure how long prog takes to run, write
> time prog arg1 arg2 ...
For example, to see how long GCC takes to complete a compilation of a program, use time as follows.
> time gcc -c superscalar_main.c

real    0m0.052s
user    0m0.035s
sys     0m0.017s
The times reported are as follows
- real: "wall clock" time for the program to complete; how long one must wait around for it to finish
- user: the amount of time the CPU was busy executing user code
- sys: the amount of time the CPU was executing operating system code on behalf of the user program such as reading from files or communicating over the network.
Sample time Runs
The following are sample program runs with times reported on them. A commentary is given explaining some of the reasoning behind the times shown.
################################################################################
# EXAMPLE 1: make (lab10 build)
# Compile the lab10 code. This involves mostly user operations as gcc
# does the brunt of the work. However, some files need to be created so
# there is some interaction with the OS kernel.
>> cd lab10-code
>> time make
gcc -Wall -Werror -g -Og -o superscalar_main superscalar_main.c superscalar_funcs.c

real    0m0.139s
user    0m0.101s
sys     0m0.037s

################################################################################
# EXAMPLE 2: ls -lR /sys > /dev/null
# List all files under the /sys directory; don't print them to the
# screen (output into > /dev/null). Some directories can't be accessed
# which is expected. This involves a lot of interactions with the OS
# kernel to get directory listings so the sys time will be higher.
>> time ls -lR /sys > /dev/null
ls: cannot open directory '/sys/fs/bpf': Permission denied
ls: cannot open directory '/sys/fs/pstore': Permission denied
ls: cannot open directory '/sys/kernel/debug': Permission denied
ls: cannot open directory '/sys/kernel/tracing': Permission denied

real    0m0.298s
user    0m0.102s
sys     0m0.195s

################################################################################
# EXAMPLE 3: ping -c 3 google.com
# Contact Google's servers 3 times to see how fast they respond. Most
# of the time this program is waiting on a response so the wall time
# is high and greatly exceeds the sum of the sys and user time.
>> time ping -c 3 google.com
PING google.com(ord38s32-in-x0e.1e100.net (2607:f8b0:4009:81b::200e)) 56 data bytes
64 bytes from ord38s32.1e100.net: icmp_seq=1 ttl=118 time=10.7 ms
64 bytes from ord38s32.1e100.net: icmp_seq=2 ttl=118 time=12.5 ms
64 bytes from ord38s32.1e100.net: icmp_seq=3 ttl=118 time=12.3 ms

--- google.com ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2002ms
rtt min/avg/max/mdev = 10.705/11.833/12.511/0.803 ms

real    0m2.039s
user    0m0.004s
sys     0m0.005s
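The time command measures a whole program run from the outside. For reference, a program can also time regions of its own code; the following minimal sketch (not part of the codepack) uses the POSIX clock_gettime call to measure both wall-clock time, akin to real, and process CPU time, roughly user plus sys, around a loop. Compile it with something like gcc -Og timing_demo.c -o timing_demo.

/* timing_demo.c -- hypothetical example, not part of the codepack */
#include <stdio.h>
#include <time.h>

int main(void) {
  struct timespec wall_beg, wall_end, cpu_beg, cpu_end;
  clock_gettime(CLOCK_REALTIME, &wall_beg);            // wall-clock start (akin to real)
  clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &cpu_beg);   // CPU-time start (roughly user+sys)

  volatile unsigned long sum = 0;                      // some work to time; volatile keeps
  for(unsigned long i = 0; i < 100000000UL; i++){      // the loop from being optimized away
    sum += i;
  }

  clock_gettime(CLOCK_REALTIME, &wall_end);
  clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &cpu_end);

  double wall = (wall_end.tv_sec - wall_beg.tv_sec) + (wall_end.tv_nsec - wall_beg.tv_nsec)/1e9;
  double cpu  = (cpu_end.tv_sec  - cpu_beg.tv_sec)  + (cpu_end.tv_nsec  - cpu_beg.tv_nsec)/1e9;
  printf("wall: %.3f sec   cpu: %.3f sec\n", wall, cpu);
  return 0;
}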
Timing superscalar_main
For this lab, we will mainly be timing runs of the superscalar_main program, which is built and timed using the following invocations.
csel-kh1260-01> make                                       # BUILD PROGRAM using provided Makefile
gcc -Wall -g -Og -c superscalar_main.c
gcc -Wall -g -Og -c superscalar_funcs.c
gcc -Wall -g -Og -o superscalar_main superscalar_main.o superscalar_funcs.o
csel-kh1260-01> time ./superscalar_main 1 32 add1_diff     # TIME RUN OF PROGRAM with options specified
add1_diff for 1 * 2^{32} = 4294967296 iterations... complete

real    0m1.071s
user    0m1.040s
sys     0m0.009s
There should mostly be a direct correspondence between the real and user times since little sys computation is required, so either measurement works for this HW.
4 Important: Run on csel-kh1260-NN
The provided code that you will analyze is meant to demonstrate some
interesting facets of hardware. Different hardware will give different
results. To make sure your results are consistent with the
expectations, run your code on csel-kh1260-NN.cselabs.umn.edu machines. These machines are numbered csel-kh1260-01 to csel-kh1260-20.
You can log in using
> ssh kauf0095@csel-kh1260-07.cselabs.umn.edu
or something similar, adjusting the final number from 07 to any number from 01 to 20.
Use your X.500 username/password to get access. All CSE labs machines share the same home directory so any code you download to Vole or a physical lab machine will be available on all machines.
Importantly, expect INCONSISTENT RESULTS in the following environments
- VOLE: this is a virtualized environment, not real hardware. Results are likely to vary from one run to the next
- apollo.cselabs.umn.edu: This is also a virtualized environment and will not produce consistent results.
- Personal machines: hardware will vary on your own machine. Results may be faster, slower, consistent with those above, or very different. It may be interesting to compare the results on your own machine to those on the test machines.
5 What to Understand
Ensure that you understand
- How to time the run of a program using the time command and what measurements it reports
- Some specifics about how processor pipelining and superscalar operations can lead to unintuitive run times for seemingly similar codes
6 Questions
Analyze the files in the provided codepack and answer the questions given in QUESTIONS.txt.
                             _________________
                              HW 10 QUESTIONS
                             _________________

Write your answers to the questions below directly in this text file to
prepare for the associated quiz. Credit for the HW is earned by
completing the associated online quiz on Gradescope.


Important: Run on csel-kh1260-NN
================================

The provided code that you will analyze is meant to demonstrate some
interesting facets of hardware. Different hardware will give different
results. To make sure your results are consistent with the expectations
*run your code on csel-kh1260-NN.cselabs.umn.edu* machines. These
machines are numbered `csel-kh1260-01' to `csel-kh1260-37'. You can log
in using
,----
| > ssh kauf0095@csel-kh1260-07.cselabs.umn.edu
`----
or something similar, adjusting the final number from 07 to any number
from 01 to 37.

Use your X.500 username/password to get access. All CSE labs machines
share the same home directory so any code you download to Vole or a
physical lab machine will be available on all machines.

Importantly, expect INCONSISTENT RESULTS in the following environments
- VOLE: this is a virtualized environment, not real hardware. Results
  are likely to vary from one run to the next
- apollo.cselabs.umn.edu: This is also a virtualized environment and
  will not produce consistent results.
- Personal machines: hardware will vary on your own machine. Results
  may be faster, slower, consistent with those above, or very
  different. It may be interesting to compare the results on your own
  machine to those on the test machines.


Compiling and Timing
====================

Use the provided Makefile to compile as in
,----
| > make
| gcc -Wall -g -Og -c superscalar_main.c
| gcc -Wall -g -Og -c superscalar_funcs.c
| gcc -Wall -g -Og -o superscalar_main superscalar_main.o superscalar_funcs.o
`----

The compilation uses `-Og' (debug level optimization) which is
necessary to stop the compiler from modifying some loops. This will
produce the `superscalar_main' program which has the following usage:
,----
| > ./superscalar_main
| usage: ./superscalar_main <MULT> <EXP> <ALG>
| <MULT> and <EXP> are integers, iterates for MULT * 2^{EXP} iterations
| <ALG> is one of
|   add1_diff          : add 1 times in loop
|   add2_diff          : add 2 times in same loop; different destinations
|   add3_diff          : add 3 times in same loop; different destinations
|   add2_same          : add 2 times in same loop; same destinations
|   mul1_diff          : multiply 1 times in loop
|   mul2_diff          : multiply 2 times in same loop; different destinations
|   mul3_diff          : multiply 3 times in same loop; different destinations
|   mul4_diff          : multiply 4 times in same loop; different destinations
|   mul2_same          : multiply 2 times in same loop; same destinations
|   add1_then_mul_diff : add and multiply in different loops; different destinations
|   add1_then_mul_same : add and multiply in different loops; same destinations
|   add2_and_mul_diff  : add twice and multiply in the same loop; different destinations
|   add2_and_mul_same  : add twice and multiply in the same loop; same destination
`----

Experiment with algorithm `add1_diff' and parameters `MULT' and `EXP'
which control the number of iterations run. Experiment until you get a
run time of about 1 second, such as MULT=1 and EXP=32 on kh-1260-01.
,----
| kh-1260-01> time ./superscalar_main 1 32 add1_diff
| add1_diff for 1 * 2^{32} = 4294967296 iterations... complete
|
| real    0m1.010s
| user    0m1.009s
| sys     0m0.000s
`----


PROBLEM 1: Addition
===================

(A) add1_diff / add2_diff / add3_diff
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Examine the source code in `superscalar_funcs.c' for the following
algorithms.
- add1_diff : add 1 times in loop
- add2_diff : add 2 times in same loop; different destinations
- add3_diff : add 3 times in same loop; different destinations

Note that each does some additions in a loop. Time each of these WITH
THE SAME MULT/EXP parameters and show your timings. Describe how the
timings compare and speculate on why.

Note that all of these involve incrementing a counter (`i++') so the
number of additions is one greater than the algorithm name suggests.

(B) add2_diff / add2_same
~~~~~~~~~~~~~~~~~~~~~~~~~

Compare the source code of the algorithms below.
- add1_diff : add 1 times in loop
- add2_diff : add 2 times in same loop; different destinations
- add2_same : add 2 times in same loop; same destinations

Note that the only difference between the add2_X algorithms is the
destination for the results. Compare timings for these and speculate on
the reasons for any unexpected results.


PROBLEM 2: Multiplication
=========================

(A) add1_diff / mul1_diff
~~~~~~~~~~~~~~~~~~~~~~~~~

Compare the following two algorithms.
- add1_diff : add 1 times in loop
- mul1_diff : multiply 1 times in loop

Time them on the same parameters and speculate on the timing
differences.

(B) mul1_diff / mul2_diff / mul3_diff / mul4_diff
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Compare the following four algorithms.
- mul1_diff : multiply 1 times in loop
- mul2_diff : multiply 2 times in same loop; different destinations
- mul3_diff : multiply 3 times in same loop; different destinations
- mul4_diff : multiply 4 times in same loop; different destinations

Time them on the same parameters and speculate on the timing
differences.

(C) mul1_diff / mul2_diff / mul2_same
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Compare the following three algorithms.
- mul1_diff : multiply 1 times in loop
- mul2_diff : multiply 2 times in same loop; different destinations
- mul2_same : multiply 2 times in same loop; same destinations

Time them on the same parameters and speculate on the timing
differences.


PROBLEM 3: Combined Addition/Multiplication
===========================================

(A) add1_then_mul_diff / add2_and_mul_diff
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Compare the following two algorithms.
- add1_then_mul_diff : add and multiply in different loops; different destinations
- add2_and_mul_diff  : add twice and multiply in the same loop; different destinations

Note that each loop involves incrementing a counter so both of the
above algorithms should do the same number of operations for N
iterations:

------------------------------------------------
                     loop        total  overall
Algorithm            incr  adds  adds   total
------------------------------------------------
add1_then_mul_diff   2*N   1*N   3*N    4*N
add2_and_mul_diff    1*N   2*N   3*N    4*N
------------------------------------------------

Time these algorithms on the same parameters and speculate on the
timing differences. Compare the timings to your earlier results for
add1_diff and mul1_diff.

(B) Table of Timings
~~~~~~~~~~~~~~~~~~~~

Consider the table of algorithm variants given below. Part A of this
question compared two of these algorithms. Add their times into the
table below under the 'User Time' column then time the other two
algorithms to complete the table of timings. The table will be analyzed
in the next part.

---------------------------------------------
        Memory                         User
#Loops  Locations  Algorithm           Time
---------------------------------------------
2       Different  add1_then_mul_diff
1       Different  add2_and_mul_diff
2       Same       add1_then_mul_same
1       Same       add2_and_mul_same
---------------------------------------------

(C) Table Analysis
~~~~~~~~~~~~~~~~~~

The table below gives some speed comparisons between the different
algorithms. See if you can construct some reasons for these
differences. Note: there isn't a strictly correct answer and in fact
the relative speeds of these routines may change from one processor to
the next.

Alg1                 vs  Alg2                 Reason
----------------------------------------------------
add1_then_mul_diff   <   add1_then_mul_same   ??
add2_and_mul_diff    <   add2_and_mul_same    ??
add1_then_mul_diff   >   add2_and_mul_diff    ??
add1_then_mul_same   =   add2_and_mul_same    ??
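As an aside (this is not part of QUESTIONS.txt), the sketch below shows the general shape of the two loop structures compared in Problem 3. It is hypothetical, and the real implementations in superscalar_funcs.c may differ in detail, but the operation counts line up with the table in Part A: two loops of N iterations each versus one loop that does both additions and the multiply.

/* Hypothetical sketches -- NOT the actual superscalar_funcs.c code */
void add1_then_mul_diff_sketch(unsigned long N, unsigned long *a, unsigned long *b) {
  unsigned long x = *a, y = *b;
  for(unsigned long i = 0; i < N; i++){   // loop 1: N adds + N counter increments
    x += 1;
  }
  for(unsigned long i = 0; i < N; i++){   // loop 2: N multiplies + N counter increments
    y *= 3;
  }
  *a = x; *b = y;
}

void add2_and_mul_diff_sketch(unsigned long N, unsigned long *a, unsigned long *b, unsigned long *c) {
  unsigned long x = *a, y = *b, z = *c;
  for(unsigned long i = 0; i < N; i++){   // one loop: 2 adds + 1 multiply + 1 counter increment
    x += 1;
    y += 1;
    z *= 3;
  }
  *a = x; *b = y; *c = z;
}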
7 Optional Information
Additional Information on time Results
In the examples provided earlier for time, it happens that the real time is the sum of the user and sys times.
> time gcc -c superscalar_main.c

real    0m0.052s
user    0m0.035s
sys     0m0.017s
This is not always the case: consider timing a run of ping, which contacts a remote machine to determine how fast messages can be exchanged.
> time ping -c 3 google.com
PING google.com(ord38s09-in-x0e.1e100.net (2607:f8b0:4009:816::200e)) 56 data bytes
64 bytes from ord38s09-in-x0e.1e100.net (2607:f8b0:4009:816::200e): icmp_seq=1 ttl=54 time=21.1 ms
64 bytes from ord38s09-in-x0e.1e100.net (2607:f8b0:4009:816::200e): icmp_seq=2 ttl=54 time=29.1 ms
64 bytes from ord38s09-in-x0e.1e100.net (2607:f8b0:4009:816::200e): icmp_seq=3 ttl=55 time=18.3 ms

--- google.com ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2000ms
rtt min/avg/max/mdev = 18.386/22.898/29.192/4.589 ms

real    0m2.095s
user    0m0.004s
sys     0m0.006s
In this case the real time to execute the program was about 2 seconds. During that period, though, very little user or sys computation was done. Most of the time was spent waiting for the remote machine to respond, so the sum of user and sys is far less than real.
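The same pattern can be reproduced without the network by a program that simply sleeps. The hypothetical sketch below (not part of the codepack) spends nearly all of its run blocked in the kernel, so timing it should show a real time of about 2 seconds with user and sys near zero.

/* sleep_demo.c -- hypothetical example, not part of the codepack */
#include <unistd.h>

int main(void) {
  sleep(2);      // blocked in the kernel; consumes essentially no CPU time
  return 0;
}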
Conversely, real may be less than the sum of user and sys. Consider this timing of make:
> time make -j 4
gcc -Wall -g -Og -c superscalar_main.c
gcc -Wall -g -Og -c superscalar_funcs.c
gcc -Wall -g -Og -o superscalar_main superscalar_main.o superscalar_funcs.o

real    0m0.095s
user    0m0.111s
sys     0m0.036s
Note the use of -j 4, which instructs the make program to use up to 4 CPUs to complete jobs that can be done in parallel. This allows several files to be compiled simultaneously, leading to a shorter real time than the sum of user and sys.
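A single multi-threaded program can show the same effect. The hypothetical sketch below (not part of the codepack) spins two threads at once; on a multicore machine, timing it should report a user time approaching twice the real time. Compile with something like gcc -Og -pthread spin_threads.c -o spin_threads.

/* spin_threads.c -- hypothetical example, not part of the codepack */
#include <pthread.h>
#include <stdio.h>

#define ITERS 400000000UL

void *spin(void *arg) {
  (void)arg;
  volatile unsigned long sum = 0;          // volatile keeps the loop from being optimized away
  for(unsigned long i = 0; i < ITERS; i++){
    sum += i;
  }
  return NULL;
}

int main(void) {
  pthread_t t1, t2;
  pthread_create(&t1, NULL, spin, NULL);   // both threads busy-loop at the same time,
  pthread_create(&t2, NULL, spin, NULL);   // accumulating user time on two cores at once
  pthread_join(t1, NULL);
  pthread_join(t2, NULL);
  printf("done\n");
  return 0;
}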
Use of Function Pointers
The superscalar_main.c and superscalar_funcs.c files use an interesting technique: function pointers. Examining superscalar_funcs.c, one will see a long series of functions.
void add1_diff(unsigned long iters, unsigned long *start, unsigned long *delta) {
  ...
}

void mul1_diff(unsigned long iters, unsigned long *start, unsigned long *delta) {
  ...
}
...
Each of these has the same prototype: the same return type and parameter types.
The functions are followed by a "table" (array) of structs which includes a pointer to the function (first entry), a string ID associated with the function which is identical to its name (second entry), and a string giving some information on the algorithm.
alg_t algs[] = {
  // func               name string id          description
  {add1_diff,           "add1_diff",            "add 1 times in loop"},
  {add2_diff,           "add2_diff",            "add 2 times in same loop; different destinations"},
  ...
  {add2_and_mul_diff,   "add2_and_mul_diff",    "add twice and multiply in the same loop; different destinations"},
  {add2_and_mul_same,   "add2_and_mul_same",    "add twice and multiply in the same loop; same destination "},
  {NULL, NULL, NULL}
};
The array is "null" terminated in that its last entry is a struct with all NULL fields, making it possible to identify the end of the array without knowing its size.
This creates a convenient way for a main() program to select one of the functions via a string name. This can be done in a loop over the array, as is seen in superscalar_main.c.
char *alg_name = argv[3];              // <ALG> is the third command line argument
alg_func = NULL;
// select algorithm to use
for(int i=0; algs[i].alg_func != NULL; i++){
  if(strcmp(alg_name, algs[i].name) == 0){
    alg_func = algs[i].alg_func;
  }
}
After selecting the function to run in the above loop, it is called with parameters later as in the following line.
alg_func(iters, &start, &delta); // run the specified algorithm
In this way, one can select from a variety of "algorithms" to run by naming the function to run via command line arguments as in
> ./superscalar_main 1 32 add1_diff             # function add1_diff
> ./superscalar_main 1 32 mul1_diff             # function mul1_diff
> ./superscalar_main 1 32 add2_and_mul_diff     # function add2_and_mul_diff
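For reference, here is a small self-contained sketch of the same function-pointer-table technique, separate from the codepack and using made-up names: a NULL-terminated array of structs maps string names to functions, and main() looks up and calls whichever one is named on the command line.

/* funcptr_demo.c -- hypothetical standalone example, not part of the codepack */
#include <stdio.h>
#include <string.h>

void say_hello(void){ printf("hello\n"); }
void say_bye(void)  { printf("bye\n");   }

typedef struct {
  void (*func)(void);                   // pointer to a function with this prototype
  char *name;                           // string id used to select the function
} entry_t;

entry_t table[] = {
  {say_hello, "hello"},
  {say_bye,   "bye"},
  {NULL,      NULL},                    // "null" terminated like the algs[] array
};

int main(int argc, char *argv[]) {
  if(argc < 2){
    printf("usage: %s <name>\n", argv[0]);
    return 1;
  }
  for(int i=0; table[i].func != NULL; i++){       // scan until the NULL terminator
    if(strcmp(argv[1], table[i].name) == 0){
      table[i].func();                            // call through the function pointer
      return 0;
    }
  }
  printf("no function named '%s'\n", argv[1]);
  return 1;
}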