CMSC216 HW12: Binary I/O, Memory Mapping and pmap
- Due: 11:59pm Mon 09-Dec-2024
- Approximately 0.83% of total grade
- Homework and Quizzes are open resource/open collaboration. You must submit your own work but you may freely discuss HW topics with other members of the class.
CODE DISTRIBUTION: hw12-code.zip
CHANGELOG: Empty
1 Rationale
Files are often stored in "binary format" for efficiency of storage
and access. Unlike more familiar formatted text, these formats require
binary file I/O to manipulate them, frequently through the low-level
Unix read() / write() calls. They also often require jumping to
different positions in the file, which can be done via the lseek()
system call. These are explored in this HW.
A viable alternative to file I/O is to make use of memory mapped files
through mmap(). This utilizes a system call to expose files as a
pointer into operating system managed space which holds parts of the
file in main memory. While equivalent in power to standard I/O, mmap()
avoids the need for intermediate buffers and allows pointer arithmetic
to be used to locate and alter the file.
On modern computing systems, virtual memory creates the illusion that
every program has a linear address space from 0 to some large
address. Mostly this happens behind the scenes and is managed by the
operating system, but knowledge of the presence of virtual addresses
provides insight into many aspects of practical programming. One can
inspect some of the OS information on the virtual address space of a
program using utilities such as pmap.
Associated Reading / Preparation
Bryant and O'Hallaron Ch 10 covers basic I/O functions like read() /
write() as well as lseek(). These functions work equally well for
text and binary data.
Bryant and O'Hallaron Ch 9 on Virtual Memory is informative for its
coverage of virtual memory in general. The mmap() function is discussed
in section 9.8.4. The overview of virtual memory is useful to
understand the output of pmap.
Grading Policy
Credit for this HW is earned by taking the associated HW Quiz which is
linked under Gradescope. The quiz will ask questions similar to those
in the QUESTIONS.txt file, and those who complete all answers in
QUESTIONS.txt should have no trouble with the quiz.
See the full policies in the course syllabus.
2 Codepack
The codepack for the HW contains the following files:
File | Description |
---|---|
QUESTIONS.txt | Questions to answer |
memory-parts/ | Directory for Problem 1 |
memory-parts/Makefile | Makefile to build programs for the HW |
memory-parts/memory_parts.c | Problem 1 program to analyze |
memory-parts/gettysburg.txt | Problem 1 data file |
binfiles-mmap/ | Directory for Problems 2-3 |
binfiles-mmap/Makefile | Makefile to build Problem 2-3 programs |
binfiles-mmap/department.h | Header file for programs |
binfiles-mmap/make_dept_directory.c | Problem 2-3 program to create data file |
binfiles-mmap/cse_depts.dat.bk | Backup of data file created in Problem 2-3 |
binfiles-mmap/print_department_read.c | Problem 2 program to analyze |
binfiles-mmap/print_department_mmap.c | Problem 3 program to analyze |
3 Questions
Analyze the files in the provided codepack and answer the questions
given in QUESTIONS.txt.
                           _________________

                            HW 12 QUESTIONS
                           _________________


Write your answers to the questions below directly in this text file to
prepare for the associated quiz. Credit for the HW is earned by
completing the associated online quiz on Gradescope.


PROBLEM 1: Virtual Memory and pmap
==================================

  Code for this problem is in the `memory-parts' subdirectory.


(A) memory_parts memory areas
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  Examine the source code for the provided
  `memory-parts/memory_parts.c' program. Identify what region of program
  memory you expect the following variables to be allocated into:
  - global_arr[]
  - stack_arr[]
  - heap_arr


(B) Running memory_parts and pmap
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  Compile the `memory_parts' program using the provided Makefile.
  ,----
  | > make memory_parts
  `----
  Run the program and note that it prints several pieces of information
  - The addresses of several of the variables allocated
  - Its Process ID (PID) which is a unique number used to identify the
    running program. This is an integer.
  For example, the output might be
  ,----
  | > ./memory_parts
  | 0x5605a7c271e9 : main()
  | 0x5605a7c2a0c0 : global_arr
  | 0x7ffe5ff7d600 : stack_arr
  | 0x5605a92442a0 : heap_arr
  | 0x7f1fa7303000 : mmap'd file
  | 0x600000000000 : mmap'd block1
  | 0x600000001000 : mmap'd block2
  | my pid is 8406
  | press any key to continue
  `----
  so the program's PID is 8406.

  The program will also stop at this point until a key is pressed. DO
  NOT PRESS A KEY YET.

  Open another terminal and type the following command in that new
  terminal.

  ,----
  | > pmap THE-PID-NUMBER-THAT-WAS-PRINTED-EARLIER
  `----

  Paste the output of pmap below.


(C) Program Addresses vs Mapped Addresses
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  pmap prints out the virtual address space table for the program. The
  leftmost column is a virtual address mapped by the OS for the program
  to some physical location. The next column is the size of the area of
  memory associated with that starting address. The 3rd column contains
  the permissions the program has for the memory area: r for read, w for
  write, x for execute. The final column contains any identifying
  information about the memory area that pmap can discern.

  Compare the addresses of variables and functions from the paused
  program to the output. Try to determine the virtual address space in
  which each variable resides and what region of program memory that
  virtual address must belong to (stack, heap, globals, text). In some
  cases, the identifying information provided by pmap may make this
  obvious.


(D) Min Size of Mapped Areas
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  The minimum size of any virtual area of memory appears to be 4K. Why
  is this the case?


(E) Additional Observations
~~~~~~~~~~~~~~~~~~~~~~~~~~~

  Notice that in addition to the "normal" variables that are mapped,
  there is also an entry for the mmap()'d file 'gettysburg.txt' in the
  virtual address table. The mmap() function is explored in the next
  problem but note its calling sequence which involves use of a couple
  of system calls:
  1. `open()' which is a low level file opening call which returns a
     numeric file descriptor.
  2. `fstat()' which obtains information such as size for an open file
     based on its numeric file descriptor. The `stat() / fstat()' system
     calls are used to ask the Unix operating system for information
     about files such as their size, modification times, and access
     permissions. This system call is studied more in Operating System
     courses.

  Finally there are additional calls to `mmap()' which allocate memory
  to the program at a specific virtual address. Similar code to this is
  often used to allocate and expand the heap area of memory for programs
  in implementations of `malloc()'.


PROBLEM 2: Binary File Format w/ Read
=====================================

(A)
~~~

  Compile all programs in the directory `binfiles-mmap/' with the
  provided `Makefile'. Run the command
  ,----
  | ./make_dept_directory cse_depts.dat
  `----
  to create the `cse_depts.dat' binary file. Examine the source code for
  this program along with the header `department.h'.
  - What system calls are used in `make_dept_directory.c' to create this
    file?
  - How is the `sizeof()' operator used to simplify some of the
    computations in `make_dept_directory.c'?
  - What data is in `cse_depts.dat' and how is it ordered?


(B)
~~~

  Run the `print_department_read' program which takes a binary data file
  and a department code to print. Show a few examples of running this
  program with valid command line arguments. Include demo runs that
  - Use the `cse_depts.dat' file with known and unknown department codes
  - Use a file other than `cse_depts.dat'


(C)
~~~

  Study the source code for `print_department_read' and describe how it
  initially prints the table of offsets shown below.
  ,----
  | Dept Name: CS Offset: 104
  | Dept Name: EE Offset: 2152
  | Dept Name: IT Offset: 3688
  `----
  What specific sequence of calls leads to this information?


(D)
~~~

  What system call is used to skip immediately to the location in the
  file where desired contacts are located? What arguments does this
  system call take? Consult the manual entry for this function to find
  out how else it can be used.


PROBLEM 3: mmap() and binary files
==================================

  An alternative to using standard I/O functions is "memory mapped"
  files through the system call `mmap()'. The program
  `print_department_mmap.c' provides the same functionality as the
  previous `print_department_read.c' but uses a different mechanism.


(A)
~~~

  Early in `print_department_mmap.c' an `open()' call is used as in the
  previous program but it is followed shortly by a call to `mmap()' in
  the lines
  ,----
  | char *file_bytes =
  |   mmap(NULL, size, PROT_READ, MAP_SHARED,
  |        fd, 0);
  `----
  Look up reference documentation on `mmap()' and describe some of the
  arguments to it including the `NULL' and `size' arguments. Also
  describe its return value.


(B)
~~~

  The initial setup of the program uses `mmap()' to assign a pointer to
  variable `char *file_bytes'. This pointer will refer directly to the
  bytes of the binary file.

  Examine the lines
  ,----
  | ////////////////////////////////////////////////////////////////////////////////
  | // CHECK the file_header_t struct for integrity, size of department array
  | file_header_t *header = (file_header_t *) file_bytes; // binary header struct is first thing in the file
  `----

  Explain what is happening here: what value will the variable `header'
  get and how is it used in subsequent lines?


(C)
~~~

  After finishing with the file header, the next section of the program
  begins with the following.
  ,----
  | ////////////////////////////////////////////////////////////////////////////////
  | // SEARCH the array of department offsets for the department named
  | // on the command line
  |
  | dept_offset_t *offsets =           // after file header, array of dept_offset_t structures
  |   (dept_offset_t *) (file_bytes + sizeof(file_header_t));
  |
  `----

  Explain what value the `offsets' variable is assigned and how it is
  used in the remainder of the SEARCH section.


(D)
~~~

  The final phase of the program begins below
  ,----
  | ////////////////////////////////////////////////////////////////////////////////
  | // PRINT out all personnel in the specified department
  | ...
  | contact_t *dept_contacts = (contact_t *) (file_bytes + offset);
  `----
  Describe what value `dept_contacts' is assigned and how the final
  phase uses it.