Section 5: C Profiling for Performance and Memory Usage¶
In this discussion section, we will explore how to measure the performance of a C evaluation program and how to deal with dynamic memory issues.
1. Logging Into ecelinux with VS Code¶
Follow the same process as in the last section. Find a free workstation and log into the workstation using your NetID and standard NetID password. Then complete the following steps (described in more detail in the last section):
- Start VS Code
- Use View > Command Palette to execute Remote-SSH: Connect Current Window to Host...
- Enter netid@ecelinux.ece.cornell.edu
- Use View > Explorer to open folder on ecelinux
- Use View > Terminal to open terminal on ecelinux
Now clone the GitHub repo we will be using in this section using the following commands:
    % source setup-ece2400.sh
    % mkdir -p ${HOME}/ece2400
    % cd ${HOME}/ece2400
    % git clone git@github.com:cornell-ece2400/ece2400-sec05 sec05
    % cd sec05
    % tree
The directory includes the following files:
- contains-word-eval.c : source and main for the contains_word function
- overview.txt : text file used as input to the evaluation program
2. Warmup: Implement String Equal¶
Take a look at the contains-word-eval.c source file. You will see five functions:
- strlen_ : return the length of a string
- strdup_ : duplicate a string on the heap
- streq_v1 : check if two strings are identical
- read_words : read words from a file
- contains_word : search an array of words for a given word
The strlen_
function was discussed in lecture. The strdup_
function
is similar in spirit to the arraydup
function which is part of a zyBook
coding quiz. Both of these functions include a trailing underscore to
avoid a namespace conflict with the corresponding function in the
standard C library. The read_words
function uses the strdup_
function
to create an array of words which serves as the dataset for the
evaluation. The contains_word
function iterates through every word in
this array to see if it can find a match using the streq_v1
function.
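For concreteness, a linear-search implementation of contains_word might look roughly like the sketch below; it assumes the streq_v1 function you are about to implement, and the actual code in contains-word-eval.c may differ in detail.

    // Rough sketch of a linear search over the words array using streq_v1;
    // returns 1 if word is found, 0 otherwise (actual repo code may differ)
    int contains_word( char** words, int num_words, const char* word )
    {
      for ( int i = 0; i < num_words; i++ ) {
        if ( streq_v1( words[i], word ) )
          return 1;
      }
      return 0;
    }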
Implement the streq_v1
function with the following prototype:
    int streq_v1( const char* str0, const char* str1 );
The function should check if the two given strings (str0, str1) are equal. The function should return 1 if all characters match and 0 if any characters do not match. Use the following algorithm for version v1 of this function:
- Step 1: Use strlen_ to determine the length of str0
- Step 2: Use strlen_ to determine the length of str1
- Step 3: If the lengths are not equal, then the strings do not match, so return 0
- Step 4: Iterate through both strings with a for loop to check if each character matches
- Step 5: Return 0 if any characters do not match, otherwise return 1
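For reference, one possible sketch of streq_v1 following these steps is shown below; it assumes the strlen_ helper from the source file, and your own implementation may differ.

    // Possible streq_v1 sketch: length check first, then character-by-character
    // comparison (assumes strlen_ from contains-word-eval.c)
    int streq_v1( const char* str0, const char* str1 )
    {
      // Steps 1-3: if the lengths differ, the strings cannot be equal
      int len0 = strlen_( str0 );
      int len1 = strlen_( str1 );
      if ( len0 != len1 )
        return 0;

      // Steps 4-5: compare each character
      for ( int i = 0; i < len0; i++ ) {
        if ( str0[i] != str1[i] )
          return 0;
      }

      return 1;
    }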
3. Measuring Execution Time¶
Now assume we want to quantitatively measure the execution time of the
contains_word
function. To do this, we can use the time functions
provided by the C standard library in the sys/time.h
header file. We
can use the gettimeofday
function to get the current time.
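For reference, the relevant declarations in the sys/time.h header look roughly like this (the exact member types may vary slightly across systems):

    struct timeval {
      time_t      tv_sec;   // seconds
      suseconds_t tv_usec;  // microseconds
    };

    int gettimeofday( struct timeval* tv, struct timezone* tz );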
This function takes as a parameter a pointer to a struct of type struct timeval. It uses call-by-pointer semantics to update this struct with the current time, with a precision of tens of microseconds. The struct has two fields: tv_sec is the number of whole seconds since January 1, 1970, and tv_usec is the number of additional microseconds. We can use gettimeofday like this to quantitatively measure how long it takes to run an experiment.
    // Start tracking time
    struct timeval start;
    gettimeofday( &start, NULL );

    // Run the experiment
    // ...

    // Stop tracking time
    struct timeval end;
    gettimeofday( &end, NULL );

    // Calculate elapsed time
    double elapsed = ( end.tv_sec - start.tv_sec )
                   + ( ( end.tv_usec - start.tv_usec ) / 1000000.0 );

    printf( "Elapsed time is %fs\n", elapsed );
Compile the evaluation program and measure the execution time:
    % cd ${HOME}/ece2400/sec05
    % gcc -Wall -o contains-word-eval contains-word-eval.c
    % ./contains-word-eval
You will notice that the execution time is very short ... so short that
it is too fast for the resolution of the timer. We need to run a subtrial
in a loop many times to make sure we have a long enough experiment that
we can get a reasonably accurate time measurement. If you look in the
main
function you will see a loop like this:
    for ( int j = 0; j < nsubtrials; j++ ) {
      result = contains_word( words, num_words, "flower" );
    }
Currently nsubtrials
is set to 1. Try increasing this to 1e3 (i.e.,
1000), 1e4 (i.e., 10000), and then 1e5 (i.e., 100000). Rerun the
experiment each time. Confirm that 1e5 provides a long enough experiment
to get a reasonably accurate time measurement.
Now try running the evaluation program at least five times with
nsubtrials
set to 1e5 and write down the results for each trial. Is the
execution time always the same? If not, why not?
We need to do several trials and then take the average to ensure we get a good estimate of the execution time. We want to restructure our evaluation program to look like this:
    // Setup the experiment
    int ntrials    = 5;
    int nsubtrials = 1e5;
    double elapsed_avg = 0.0;

    for ( int i = 0; i < ntrials; i++ ) {

      // Start tracking time
      struct timeval start;
      gettimeofday( &start, NULL );

      // Run the experiment
      for ( int j = 0; j < nsubtrials; j++ ) {
        // ...
      }

      // Stop tracking time
      struct timeval end;
      gettimeofday( &end, NULL );

      // Calculate elapsed time
      double elapsed = ( end.tv_sec - start.tv_sec )
                     + ( ( end.tv_usec - start.tv_usec ) / 1000000.0 );

      elapsed_avg += elapsed;

      printf( "Elapsed time for trial %d is %fs\n", i, elapsed );
    }

    // Calculate average elapsed time per trial
    elapsed_avg = elapsed_avg / ntrials;

    printf( "Average elapsed time is %fs\n", elapsed_avg );
Set ntrials to 5 and nsubtrials to 1e5. Recompile and rerun the experiment. Now experiment with turning on optimizations with the -O3 command line option like this:
    % cd ${HOME}/ece2400/sec05
    % gcc -Wall -O3 -o contains-word-eval contains-word-eval.c
    % ./contains-word-eval
Compare the performance of this experiment with and without optimizations.
4. Profiling Execution Time¶
The previous section enables us to measure the overall execution time, but we might also be interested in knowing which functions are taking the most time. This can help us focus on the important hotspots for optimization. We can use profiling to do this kind of performance analysis. We will be using the perf tool for this profiling.
It is important to turn debugging on when doing profiling so we can track
the function call stacks. Combining the debugging (-g
) with
optimizations (-O3
) will not slow down your program, but may
occasionally produce some confusing situations when debugging or
profiling since some of the original program might have been optimized
away.
Let's start by recompiling our program with support for debugging:
    % cd ${HOME}/ece2400/sec05
    % gcc -Wall -g -o contains-word-eval contains-word-eval.c
We can now use perf
to profile our program:
    % cd ${HOME}/ece2400/sec05
    % perf record --call-graph dwarf ./contains-word-eval
Do not worry if you see a warning about lost chunks! As long as only
a few chunks are lost, then you will still have pretty accurate
performance profiling information. perf
works by interrupts your
program several times throughout the execution to take a "sample". During
each sample, perf
will stop your program and examine the stack to see
the list of function stack frames that are currently allocated on the
stack. perf
will keep counters of every call stack it sees. After
perf
is done it will output the total number of samples. If this number is
too small then the profiling information may not be very accurate. We can
display this information like this:
    % cd ${HOME}/ece2400/sec05
    % perf script report stackcollapse
    contains-word-e;contains_word;streq_v1 22
    contains-word-e;main 2
    contains-word-e;main;contains_word 1713
    contains-word-e;main;contains_word;streq_v1 1614
    contains-word-e;main;contains_word;streq_v1;strlen_ 14400
    contains-word-e;main;read_words;strdup_;__GI___libc_malloc;_int_malloc 1
    contains-word-e;streq_v1;strlen_ 1
There is one line for every unique call stack which was sampled. The line
shows the name of the call stack and how many of the samples were taken
with that call stack. In the above example, 1713 samples were taken with
the call stack: main
calling contains_word
. perf
can also provide a
more interactive view to explore this profile information using the perf
report command, but to be honest, I find it incredibly confusing. In this
course, we will instead make use of a really wonderful visualization
called a flame graph. You can generate a flame graph like this:
    % cd ${HOME}/ece2400/sec05
    % perf script report stackcollapse | flamegraph.pl > graph.svg
To view the generated graph.svg
, find this file using the explorer in
VS Code. Then right click and choose Download. Then open graph.svg
on
the local workstation using Google Chrome.
In a flame graph, the x-axis shows the stack profile population, sorted alphabetically, and the y-axis shows the stack depth, counting from zero at the bottom. Each rectangle represents a stack frame. The wider a frame is, the more often it was present in the stacks. The top edge shows which function was executing when the sample was taken, and beneath it is its ancestry. The colors are usually not significant; they are picked randomly to differentiate frames. See this great article for more on flame graphs:
If you study the flame graph you should be able to observe that almost
all of the time is spent in contains_word
. Within contains_word
,
almost all of the time is spent in streq_v1
. And within streq_v1
,
almost all of the time is spent in strlen_
. We will learn more about
optimization techniques later in the semester, but at a high level, one
of the best ways to improve performance is to avoid calling an
unnecessary function. We can optimize our streq
function to completely
avoid calling strlen_
like this:
    int streq_v2( const char* str0, const char* str1 )
    {
      int i = 0;
      int j = 0;

      // Iterate until reach first null terminator
      while ( (str0[i] != '\0') && (str1[j] != '\0') ) {
        if ( str0[i] != str1[j] )
          return 0;
        i++;
        j++;
      }

      // Final check to make sure final characters are equal
      if ( str0[i] != str1[j] )
        return 0;

      return 1;
    }
Copy this version into your contains-word-eval.c
file and change the
contains_word
function to call version v2 instead of version v1.
Recompile and rerun the experiment. Calculate the speedup using this equation:

    speedup = T_{unopt} / T_{opt}

where T_{unopt} is the unoptimized execution time and T_{opt} is the optimized execution time. For example, if the unoptimized version takes 2.0s and the optimized version takes 0.5s, then the speedup is 4x.
5. Profiling Memory Usage¶
The previous sections focused on performance, but we also care about
memory usage, particularly heap space usage. We can track how many bytes
are allocated in each call to malloc
and how many bytes are deallocated
in each call to free
to measure heap space usage. Take a close look at
the strdup_
function and uncomment this line:
    g_heap_space_usage += len;
This line is incrementing a global counter that keeps track of the number
of bytes allocated on the heap. Also uncomment this line in main
:
    printf( "Heap space usage is %d bytes\n", g_heap_space_bytes );
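The idea behind this instrumentation is simply to add the size of every heap allocation to a global counter. A minimal sketch of the pattern, using a hypothetical tracked_malloc helper rather than the exact code in the repo, looks like this:

    #include <stdlib.h>

    // Global counter of bytes allocated on the heap
    int g_heap_space_usage = 0;

    // Hypothetical helper: allocate size bytes and record them in the counter
    void* tracked_malloc( size_t size )
    {
      g_heap_space_usage += size;
      return malloc( size );
    }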
Then recompile and rerun the evaluation program. The evaluation program should now print out the heap space usage in bytes. In this course, we care about maximum stack space usage (harder to measure), the heap space usage of a data structure itself, and the auxiliary heap space usage of a specific function. Auxiliary heap space usage is the maximum additional heap space usage of a function beyond the heap space usage of any parameters. For example, a function that allocates a temporary array of 100 ints on the heap has an auxiliary heap space usage of 100 * sizeof(int) bytes, even if its input data structure also lives on the heap.
6. Debugging Memory Usage¶
Once we start using pointers and dynamic memory allocation in our C programs, we can encounter a whole new class of bugs related to memory accesses. Valgrind is a tool specifically designed for memory debugging and memory leak detection. You can run Valgrind like this:
    % cd ${HOME}/ece2400/sec05
    % gcc -Wall -g -o contains-word-eval contains-word-eval.c
    % valgrind --trace-children=yes --leak-check=full \
        --error-exitcode=1 --undef-value-errors=no ./contains-word-eval
Those are quite a few command line options to Valgrind, so we have created an ece2400-valgrind script. This script is just a simple wrapper which calls Valgrind with the right options. Keep in mind that Valgrind can slow your program down by a factor of 10.
    % cd ${HOME}/ece2400/sec05
    % gcc -Wall -g -o contains-word-eval contains-word-eval.c
    % ece2400-valgrind ./contains-word-eval
You will notice that Valgrind is reporting a read/write to an illegal memory location. It will identify the line in the code where the invalid read/write occurs:
    ==51022== Invalid write of size 1
    ==51022==    at 0x40123D: strdup_ (contains-word-eval.c:55)
    ==51022==    by 0x4013C2: read_words (contains-word-eval.c:113)
    ==51022==    by 0x4014B4: main (contains-word-eval.c:145)
If you look in the strdup_ function you will see this:
    new_str[len] = '\0';
So Valgrind is telling us that this write is illegal. We are writing to
an element of the array that is out of bounds, which will cause undefined
behavior. Can you figure out the bug in strdup_
? Try fixing it and
rerun Valgrind to confirm that these illegal read/writes are now gone.
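The general rule at play here is worth remembering: a string of length len needs len + 1 bytes on the heap so that the null terminator at index len stays within bounds. Here is a small illustration using a hypothetical make_empty_string helper (not code from the repo):

    #include <stdlib.h>

    // Hypothetical helper: allocate a string of length len filled with spaces.
    // Note the extra byte allocated for the null terminator.
    char* make_empty_string( int len )
    {
      char* buf = malloc( ( len + 1 ) * sizeof(char) );  // +1 for '\0'
      for ( int i = 0; i < len; i++ )
        buf[i] = ' ';
      buf[len] = '\0';  // legal: index len is the last allocated byte
      return buf;
    }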
You will notice that Valgrind is reporting a memory leak. It will identify the line in the code where the memory was allocated and then never deallocated:
    ==52659== 2,313 bytes in 307 blocks are definitely lost in loss record 2 of 2
    ==52659==    at 0x4C29F73: malloc (vg_replace_malloc.c:309)
    ==52659==    by 0x4011FA: strdup_ (contains-word-eval.c:35)
    ==52659==    by 0x4013C5: read_words (contains-word-eval.c:113)
    ==52659==    by 0x4014B7: main (contains-word-eval.c:145)
If you look in the read_words
function you will see this:
    words[i] = strdup_( strdup_(buf) );
Valgrind is telling us that the memory allocated by malloc in strdup_ called from here is somehow never being properly deallocated. Note that
the program is calling free
for each word in the words
array at the
bottom of the main
function:
    for ( int i = 0; i < num_words; i++ )
      free( words[i] );
So the issue is more subtle than just needing to call free
at the end
of main
. Can you figure out where the memory leak is coming from in
read_words
? Try fixing it and rerun Valgrind to confirm that the memory
leak is gone.
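A useful rule of thumb when hunting this kind of leak: every allocation should have exactly one owner that eventually frees it, and a temporary copy that is created but never stored anywhere can never be freed. A tiny, self-contained illustration of the balanced pattern (hypothetical example, not code from the repo):

    #include <stdlib.h>
    #include <string.h>

    int main( void )
    {
      // One allocation ...
      char* copy = malloc( strlen( "hello" ) + 1 );
      strcpy( copy, "hello" );

      // ... use copy ...

      // ... paired with exactly one matching free
      free( copy );
      return 0;
    }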
7. Experimenting with Profiling for PA2¶
Let's experiment with profiling for the second programming assignment using what we have learned in this discussion section. You can use the following steps to update your PA repo to make sure you have received the latest code.
    % mkdir -p ${HOME}/ece2400
    % cd ${HOME}/ece2400/netid
    % git pull
    % tree
where netid
is your NetID. If you receive a warning from git, you might
need to commit your local changes before pulling the latest code.
Take a look at the evaluation programs for PA2 in pa2-dstruct/eval
to
see how they use the evaluation strategy described in this discussion
section. Instead of using gettimeofday
directly, we use the
ece2400_timer_reset
and ece2400_timer_get_elapsed
functions in
ece2400-stdlib.h
. So running an experiment now looks like this:
    // Start tracking time
    ece2400_timer_reset();

    // Run the experiment
    // ...

    // Stop tracking time
    double elapsed = ece2400_timer_get_elapsed();

    printf( "Elapsed time is %fs\n", elapsed );
Remember to always use an evaluation build when running experiments. An evaluation build will turn off all debugging support and turn on all optimizations. Try an evaluation build for PA2:
    % cd ${HOME}/ece2400/netid/pa2-dstruct
    % mkdir build-eval
    % cd build-eval
    % cmake -DCMAKE_BUILD_TYPE=eval ..
    % make eval
    % ./list-push-back-eval
Manually tracking how many bytes are allocated and then deallocated on the heap can be tedious and error prone. So we provide ece2400_malloc and ece2400_free functions in ece2400-stdlib.h which automatically handle tracking how much space has been allocated on the heap. So for example, measuring the auxiliary heap space usage in a call to function foo would look like this:
    // Start tracking auxiliary heap space usage
    ece2400_mem_reset();

    // Call function foo which might allocate space on the heap
    foo();

    // Stop tracking auxiliary heap space usage
    int heap_usage = ece2400_mem_get_aux_usage();

    printf( "Aux heap space usage is %d bytes\n", heap_usage );
Using Valgrind is absolutely critical for detecting subtle memory usage bugs, especially on the heap. We have set up CMake with a target to automate running all of your tests using Valgrind. Here is an example of using Valgrind:
    % cd ${HOME}/ece2400/netid/pa2-dstruct
    % mkdir build
    % cd build
    % cmake ..
    % make check
    % make memcheck
Your PAs will now need to always pass both check
and memcheck
.
8. To-Do On Your Own¶
If you have time, try to further optimize the contains_word
function.
One possible idea is to precompute the length of each word in the words
array before we start timing the experiment. Store these lengths in a new
array of integers which is then passed into contains_word
. Calculate
the length of the given word once. Then you can optimize contains_word
to do a quick length check such that the full character-by-character
check is only used when two words are the same length. Here is an example
implementation of contains_word
that uses this optimization:
    int contains_word( char** words, int* words_len, int num_words,
                       const char* word )
    {
      int word_len = strlen_(word);

      for ( int i = 0; i < num_words; i++ ) {
        if ( words_len[i] == word_len ) {
          if ( streq_v2( words[i], word ) )
            return 1;
        }
      }

      return 0;
    }
This optimization is reasonable if our workload does many matches on a
single array of words, so we can amortize the time spent on calculating
the length of each word over many calls to contains_word
. What is the
speedup of using this optimization compared to the previous best
performing version?