Section 5: C Profiling for Performance and Memory Usage

In this discussion section, we will explore how to measure the performance of a C evaluation program and how to deal with dynamic memory issues.

1. Logging Into ecelinux with VS Code

Follow the same process as in the last section. Find a free workstation and log into the workstation using your NetID and standard NetID password. Then complete the following steps (described in more detail in the last section):

  • Start VS Code
  • Use View > Command Palette to execute Remote-SSH: Connect Current Window to Host...
  • Enter netid@ecelinux.ece.cornell.edu
  • Use View > Explorer to open folder on ecelinux
  • Use View > Terminal to open terminal on ecelinux

Now clone the GitHub repo we will be using in this section using the following commands:

% source setup-ece2400.sh
% mkdir -p ${HOME}/ece2400
% cd ${HOME}/ece2400
% git clone git@github.com:cornell-ece2400/ece2400-sec05 sec05
% cd sec05
% tree

The directory includes the following files:

  • contains-word-eval.c : source and main for contains_word function
  • overview.txt : text file used as input to evaluation program

2. Warmup: Implement String Equal

Take a look at the contains-word-eval.c source file. You will see five functions:

  • strlen_ : return length of string
  • strdup_ : duplicate string on the heap
  • streq_v1 : check if two strings are identical
  • read_words : read words from a file
  • contains_word : search array of words for a given word

The strlen_ function was discussed in lecture. The strdup_ function is similar in spirit to the arraydup function which is part of a zyBook coding quiz. Both of these functions include a trailing underscore to avoid a namespace conflict with the corresponding function in the standard C library. The read_words function uses the strdup_ function to create an array of words which serves as the dataset for the evaluation. The contains_word function iterates through every word in this array to see if it can find a match using the streq_v1 function.
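
For reference, here is a minimal sketch of what strlen_ might look like; the actual implementation in contains-word-eval.c may differ in the details:

// Count characters until we reach the null terminator. The trailing
// underscore avoids a conflict with strlen in the standard C library.
int strlen_( const char* str )
{
  int len = 0;
  while ( str[len] != '\0' )
    len++;
  return len;
}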

Implement the streq_v1 function with the following prototype:

int streq_v1( const char* str0, const char* str1 );

The function should check if the two given strings (str0, str1) are equal. The function should return 1 if all characters match and 0 if any characters do not match. Use the following algorithm for version v1 of this function (a sketch following these steps appears after the list):

  • Step 1: Use strlen_ to determine the length of str0
  • Step 2: Use strlen_ to determine the length of str1
  • Step 3: If the lengths are not equal, then the strings do not match; return 0
  • Step 4: Iterate through both strings with a for loop to check if each character matches
  • Step 5: Return 0 if any characters do not match, otherwise return 1
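
Here is one possible sketch that follows these steps; your implementation may differ in the details:

int streq_v1( const char* str0, const char* str1 )
{
  // Steps 1-2: determine the length of each string
  int len0 = strlen_( str0 );
  int len1 = strlen_( str1 );

  // Step 3: strings of different lengths cannot match
  if ( len0 != len1 )
    return 0;

  // Steps 4-5: check that every character matches
  for ( int i = 0; i < len0; i++ ) {
    if ( str0[i] != str1[i] )
      return 0;
  }

  return 1;
}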

3. Measuring Execution Time

Now assume we want to quantitatively measure the execution time of the contains_word function. To do this, we can use the gettimeofday function declared in the sys/time.h header file to get the current time:
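
// Relevant declarations from the sys/time.h header (for reference):

struct timeval {
  time_t      tv_sec;   // whole seconds
  suseconds_t tv_usec;  // additional microseconds
};

int gettimeofday( struct timeval* tv, struct timezone* tz );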

This function takes as a parameter a pointer to a struct of type struct timeval. It uses call-by-pointer semantics to update this struct with the current time with a precision of 10s of microseconds. The struct has two fields: tv_sec is the number of whole seconds since January 1, 1970 (the Unix epoch), and tv_usec is the number of additional microseconds beyond tv_sec. We can use gettimeofday like this to quantitatively measure how long it takes to run an experiment:

// Start tracking time

struct timeval start;
gettimeofday( &start, NULL );

// Run the experiment

// ...

// Stop tracking time

struct timeval end;
gettimeofday( &end, NULL );

// Calculate elapsed time

double elapsed = ( end.tv_sec - start.tv_sec ) +
                 ( ( end.tv_usec - start.tv_usec ) / 1000000.0 );

printf( "Elapsed time is %fs\n", elapsed );

Compile the evaluation program and measure the execution time:

% cd ${HOME}/ece2400/sec05
% gcc -Wall -o contains-word-eval contains-word-eval.c
% ./contains-word-eval

You will notice that the execution time is very short ... so short that it is below the resolution of the timer. We need to run a subtrial in a loop many times to ensure the experiment is long enough to get a reasonably accurate time measurement. If you look in the main function you will see a loop like this:

for ( int j = 0; j < nsubtrials; j++ ) {
  result = contains_word( words, num_words, "flower" );
}

Currently nsubtrials is set to 1. Try increasing this to 1e3 (i.e., 1000), 1e4 (i.e., 10000), and then 1e5 (i.e., 100000). Rerun the experiment each time. Confirm that 1e5 provides a long enough experiment to get a reasonably accurate time measurement.

Now try running the evaluation program at least five times with nsubtrials set to 1e5 and write down the results for each trial. Is the execution time always the same? If not, why not?

We need to run several trials and take the average to get a good estimate of the execution time. We want to restructure our evaluation program to look like this:

// Setup the experiment

int    ntrials     = 5;
int    nsubtrials  = 1e5;
double elapsed_avg = 0.0;

for ( int i = 0; i < ntrials; i++ ) {

  // Start tracking time

  struct timeval start;
  gettimeofday( &start, NULL );

  // Run the experiment

  for ( int j = 0; j < nsubtrials; j++ ) {
    // ...
  }

  // Stop tracking time

  struct timeval end;
  gettimeofday( &end, NULL );

  // Calculate elapsed time

  double elapsed = ( end.tv_sec - start.tv_sec ) +
                   ( ( end.tv_usec - start.tv_usec ) / 1000000.0 );

  elapsed_avg += elapsed;

  printf( "Elapsed time for trial %d is %fs\n", i, elapsed );

}

// Calculate average elapsed time per trial

elapsed_avg = elapsed_avg / ntrials;

printf( "Average elapsed time is %fs\n", elapsed_avg );

Set ntrials to 5 and nsubtrials to 1e5. Recompile and rerun the experiment. Now experiment with turning on optimizations with the -O3 command line option like this:

% cd ${HOME}/ece2400/sec05
% gcc -Wall -O3 -o contains-word-eval contains-word-eval.c
% ./contains-word-eval

Compare the performance of this experiment with and without optimizations.

4. Profiling Execution Time

The previous section enables us to measure the overall execution time, but we might also want to know which functions are taking the most time. This can help us focus on the important hotspots for optimization. We can use profiling to do this kind of performance analysis; in this course we will use the perf tool. It is important to turn debugging on when profiling so that perf can track the function call stacks. Combining debugging (-g) with optimizations (-O3) will not slow down your program, but it may occasionally produce some confusing situations when debugging or profiling, since parts of the original program may have been optimized away.
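
For example, you can compile with both debugging and optimizations like this:

% gcc -Wall -g -O3 -o contains-word-eval contains-word-eval.c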

Let's start by recompiling our program with support for debugging:

% cd ${HOME}/ece2400/sec05
% gcc -Wall -g -o contains-word-eval contains-word-eval.c

We can now use perf to profile our program:

% cd ${HOME}/ece2400/sec05
% perf record --call-graph dwarf ./contains-word-eval

Do not worry if you see a warning about lost chunks! As long as only a few chunks are lost, you will still have reasonably accurate profiling information. perf works by interrupting your program several times throughout its execution to take a "sample". During each sample, perf stops your program and examines the stack to see the list of function stack frames that are currently allocated on the stack. perf keeps a counter for every unique call stack it sees. After perf is done, it will output the total number of samples. If this number is too small, then the profiling information may not be very accurate. We can display this information like this:

% cd ${HOME}/ece2400/sec05
% perf script report stackcollapse
contains-word-e;contains_word;streq_v1 22
contains-word-e;main 2
contains-word-e;main;contains_word 1713
contains-word-e;main;contains_word;streq_v1 1614
contains-word-e;main;contains_word;streq_v1;strlen_ 14400
contains-word-e;main;read_words;strdup_;__GI___libc_malloc;_int_malloc 1
contains-word-e;streq_v1;strlen_ 1

There is one line for every unique call stack which was sampled. Each line shows the call stack and how many samples were taken with that call stack. In the above example, 1713 samples were taken with the call stack of main calling contains_word. perf can also provide a more interactive view to explore this profile information using the perf report command, but to be honest, I find it incredibly confusing. In this course, we will instead make use of a really wonderful visualization called a flame graph. You can generate a flame graph like this:

% cd ${HOME}/ece2400/sec05
% perf script report stackcollapse | flamegraph.pl > graph.svg

To view the generated graph.svg, find the file in the VS Code explorer, right-click it, and choose Download. Then open graph.svg on the local workstation using Google Chrome.

In a flame graph, the x-axis shows the stack profile population, sorted alphabetically, and the y-axis shows the stack depth, counting from zero at the bottom. Each rectangle represents a stack frame. The wider a frame is, the more often it was present in the samples. The top edge shows which function was executing when the sample was taken, and beneath it is its ancestry. The colors are usually not significant; they are picked randomly to differentiate frames. See Brendan Gregg's article on flame graphs for more details.

If you study the flame graph, you should be able to observe that almost all of the time is spent in contains_word. Within contains_word, almost all of the time is spent in streq_v1. And within streq_v1, almost all of the time is spent in strlen_. We will learn more about optimization techniques later in the semester, but at a high level, one of the best ways to improve performance is to avoid unnecessary function calls. We can optimize our streq function to completely avoid calling strlen_ like this:

int streq_v2( const char* str0, const char* str1 )
{
  int i = 0;
  int j = 0;

  // Iterate until reach first null terminator
  while ( (str0[i] != '\0') && (str1[j] != '\0') ) {
    if ( str0[i] != str1[j] )
      return 0;
    i++;
    j++;
  }

  // Final check to make sure final characters are equal
  if ( str0[i] != str1[j] )
    return 0;

  return 1;
}

Copy this version into your contains-word-eval.c file and change the contains_word function to call version v2 instead of version v1. Recompile and rerun the experiment. Calculate the speedup using this equation:

speedup = T_unopt / T_opt

where T_unopt is the unoptimized execution time and T_opt is the optimized execution time. For example, if the unoptimized version takes 2.0s and the optimized version takes 0.5s, then the speedup is 4x.

5. Profiling Memory Usage

The previous sections focused on performance, but we also care about memory usage, particularly heap space usage. We can track how many bytes are allocated in each call to malloc and how many bytes are deallocated in each call to free to measure heap space usage. Take a close look at the strdup_ function and uncomment this line:

g_heap_space_usage += len;

This line is incrementing a global counter that keeps track of the number of bytes allocated on the heap. Also uncomment this line in main:

printf( "Heap space usage is %d bytes\n", g_heap_space_usage );

Then recompile and rerun the evaluation program. It should now print out the heap space usage in bytes. In this course, we care about maximum stack space usage (which is harder to measure), the heap space usage of a data structure itself, and the auxiliary heap space usage of a specific function. Auxiliary heap space usage is the maximum additional heap space a function uses beyond the heap space usage of any of its parameters.
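
As a small hypothetical illustration (this function is not part of the section code), consider a function that reverses a string using a temporary buffer on the heap. Its auxiliary heap space usage is the size of that buffer, even though the buffer is freed before the function returns:

#include <stdlib.h>

// Hypothetical example: the auxiliary heap space usage of this function
// is n bytes, since it allocates an n-byte scratch buffer beyond its
// parameters and frees it before returning.
void reverse_( char* str, int n )
{
  char* tmp = malloc( n * sizeof(char) );  // +n bytes of aux heap space

  for ( int i = 0; i < n; i++ )
    tmp[i] = str[n - 1 - i];

  for ( int i = 0; i < n; i++ )
    str[i] = tmp[i];

  free( tmp );                             // aux usage drops back to 0
}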

6. Debugging Memory Usage

Once we start using pointers and dynamic memory allocation in our C programs, we can encounter a whole new class of bugs related to memory accesses. Valgrind is a tool specifically designed for memory debugging and memory leak detection. You can run Valgrind like this:

% cd ${HOME}/ece2400/sec05
% gcc -Wall -g -o contains-word-eval contains-word-eval.c
% valgrind --trace-children=yes --leak-check=full \
    --error-exitcode=1 --undef-value-errors=no ./contains-word-eval

Those are quite a few command line options to pass to Valgrind, so we have created an ece2400-valgrind script. This script is just a simple wrapper which calls Valgrind with the right options. Keep in mind that Valgrind can slow your program down by roughly a factor of 10.
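
The wrapper might look something like the following sketch; the actual script may differ:

#!/bin/bash
# Sketch of a simple wrapper: forward all arguments to valgrind along
# with the options shown above.
exec valgrind --trace-children=yes --leak-check=full \
  --error-exitcode=1 --undef-value-errors=no "$@"

You can then run the wrapper like this: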

% cd ${HOME}/ece2400/sec05
% gcc -Wall -g -o contains-word-eval contains-word-eval.c
% ece2400-valgrind ./contains-word-eval

You will notice that Valgrind reports a read or write to an illegal memory location. It will identify the line in the code where the invalid read/write occurs:

==51022== Invalid write of size 1
==51022==    at 0x40123D: strdup_ (contains-word-eval.c:55)
==51022==    by 0x4013C2: read_words (contains-word-eval.c:113)
==51022==    by 0x4014B4: main (contains-word-eval.c:145)

If you look in the strdup_ function you will see this:

new_str[len] = '\0';

So Valgrind is telling us that this write is illegal. We are writing to an element that is out of bounds of the array, which causes undefined behavior. Can you figure out the bug in strdup_? Try fixing it, then rerun Valgrind to confirm that the illegal reads/writes are gone.

You will notice that Valgrind is reporting a memory leak. It will identify the line in the code where the memory was allocated and then never deallocated:

==52659== 2,313 bytes in 307 blocks are definitely lost in loss record 2 of 2
==52659==    at 0x4C29F73: malloc (vg_replace_malloc.c:309)
==52659==    by 0x4011FA: strdup_ (contains-word-eval.c:35)
==52659==    by 0x4013C5: read_words (contains-word-eval.c:113)
==52659==    by 0x4014B7: main (contains-word-eval.c:145)

If you look in the read_words function you will see this:

words[i] = strdup_( strdup_(buf) );

Valgrind is telling us that the memory allocated by malloc in strdup_ called from here is somehow never being properly deallocated. Note that the program does call free for each word in the words array at the bottom of the main function:

for ( int i = 0; i < num_words; i++ )
  free( words[i] );

So the issue is more subtle than just needing to call free at the end of main. Can you figure out where the memory leak is coming from in read_words? Try fixing it and rerun Valgrind to confirm that the memory leak is gone.

7. Experimenting with Profiling for PA2

Let's experiment with profiling for the second programming assignment using what we have learned in this discussion section. You can use the following steps to update your PA repo to make sure you have received the latest code.

% mkdir -p ${HOME}/ece2400
% cd ${HOME}/ece2400/netid
% git pull
% tree

where netid is your NetID. If you receive a warning from git, you might need to commit your local changes before pulling the latest code.

Take a look at the evaluation programs for PA2 in pa2-dstruct/eval to see how they use the evaluation strategy described in this discussion section. Instead of using gettimeofday directly, we use the ece2400_timer_reset and ece2400_timer_get_elapsed functions in ece2400-stdlib.h. So running an experiment now looks like this:

// Start tracking time
ece2400_timer_reset();

// Run the experiment

// ...

// Stop tracking time
double elapsed = ece2400_timer_get_elapsed();
printf( "Elapsed time is %fs\n", elapsed );

Remember to always use an evaluation build when running experiments. An evaluation build will turn off all debugging support and turn on all optimizations. Try an evaluation build for PA2:

% cd ${HOME}/ece2400/netid/pa2-dstruct
% mkdir build-eval
% cd build-eval
% cmake -DCMAKE_BUILD_TYPE=eval ..
% make eval
% ./list-push-back-eval

Manually tracking how many bytes are allocated and then deallocated on the heap can be tedious and error prone. So we provide ece2400_malloc and ece2400_free functions in ece2400-stdlib.h which automatically track how much space has been allocated on the heap. So, for example, measuring the auxiliary heap space usage of a call to function foo would look like this:

// Start tracking auxiliary heap space usage
ece2400_mem_reset();

// Call function foo which might allocate space on the heap
foo();

// Stop tracking auxiliary heap space usage
int heap_usage = ece2400_mem_get_aux_usage();
printf( "Aux heap space usage is %d bytes\n", heap_usage );

Using Valgrind is absolutely critical for detecting subtle memory usage bugs, especially on the heap. We have set up CMake with a target that automates running all of your tests under Valgrind. Here is an example of using Valgrind:

% cd ${HOME}/ece2400/netid/pa2-dstruct
% mkdir build
% cd build
% cmake ..
% make check
% make memcheck

Your PAs will now need to always pass both check and memcheck.

8. To-Do On Your Own

If you have time, try to further optimize the contains_word function. One possible idea is to precompute the length of each word in the words array before we start timing the experiment. Store these lengths in a new array of integers which is then passed into contains_word. Calculate the length of the given word once. Then you can optimize contains_word to do a quick length check such that the full character-by-character check is only used when two words are the same length. Here is an example implementation of contains_word that uses this optimization:

int contains_word( char** words, int* words_len,
                   int num_words, const char* word )
{
  int word_len = strlen_(word);

  for ( int i = 0; i < num_words; i++ ) {
    if ( words_len[i] == word_len ) {
      if ( streq_v2( words[i], word ) )
        return 1;
    }
  }

  return 0;
}

This optimization is reasonable if our workload performs many matches on a single array of words, so we can amortize the time spent calculating the length of each word over many calls to contains_word. What is the speedup of this optimization compared to the previous best-performing version?