This presentation contains slides from Linux device drivers workshop organized by Emertxe Information Technologies
(http://www.emertxe.com)Category Archives: System Programming.
Posted on | Link
Designing a Window Manager for a Micro-Controller
Posted by Khuram Ali
Posted in System Programming.
Algorithms matter!
Posted by Khuram Ali
Algorithms matter! Knowing which algorithm to apply under which set of circumstances can make a big difference in the software you produce. If you don’t believe us, just read the following story about how Gary turned failure into success with a little analysis and choosing the right algorithm for the job.’
Once upon a time, Gary worked at a company with a lot of brilliant software developers. Like most organizations with a lot of bright people, there were many great ideas and people to implement them in the software products. One such person was Graham, who had been with the company from its inception. Graham came up with an idea on how to find out whether a program had any memory leaks-a common problem with C and C++ programs at the time. If a program ran long enough and had memory leaks, it would crash because it would run out of memory. Anyone who has programmed in a language that doesn’t support automatic memory management and garbage collection knows this problem well.
Graham decided to build a small library that wrapped the operating system’s memory allocation and de-allocation routines, malloc() and free(), with his own functions. Graham’s functions recorded each memory allocation and de-allocation in a data structure that could be queried when the program finished. The wrapper functions recorded the information and called the real operating system functions to perform the actual memory management. It took just a few hours for Graham to implement the solution and, voila, it worked! There was just one problem: the program ran so slowly when it was instrumented with Graham’s libraries that no one was willing to use it. We’re talking really slow here. You could start up a program, go have a cup of coffee-or maybe a pot of coffee-come back, and the program would still be crawling along. This was clearly unacceptable.
Now Graham was really smart when it came to understanding operating systems and how their internals work. He was an excellent programmer who could write more working code in an hour than most programmers could write in a day. He had studied algorithms, data structures, and all of the standard topics in college, so why did the code execute so much slower with the wrappers inserted? In this case, it was a problem of knowing enough to make the program work, but not thinking through the details to make it work quickly. Like many creative people, Graham was already thinking about his next program and didn’t want to go back to his memory leak program to find out what was wrong. So, he asked Gary to take a look at it and see whether he could fix it. Gary was more of a compiler and software engineering type of guy and seemed to be pretty good at honing code to make it release-worthy.
Gary thought he’d talk to Graham about the program before he started digging into the code. That way, he might better understand how Graham structured his solution and why he chose particular implementation options.
Understand the Problem
A good way to solve problems is to start with the big picture: understand the problem, identify potential causes, and then dig into the details. If you decide to try to solve the problem because you think you know the cause, you may solve the wrong problem, or you might not explore other-possibly better-answers. The first thing Gary did was ask Graham to describe the problem and his solution.
Graham said that he wanted to determine whether a program had any memory leaks. He thought the best way to find out would be to keep a record of all memory that was allocated by the program, whether it was freed before the program ended, and a record of where the allocation was requested in the user’s program. His solution required him to build a small library with three functions:
malloc()
A wrapper around the operating system’s memory allocation function
free()
A wrapper around the operating system’s memory de-allocation function
exit()
A wrapper around the operating system’s function called when a program exits
This custom library would be linked with the program under test in such a way that the customized functions would be called instead of the operating system’s functions. The custom malloc() and free() functions would keep track of each allocation and de-allocation. When the program under test finished, there would be no memory leak if every allocation was subsequently de-allocated. If there were any leaks, the information kept by Graham’s routines would allow the programmer to find the code that caused them. When the exit() function was called, the custom library routine would display its results before actually exiting. Graham sketched out what his solution looked like.
The description seemed clear enough. Unless Graham was doing something terribly wrong in his code to wrap the operating system functions, it was hard to imagine that there was a performance problem in the wrapper code. If there were, then all programs would be proportionately slow. Gary asked whether there was a difference in the performance of the programs Graham had tested. Graham explained that the running profile seemed to be that small programs-those that did relatively little-all ran in acceptable time, regardless of whether they had memory leaks. However, programs that did a lot of processing and had memory leaks ran disproportionately slow.
Experiment if Necessary
Before going any further, Gary wanted to get a better understanding of the running profile of programs. He and Graham sat down and wrote some short programs to see how they ran with Graham’s custom library linked in. Perhaps they could get a better understanding of the conditions that caused the problem to arise. The first test program Gary and Graham wrote (Program A),
Program A codeint main(int argc, char **argv) {
int i = o;
for (i = o; i < 1000000; i++)
malloc(32);
exit (o);
}
They ran the program and waited for the results. It took several minutes to finish. Although computers were slower back then, this was clearly unacceptable. When this program finished, there were 32 ME of memory leaks. How would the program run if all of the memory allocations were de-allocated? They made a simple modification to create Program B,
Program B code
int main(int argc, char **argv){
int i= 0;
for (i = o; i < 1000000; i++)
void *x = malloc(32);
free(x);
exit (o);
}
When they compiled and ran Program B, it completed in a few seconds. Graham was convinced that the problem was related to the number of memory allocations open when the program ended, but couldn’t figure out where the problem occurred. He had searched through his code for several hours and was unable t o find any problems. Gary wasn’t as convinced as Graham that the problem was the number of memory leaks. He suggested one more experiment and made an other modification to the program, as Program C in which the de-allocations were grouped together at the end of the program.
Program C code
int main(int argc, char **argv){
int i= 0;
void *addrs[1000000];
for (i = o; i < 1000000; i++){
addrs[i] = malloc(32;)
for (i = o; i < 1000000; i++){
free(addrs[i]);
exit (o);
}
This program crawled along even slower than the first program! This example invalidated the theory that the number of memory leaks affected the performance of Graham’s program. However, the example gave Gary an insight that led to the real problem.
It wasn’t the number of memory allocations open at the end of the program that affected performance; it was the maximum number of them that were open at any single time. If memory leaks were not the only factor affecting performance, then there had to be something about the way Graham maintained the information used to determine whether there were leaks. In Program B, there was never more than one 32-byte chunk of memory allocated at any point during the program’s execution. The first and third programs had one million open allocations.
Allocating and de-allocating memory was not the issue, so the problem must be in the bookkeeping code Graham wrote to keep track of the memory.
Gary asked Graham how he kept track of the allocated memory. Graham replied that he was using a binary tree where each node was a structure that consisted of pointers to the children nodes (if any), the address of the allocated memory, the size allocated, and the place in the program where the allocation request was made. He added that he was using the memory address as the key for the nodes since there could be no duplicates, and this decision would make it easy to insert and delete records of allocated memory.
Using a binary tree is often more efficient than simply using an ordered linked list of items. If an ordered list of n items exists–and each item is equally likely to be sought-then a successful search uses, on average, about n/2 comparisons to find an item. Inserting into and deleting from an ordered list requires one to examine or move about n/2 items on average as well. Computer science textbooks would describe the performance of these operations (search, insert, and delete) as being O(n), which roughly means that as the size of the list doubles, the time to perform these operations also is expected to double.’
Using a binary tree can deliver O(log n) performance for these same operations, although the code may be a bit more complicated to write and maintain. That is, as the size of the list doubles, the performance of these operations grows only by a constant amount. When processing 1,000,000 items, we expect to examine an average of 20 items, compared to about 500,000 if the items were contained in a list. Using a binary tree is a great choice-if the keys are distributed evenly in the tree. When the keys are not distributed evenly, the tree becomes distorted and loses those properties that make it a good choice for searching.
Knowing a bit about trees and how they behave, Gary asked Graham the $64,000 (it is logarithmic, after all) question: “Are you balancing the binary tree?” Graham’s response was surprising, since he was a very good software developer. “No, why should I do that? It makes the code a lot more complex.” But the fact that Graham wasn’t balancing the tree was exactly the problem causing the horrible performance of his code. Can you figure out why? The ma lloc ( ) routine in C allocates memory (from the heap) in order of increasing memory addresses. Not only are these addresses not evenly distributed, the order is exactly the one that leads to right-oriented trees, which behave more like linear lists than binary trees. To see why, consider the two binary trees The (a) tree was created by inserting the numbers 1-15 in order. Its root node contains the value 1 and there is a path of 14 nodes to reach the node containing the value 15. The (b) tree was created by inserting these same numbers in the order <8, 4, 12, 2, 6, 10, 14, 1, 3, 5, 7, 9, 11, 13, 15>. In this case, the root node contains the value 8 but the paths to all other nodes in the tree are three nodes or less., the search time is directly affected by the length of the maximum path.
Algorithms to the Rescue
A balanced binary tree is a binary search tree for which the length of all paths from the root of the tree to any leaf node is as close to the same number as possible. Let’s define depth( Li) to be the length of the path from the root of the tree to a leaf node Li· In a perfectly balanced binary tree with n nodes, for any two leaf nodes, L 1 and L2 , the absolute value of the difference, ldepth( L2)-depth (L1) 1:<;:1; also depth( Li):<;;log( n) for any leaf node Li.’ Gary went to one of his algorithms books and decided to modify Graham’s code so that the tree of allocation records would be balanced by making it a red-black binary tree. Red-black trees (Carmen et al., 2001) are an efficient implementation of a balanced binary tree in which given any two leaf nodes L1 and L2 , depth(LJ)!depth(L1):<;; 2 ; also depth( ):<;;2’log2 ( n+1) for any leaf node L,. In other words, a red-black tree is roughly balanced, to ensure that no path is more than twice as long as any other path.
The changes took a few hours to write and test. When he was done, Gary showed Graham the result. They ran each of the three programs shown previously.
Program A and ProgramC took just a few milliseconds longer than ProgramB. The performance improvement reflected approximately a 5,000-fold speedup. This is what might be expected when you consider that the average number of n odes to visit drops from 500,000 to 20. Actually, this is an order of magnitude off: you might expect a 25,000-fold speedup, but that is offset by the computation over head of balancing the tree. Still, the results are dramatic, and Graham’s memory leak detector could be released (with Gary’s modifications) in the next version of the product.
Side Story
Given the efficiency of using red-black binary trees, is it possible that the malloc() implementation itself is coded to use them? After all, the memory allocation func tionality must somehow maintain the set of allocated regions so they can be safely deallocated. Also, note that each of the programs listed previously make alloca tion requests for 32 bytes. Does the size of the request affect the performan ce of malloc() and free() requests? To investigate the behavior of malloc(), we ran a set of experiments. First, we timed how long it took to allocate 4,096 chunks of n bytes, with n ranging from 1 to 2,048. Then, we timed how long it took to deallo cate the same memory using three strategies:
freeUp
In the order in which it was allocated; this is identical to ProgramC
freeD own
In the reverse order in which it was allocated
freeScattered
In a scattered order that ultimately frees all memory
For each value of n we ran the experiment 100 times and discarded the best and worst performing runs. Figure 1-3 contains the average results of the remaining 98 trials. As one might expect, the performance of the allocation follows a linear trend-as the size of n increases, so does the performance, proportional to n. Surprisingly, the way in which the memory is deallocated changes the perfor mance. freeUp has the best performance, for example, while freeDown executes about four times as slowly.
The empirical evidence does not answer whether malloc() and free() use binary trees (balanced or not!) to store information; without inspecting the source for free ( ) , there is no easy explanation for the different performance based upon the order in which the memory is deallocated.
The Moral of the Story
The previous story really happened. Algorithms do matter. You might ask whether the tree-balancing algorithm was the optimal solution for the problem. That’s a great question, and one that we’ll answer by asking another question: does it really matter? Finding the right algorithm is like finding the right solution to any problem. Instead of finding the perfect solution, the algorithm just has to work well enough. You must balance the cost of the solution against the value it adds. It’s quite possible that Gary’s implementation could be improved, either by optimizing his implementation or by using a different algorithm. However , the performance of the memory leak detection software was more than acceptable for the intended use, and any additional improvements would have been unproductive overhead.
The ability to choose an acceptable algorithm for your needs is a critical skill that any good software developer should have. You don’t necessarily have to be able to perform detailed mathematical analysis on the algorithm, but you must be able to understand someone else’s analysis. You don’t have to invent new algorithms, but you do need to understand which algorithms fit the problem at hand.
References
“Algorithms in a Nutshell by George T. Heineman, Gary Pollice, and Stanley Selkow.
Cormen, Thomas H., Charles E. Leiserson, Ronald L. Rivest, and Clifford St ein,
Introduction to Algorithms, Second Edition. McGraw-Hill, 2001.