Last Updated: Thursday, January 15 2004
This guide is intended to provide a "knowledge index" of the basic foundation of knowledge expected to be understood by Linux and BSD programmers. This guide does not apply only to the Linux and BSD communities, as it covers things from the hardware up.
Readers should already have a working understanding of at least one computer language such as HTML or C. Familiarity with a command-line operating system such as Microsoft DOS is also very helpful. If you are a novice programmer, this guide is not for you.
Computers are machines, like cars or can openers. One applies electricity and some effort to an "on" switch, and the machine does something useful for you. They are all designed by humans and intended to be used by humans, and it is imperitive that any programmer keep those two facts in mind at all times.
There are many different kinds of computers, just as there are many different kinds of cars. Some are built for speed, others are built for stability and safety. Regardless of what family a computer is supposed to belong to, all modern computers have the same fundamental components.
Computers take input from devices such as keyboards, mice, and a network. This becomes data, which is represented to people or other computers using devices such as monitors, speakers, or a computer network. Computers also store code, which is a special kind of data. Code, or programs, contain a series of instructions which tell the computer what to do with the data.
Processors are usually grouped by "bits": 16 bit, 32 bit, 64 bit. This number usually refers to the default size of a memory address and/or the size of the general purpose registers. This does not mean that the processor does not contain registers which are not equal to it's "bitness", however. For example, the Intel x86 processors are considered to be 32 bit processors, but modern versions have 64 bit and 128 bit wide registers for use by their SIMD and MMX vector instructions.
Most mature processor families have backward compatibility modes which allow them to execute code written for older processors in the family. For example, the AMD x86-64 family of processors has 64 bit wide general purpose registers and 64 bit wide memory addresses, but requires the code to explicitly ask for 64 bit wide registers; without special instruction prefixes, the x86-64 family processors act just like their 32 bit predecessors and provide 32 bit wide general purpose registers.
The memory in a computer is almost always used in one of the following ways:
Segmentation is primarily an Intel-specific mechanism. The segmentation mechanism operates in parallel with the paging mechanism to divide memory into segments of different classes.
After space for the operating system has been allocated
In this country, younger people are usually more comfortable with technology...
C is the Lingua Franca of UNIX. C as a language is very close to the machine, and nearly all modern programmers are familiar with at least it's concepts and standard library functions (printf, strtok, malloc, free, etc). Many modern languages inherit ideas, concepts and syntax from C, if for no other reason than to make their learning curve more shallow.
C is based on a structured programming model, where the program is divided into chunks called functions. The functions can call each other (and even themselves) via a mechanism called a function call. The program begins at a special function named main, and ends when the main function exits or the program explicitly asks the operating system to shut the program down.
In C, data is given to a function with an argument or arguments. Functions can take no arguments, many arguments, or any number of arguments in between. Functions may only return only one argument. One trick used to get around this limitation is that the function takes a memory address or memory addresses as an argument, and then writes the data to those memory addresses, returning the status of the operation in its return variable.
C provides a device called a "struct" or data structure which allows the programmer to consolidate many different pieces of data into a single structure which can be passed around. This keeps a function from having an excessive number of arguments and reduces overall code size.
C has several built-in data types. In order from smallest to largest, they are: char, short, int, long, float, and double. Each of these can be modified by the signed or unsigned prefixes, and can further be modified by the short and long data size prefixes.
A common confusion among novice C programmers is what is signed by default and what is unsigned by default. The rules are:
ints are used for function return values often, and UNIX library calls usually represent an error condition (for functions which return an int) by returning -1. This can sometimes become a problem for functions which need to return an actual negative one (because the function returns an array index, for example). In these cases, the function usually has a pointer argument which is used to write the result back to the caller's memory space, and returns either 0 or -1 to indicate success or failure. Here's an example:
int get_index(int * index) {
*index = calculate_index();
if(index < INDEX_MIN && index > INDEX_MAX) {
return 0;
}
return 1;
}
int main(void) {
int i;
if(get_index(&i) == -1) {
perror("Invalid index");
}
do_something_with_index(i);
return 0;
}
In this example, the get_index function may need to return an actual -1, but this is interpreted by the main function as an error condition. The solution is to have the get_index function return the index value via a pointer argument (sometimes called an "out" argument) which is accessed by the caller to get the function output the caller is really after.
Bitfields are a feature of the C language commonly used when interfacing with external hardware, unusually dense binary file formats, and certain network protocols.
Bitfields are always declared within a struct using the following syntax:
struct mystruct {
unsigned int bitfield: 3;
unsigned int another_bitfield: 8;
};
The bitfield declaration consists of three parts: the type, the name, and the length. The type must always be of a size greater than or equal to the number of bits in the bitfield. For example, on a machine with a 32 bit "int" type, you could say "int bitfield: 32" but not "int bitfield: 33", because 33 is greater than the size of a 32 bit int.
Some platforms swap the order of the fields within each group of 8 bits, other platforms leave them in the order they are declared. Consider the following code:
struct bitfield_struct {
unsigned int a: 4;
unsigned int b: 4;
};
bitfield_struct bs;
bs.a = 15; // All ones
bs.b = 0; // All zeros
If you tried to write this data to a file or to the network, you would see that some machines send the byte 0xf0 and other machines send the byte 0x0f, depending on the compiler and processor used. Note that only fields within each group of eight bits are backwards if you run into this problem. For example:
struct bitfield_struct_2 {
unsigned int a: 4;
unsigned int b: 4;
unsigned int c: 4;
unsigned int d: 4;
};
bitfield_struct_2 bs2;
bs2.a = 1;
bs2.b = 2;
bs2.c = 3;
bs2.d = 4;
If you're running into the backwards ordering problem, this code will result in four four-bit groups in the order 2, 1, 4, 3, NOT 4, 3, 2, 1.
The solution to this problem is to determine whether your compiler uses a backwards ordering or not and use #ifdef...#endif pairs to define two sets of the data; one with the ordering you really want and one with each 8-bit group of fields in backwards order.
Another possible solution is to declare everything as an array of 8 bits and create get and set methods to read and modify the data in the array using bitmasks.
Procedures or functions provided by libraries are usually indexed by a "symbol", which is a alphanumeric representation of the function name. The restrictions and format of symbols is defined by the ABI used by the library or executable.
The usual terminology used when talking about libraries and symbols is libraries export symbols, and executables and other libraries import symbols. Tools such as WinOBJ from sysinternals and GNU objdump's --syms option can be used to examine the import and export tables of libraries and executables.
Non-C languages such as C++ and Java have structures in the function name (or the "prototype") which cannot easily be represented as alphanumeric characters, and so the compiler "mangles" the function names to fit the symbol naming conventions of the ABI. This process is usually reversable, and many debugging tools provide a mechanism for "de-mangling" the symbol names back into their native source-code representation. For example, this is a C++ symbol from the GNU Common C++ library, as mangled by g++ v3:
release__Q2t12basic_string3ZcZt18string_char_traits1ZcZt24__default_alloc_template2b1i0_3Rep
Once demangled by objdump, the real function name becomes clear:
basic_string<char, string_char_traits<char>, __default_alloc_template<true, 0> >::Rep::release(void)
The GNU binutils suite includes a demangling utility, c++filt(1), which will take a list of symbols on standard input and return the demangled results to standard output. It's particularly useful because it can be configured to do so for most mangling formats, unlike other vendor utilities which are generally capable of only demangling symbols in their own mangling format.
The source files created by programmers are compiled by the compiler into object code. This object code is not executed directly by the operating system. The object code has a calling convention such as "pascal" or "stdcall" which determines how the object code should pass and return data. The three major families of calling conventions are the stack-based, register-based, and heap-based calling conventions.
All stack-based calling conventions pass and return data on the processor stack. This uses a special processor register (which is present on all processors with a hardware stack) called the stack pointer or sometimes just SP. The SP points to the current "top" of the stack. On most processors, the stack grows downwards in memory. This means that as objects are pushed onto the stack, the value of SP decreases rather than increases as the stack grows "larger". A few processors (such as Hewlett-Packard's PA-RISC family of processors) have a stack which grows in the opposite direction, so the value of the stack pointer increases as the stack grows larger.
All register-based calling conventions pass and return data within the processor's registers. Intel-style processors have only four registers generally available for this purpose, so this calling convention is usually only used on those CPUs when the function parameters and return value are very small in size. In many cases large data structures need to be passed between functions compiled to object code which uses a register-based calling convention. In these cases, the registers are set to point to a data structure located somewhere in the system's memory.
Pure heap-based calling conventions are very rare, and not found outside the world of embedded devices. Heap-based calling conventions involve putting parameters into a special memory area, which is then read from and written to by the called function. Many systems (such as Windows NT family's native API) use a hybrid model where some parameters are passed in processor registers, and others are put in memory and pointed to by a register parameter.
Object code is linked together by the programmer into an executable or a library. There are a few rules to this:
Let's go through each of these in more detail:
Objects for one family of processors cannot link to objects for another family of processors.
Different processors have different instruction sets. There is no computer in existance which contains more than one family of processors, so it is not possible to run an executable composed of code for several different processors, which is what you would have if you linked together objects compiled for different processor families.
The phrase "processor family" refers to all processors which are compatible with each other, and most manufacturers provide very good backwards compatibility in their products. Here is a list of some processors. Each processor higher up in the list has features listed in parenthesis which improve upon older processors lower in the list. Code compiled for an older processor will usually run on a newer processor. However, code compiled for a newer processor which uses a feature in the list will not run on an older processor.
SPARC Family
Sparcv9 "UltraSparc" (64-bit addressing)
Sparcv8 (integer multiply and divide instructions)
Sparcv7
Intel x86 Family
AMD Athlon 64 / AMD Opteron (x86-64 instruction set)
Intel Pentium 4 (SSE2)
Intel Pentium 3 / AMD Athlon / Newer Intel Celerons (SSE or "KNI")
Intel Pentium MMX / AMD K7 (MMX instructions)
Intel Pentium / AMD K6 / Cyrix 6x86 (TSC)
Intel 486 / AMD 5x86 (different alignment rules, atomic check-and-decrement
instruction, CPUID)
Intel 386 (32-bit protected mode)
Intel 286 (24-bit protected mode)
IBM Family
IBM S/390 family (64-bit addressing)
Older IBM processors
PowerPC Family
PowerPC 64 (64-bit addressing)
PowerPC G4 (Altivec instructions)
PowerPC G3
Older PowerPC chips
Objects for one operating system cannot link to objects for another operating system.
Different operating systems provide different support facilities. An object which uses facilities (such as networking APIs or graphics APIs) from one operating system may be able to link against objects which use different facilities, but the resulting executable will usually fail at run-time. Some operating systems are capable of running the resulting executable anyway because they emulate other operating systems' facilities, but this does not happen often.
If an object does not use any facilities from the operating system which cannot be found on another operating system (these are called "non-portable" facilities), then the object is usually safe to link against.
C++ objects from one compiler cannot link to C++ objects from another compiler.
There are efforts underway to standardize a cross-vendor C++ ABI which will (theoretically) allow objects from one compiler to link against objects from another compiler. G++ 3.2 boasts nearly complete support for the proposed ABI, and Intel's compiler comes close.
C++ is very different from C. C++ has features such as RTTI and templates which compile into many, many assembly instructions which are sometimes spread out over a wide area in the object code. Different compilers implement these in different ways because there is no standard for doing so for each processor, and so they use different symbol mangling schemes to prevent objects from being able to link to each other. Even if they all used compatible mangling schemes, the differences in the way the compiler designers chose to implement certain features would spell disaster for the executable at runtime as the code from the different compilers fought over internal resources of the processor, or did not respect each others' conventions.
Sometimes, even different versions of the same compiler will produce mutually incompatible C++ object code. A very visible example of this is the transition from G++ 2.x to G++ 3.x in the Linux community. Shared libraries compiled with G++ 2.x are completely incompatible with libraries and executables compiled with G++ 3.x. This is a big problem with application developers who use a Linux distribution which includes C++ libraries. If those libraries are compiled with a different compiler than the one the developer is using to compile an application which uses the libraries, it doesn't work. This problem usually manifests itself as missing symbol errors and various other bizarre linker errors, causing the programmer to poke through the code for hours chasing down red herrings before he finally realizes what's going on. In general, you use the same version of the C++ compiler used to compile the libraries your application depends on or you're in for a world of trouble. This doesn't mean that two different versions of G++ cannot coexist on the same system, though. Because incompatible versions of G++ use different names for their run-time support libraries, it's possible to run a G++ 3.x-compiled application on a system compiled with G++ 2.x; it just isn't possible for it to depend on or link against any C++ libraries included with the system.
C objects from one 32-bit compiler can usually be linked to C objects from another 32-bit compiler.
Many operating systems provide a native compiler written by the operating system vendor (Microsoft, Sun, and SGI all do this). In addition, many companies provide third-party compilers for various reasons. Intel, for example, provides a compiler for Linux and Windows on Intel processors which takes advantage of features found in newer processors (such as SIMD and MMX). Object code produced by the Intel C compiler can almost always be linked against objects produced by other compilers.
Because they usually lack MMUs, 16-bit processors usually have very strict rules on how memory is addressed. Different compilers for 16-bit processors may use different memory management schemes for addressing memory, and in most cases they are incompatible. It is sometimes possible to link 16-bit object code anyway (one example is objects from the Borland and Microsoft compilers for MS-DOS), but usually requires a great deal of set-up and places a number of restrictions on the programmer's ability to address memory.
Java objects must use the JNI to link to C or C++ objects.
Programmers use a compile-run-debug cycle repeatedly to develop code. In between the run and debug stages of this cycle, the programmer checks the output or behavior of the program to verify that their code is performing as intended.
Sometimes, the programmer mistakenly interprets correct output as incorrect output, usually because the programmer does not completely understand something or the requirements are written incorrectly. This occurs more frequently when the output is filtered by another program before being presented to the programmer.
So your routine is outputting garbage, eh? Maybe you forgot to terminate the buffer, so it's actually outputting the correct results but they scroll off the screen when all the garbage comes out of the routine and onto your screen.
In parallel systems which have n execution tasks operating on m sets of independent data, where m < n, there exists the possibility that one of the execution tasks is overwriting or corrupting the data generated by another task. This is solved by the use of synchronization primitives which prevent uncontrolled simultaneous access to the data structure by multiple execution tasks simultaneously.
If your debugger does not support the ability to trace which execution tasks are writing to the data being corrupted, the typical solution is to replace direct writes to the data with an accessor routine which logs the change to the data before actually changing the data; the programmer can then review the log to check if it is a synchronization problem.
Perhaps you misunderstood the return semantics of an API, and that error code isn't really an error code at all, but actually a device handle or a pointer to the data you wanted.
This comes up often with functions like Win32's SendMessage() function or Linux's ioctl() function, which return radically different data depending on the message you're sending or the ioctl you're asking to be performed. It's not enough to consult the documentation for the function that's being called, you must also consult the documentation for the message or ioctl parameter you're passing to the function.
Sometimes, when the debugger has failed to provide you with insight into what's causing the problem, you resort to the stare-at-the-code method of debugging. One class of bug this fails to reveal is the case where you're putting the wrong data into an argument. For example:
#include <stdio.h>
#include <err.h>
int cat;
int apple;
void writeit(int arg1, int arg2) {
cat = arg1;
apple = arg2;
}
int main(int argc, char * argv[]) {
if(argc < 3) {
fprintf(stderr, "Usage: %s APPLE CAT", argv[0]);
exit(1);
}
writeit(atoi(argv[1]), atoi(argv[2]));
printf("cat = %i\n", cat);
printf("apple = %i\n", apple);
return 0;
}
The problem here is that the documentation is wrong. The usage line says that you should pass APPLE as the first argument and CAT as the second argument, but the writeit() function does it backwards: it treats its first argument as CAT and its second argument as APPLE. This kind of bug is especially insidious and time-consuming to find because the programmer naturally suspects a subtle problem, and so they overlook the possibility that the data may simply be entered incorrectly.
This is yet another case for inline function documentation, since the staring-at-the-code method of debugging won't easily reveal the semantics of the function (that is, what you mean by a call to it) unless they're documented. To avoid this kind of bug, you must ensure that the semantics of one function's expected input is exactly the same as the function which is supplying the input; and comparing these meanings to see if they are the same is much easier if they are in English, especially if the functions are long.
Another common situation where this occurs is when you are accessing data by reference, and the reference suddenly starts pointing to different data than when it is originally required. This occurs frequently in parallel systems (such as multithreading) during the creation of a parallel execution task where the context for a new execution task is provided by reference. Consider the following Win32 example:
void dummy(void * arg) {
printf("%i\n", *((int *) arg));
}
...
int i = 5;
while(i--) {
_beginthread(&dummy, 0, &i);
}
The i variable in this example has the (implicit) auto allocation specifier, which means that it is allocated on the stack since this is a Windows machine. The problem is that the threads continue to reference the i variable via their pointer argument, which will result in different results during reads as the parent thread continues to increment the i variable.
Input validation errors occur when a program or subroutine assumes that its input is in a certain format before doing something with it.
The typical defense against input validation errors are to architect your system in such a way that all input must be flagged as valid by validator routines which accept or reject the input before passing the input to routines which make assumptions about the format of the input. The Perl interpreter provides a "taint" mechanism which does exactly this.
Off-by-one errors are a class of programmer errors made when working with arrays or lists incorrectly.
Sentinel based array termination systems, such as the strings in the C language, are especially vulnerable to this bug because the off-by-one error can cause the terminator to be missed during reads or writes and the system will read unintended data until it comes to another sentinel value (or triggers a memory protection fault).
This is typically the result of confusing the size of a thing with the index of the last element of a thing. In a system where indexes are numbered starting from zero, the index of the last element of a thing is always one less than the total size of the thing.
This bug is more likely to occur when multiple such calculations are performed on several objects which are concatenated and then indexed using the same base address.
An example:
/**
* Return the last character of a string.
*/
char * getlast(char * s) {
int last;
last = strlen(s);
return &s[last];
}
The problem here is that the last variable contains the size of s, not the index of the last element of s. If s was "Hello", then strlen() would return 5, because "Hello" contains five characters. However, C arrays are indexed starting from zero, so the index of the 'o' character is 4, not 5. This function will actually return the address of the NULL terminator after the '0' character, which is wrong.
The problem would be compounded if the resulting pointer was used in calculations of additional indexes into a buffer which multiple strings were being concatenated into.
If the size of a thing in an array or list is n, then the index of the last element of the thing in memory units is always (n + b - s), where b is the index of the first element and s is the size of the elements in the array in memory units. Note that s is not always 1 when dealing with strings made up of characters in a system which uses wide character sets such as Unicode.
Buffering is a common performance trick in the family of optimizations via laziness. The basic idea is that some I/O operations are very expensive (such as the UNIX read(2) and write(2) syscalls), so you want to avoid calling them as much as possible. Usually these operations take the same amount of time for all sizes of data below a certain threshold, usually the MMU's page size or the sector size of the I/O device (512 bytes for most modern mass storage devices).
The solution to this is to create a buffer which reads a huge chunk of data at once and then dole out bits of the buffer to the target device (if writing) or the application (if reading). You usually get better performance with larger buffers, but there's a limit past which there are diminishing returns. This limit is usually in one of the following categories:
The f* series of functions provided by C89 and C99 compliant C run time libraries implements buffering, which provides a performance increase over small I/O increments with raw syscalls.
Buffering comes up frequently when dealing with lexers and scanners for processes which examine a sequential-access (as opposed to random-access) input or output stream to change their behavior. One example of this would be a web client.
When writing a sequential-access scanner which is trying to determine the newline boundaries of the input, you can either buffer the entire input and perform random-access scanning on the whole thing in a memory buffer, or you can create a small buffer which takes small bits of the input at a time and performs random-access scanning on a smaller scale.
In the latter case, the buffer must be at least two times the size of the largest possible line in the input so the caller does not have to get its input in chunks. However, in many cases (such as when writing a network server) you do not have control over the client and so you must decide whether you will truncate an excessively long input or you will force the caller to retrieve its data in chunks. Failing to specify some kind of limit for the amount of data to be retrieved will cause your application to be vulnerable to overflowing its buffers, which is a possible security risk. See the gets(3) manpage on a modern UNIX system for an example of an API which fails to specify an input limit.
Many network protocols specify hard limits on the sizes of possible inputs, so you have an idea of where to truncate the input. If the protocol does not specify a limit, such as with a HTTP POST request, you must have a "sink" for the data which does not require random access to the data to process it. In the HTTP POST case, this sink could be a file the data should be written to, a hardware device such as a printer (this is how the Internet Printing Protocol works), or a HTML parser which converts the data into a document tree and graphic buffers to be rendered by the windowing system.