## Monday, July 30, 2012

### The Semantics of a Familiar Language: Featherweight C

The example languages used to explain programming language semantics are often based on languages that are less familiar to many programmers, such as the lambda calculus or ML. This blog post is motivated by the need to have more example semantics for languages that are familiar to many programmers. One such semantics is that of Featherweight Java by Igarashi, Pierce, and Wadler. FJ does a nice job of capturing the essence of Java with a relatively simple small-step semantics.

In this blog post I begin to do something similar for a subset of the C language that I'm going to call Featherweight C, or FC for short. I should emphasize that the point of FC is educational, as an example in programming language semantics. The point is not to create a complete semantics for C. That has recently been accomplished by Chucky Ellison in his Ph.D. thesis A Formal Semantics of C with Applications. The C language, while often called "small", is actually a rather complex and subtle language once you look carefully at all the details. I suspect that the C language, in all its gory details, is not a language with which many programmers are familiar.

With Featherweight C, I'm going to try and capture a subset of C that is both familiar to many programmers and that captures the essence of C. By that I mean that it includes the following distinguishing characteristics of C:

• functions with stack-based allocation and no lexical scoping,
• pointers and heap allocation,
• arbitrary control-flow via goto, and
• character-based input and output.
Of course, one could select other characteristics of C as being the distinguishing ones. These just strike me as the important ones from a language design standpoint.

Featherweight C draws considerable inspiration from the C subset named Clight by Sandrine Blazy and Xavier Leroy. Featherweight C is a bit smaller syntactically than Clight but a bit closer to C because it includes goto. Blazy and Leroy give a big-step semantics to Clight whereas here I define an abstract machine for Featherweight C, which I think better corresponds to a programmer's mental model of an executing C program. However, there is no deep difference between big-step semantics and abstract machines, as Oliver Danvy and his students have taught us. (I'll blog about the equivalence between big-step semantics and abstract machines someday soon.)

Disclaimer: My knowledge of C is a tad rusty and I put this together in a rush, so don't consider it the gospel truth! If you see anything that is inaccurate with respect to the C standard, please post a comment!

### Syntax

With the above four characteristics in mind and drawing inspiration from Clight, I've carved out the following subset of the C grammar to call Featherweight C, described in more detail below. (I'm still looking for ways to reduce the size of FC without removing any essential characteristics. If you notice something that could be removed, let me know!)

We have just three types in FC: integers, pointers, and function pointers. We keep pointers and function pointers separate because we're going to allow different actions on pointers and function pointers. Function pointers may be used in a function calls whereas (normal) pointers may be dereferenced.

The expressions include variables, integer literals, some arithmetic and logical operations, and pointer operations (address-of and dereference). The l-value subset of the expressions may be used on the left-hand side of an assignment. We do not include any possibly side-effecting expressions (such as assignment or function call) because then the semantics would need to consider sequence points and order of evaluation, which makes my head hurt.

Assignments and function call do appear as statements. However, we do not include void as a potential return type because void is a rather strange type. This choice forces us to make free be it's own statement (because it has a void return type). For control flow, we include an if-goto combination and labeled statements. Because FC has functions, we of course include return statements. In addition to programmer-defined functions, we include three functions from the C standard library: malloc, getchar, and putchar.

Programs in FC are sequences of declarations, which can either be function declarations or global variables. Our functions begin with a sequence of local variable declarations followed by a sequence of statements.

### Operational Semantics via an Abstract Machine

As mentioned above, we'll create an abstract machine to specify the semantics of Featherweight C. An abstract machine is similar to a small-step semantics in that it takes small steps, one transition for each computation. An abstract machine differs from a small-step semantics in that instead of relying solely on textual rewriting of the program itself, an abstract machine includes auxiliary data structures that facilitate the description of what's going on during execution. (An abstract machine differs from a virtual machine in that it still works over program text, whereas a virtual machine works over instructions such as bytecodes.)

The two main auxiliary data structures that we'll need for the FC machine are a procedure call stack and memory. Our treatment of the stack and memory differs from that of a typical computer because we need to be careful to not define behavior that shouldn't be defined, such as accessing past the end of an array or jumping a pointer from one chunk of allocated memory to another using pointer arithmetic. We're trying to create a specification, whereas, in an implementation, you're free to provide behavior for things that are undefined. Also, in most computers the stack is just a region within memory. Here we'll put the control-flow aspects into a separate stack data structure while the values of stack-allocated variables will be in memory. So anything that can be thought of has having an address, including functions, is stored in memory.

Getting more specific, the memory will consist of blocks, one block for each call to malloc. Each block is a tuple of 4-byte values. (We're assuming a 32-bit machine.) An address is therefore a pair $(b,n)$ containing a block identifier $b$ and an offset $n$ into the block.

A stack is a list of frames and a frame is a 4-tuple containing the following information:

1. the address in which to store the return value,
2. the statements following the function call (think return address),
3. the address of the enclosing function, and
4. a mapping from variable names to their addresses.
The address of the enclosing function is needed because the goto statement needs to look at the entire body of the function to determine where to jump to.

Putting this altogether, the state of the machine will consist of the current frame (minus the return value address), the stack, and memory. The formal definition of the state is given below. The garbage value is meant to represent an uninitialized value.

The operational semantics describes transitions (steps) between abstract machine states. However, because FC includes input and output, we'll optionally annotate each transition with an input or output action:

The abstract machine uses three important auxiliary functions: addr computes the address associated with an l-value, val computes the value of an expression, and goto finds the statements that follow a particular label. The $\delta$ function specifies the results for the built-in operators. I'm going to skip its definition for now, but come back later and fill it in.

We also use the following auxiliary notations in the description of the transition relation.

We are now ready to define the transition relation. It will have the form where $\gamma$ is the global variable map.

#### Assignment

First we define the transition for an assignment statement. The machine computes the address of the left-hand side, the value of the right-hand side, stores the value at the address, and moves on to the next statement without emitting any actions. Let's try out this rule on a couple examples, starting with a simple one.

Example 1. Suppose the state of the machine is where We compute the address of the left-hand side and the value of the right-hand side. so we transition to the following state

Example 2. Now for a more complex example involving pointers. Suppose $p$ is a pointer to pointer to a garbage integer and $q$ is a pointer to 42. Suppose the state of the machine is where We compute the address of the left-hand side: and the value of the right-hand side: So the machine transitions to the following state: where $\mu_4$ is like $\mu_3$ except at address $(b_3,0)$:

#### Input and Output

Next let's consider input and output. A call to getchar retrieves a character (represented by an integer) from standard input. We model this by annotating the transition with the input character i. The character is store at the address of the left-hand side. A call to putchar is similar, except that we annotate the transition with the output character given by the argument to putchar, which we evaluate using the val function.

#### Labels and Goto

The label statement is the most trivial of our statements. We simply move on to execute the statement that was labeled. For the conditional goto, we have two cases to consider, depending on whether the conditional expression evaluates to zero or not. First, the non-zero case in which we use the auxiliary goto function to find the statements that follow the given label. In the case of zero, the machine simply falls through to the next statement.

Example 3. Consider the following sequence of statements that implement a while loop. The comments assigning numbers to each statement:

  x = 1;                      // 1
start:                        // 2
if (x == 0) goto end;       // 3
x = x - 1;                  // 4
if (1) goto start;          // 5
end:                          // 6
x = putchar(x + 65);        // 7

The abstract machine executes these statements in the following order: and outputs the letter A (ASCII code 65).

To go into more detail regarding goto, consider the transition from statement 5 to 3. Let $s_i$ refer to the ith statement in the above program. The conditional expression of $s_5$ does not evaluate to 0, so the machine uses the goto auxiliary function to determine the target of the jump, looking for statement labeled start. The result is the sequence of statements following the start label, which is $s_3,s_4,s_5,s_6,s_7$. So the machine transitions to a new state and begins executing at $s_3$.

#### Malloc and Free

A call to malloc allocates a block of n bytes in memory and returns the address of the block, which is assigned into the left-hand side. Because our values are 4 bytes, the allocated block will contain a tuple of n/4 uninitialized values. We use the notation $\mathsf{garbage}^i$ for a tuple of $i$ garbage values. To execute a call to free, we evaluate the argument to an address and then remove that address and its associated block from memory.

#### Function Call and Return

The transition for a function call is the most involved of the transitions, so take a deep breadth and make sure you have some coffee at hand! Computing the address of the left-hand side of the assignment gives us the address in which the return value will be stored. We'll push it along with the current frame onto the stack. We'll compute the value of the expression in the function position to get the address $a_{f'}$ of the function we'll be calling, which we retrieve from memory, and we compute the values of the arguments $\overline{v}$. Next, we allocate single-value blocks $\overline{b}$ in memory, one for each parameter and local variable, and record the mapping of variable names to these addresses in the environment $\rho'$. We initialize the parameter's locations with the argument values and we leave the local parameters uninitialized. Thankfully, returning from a function is much easier. We simply pop the top frame from the stack, store the return value at the appropriate address, and deallocate the storage for the parameters and local variables.

#### Multiple Transitions

We define the multi-transition relation below. The main idea is that we annotate a multi-transition by collecting up all the actions into a sequence of actions.

#### The Semantics of a Whole Program

To wrap up the semantics for FC, we give the meaning for a whole program in terms of the abstract machine transitions. The meaning is the return value from the main function together with the sequence of input and output actions that occurred during program execution. Nothing else is considered observable. The meaning is expressed as a relation because there can be many different action sequences depending on the input. The initial memory $\mu_0$ contains storage for the global variables and functions in program $p$, a pointer to each function, and a block $a_{ret}$ for the return value from main. The global environment $\gamma$ maps the names of the global variables and functions to the addresses of the global variables and the addresses of the function pointers, respectively.

### Type System

The type system for the expressions of FC is defined as follows. Here's an exerpt from the definition of $\Delta$. The type system for the statements of FC is defined as follows. The following is the type rule for function definitions. To type check the whole program, we first collect the types of all the top-level declarations into the global environment $\Gamma_0$. Then we type check every top-level function using $\Gamma_0$ as the environment.

### Conclusion

This post defined a subset of C, Featherweight C, that I hope is familiar to most programmers, and provides an example of using an abstract machine to define operational semantics. There's a few things left to be done: fill in the $\delta$ function, and add more concrete examples, which I'll do the next time I can grab an hour or two. If you see any places in which this semantics is inaccurate, please let me know!