Monday, December 21, 2009

The ECD Abstract Machine, A Programmer's Operational Semantics

There are many different styles of operational semantics but my favorite is not very well known. Hence this post. While in graduate school, I took a course on type systems from Amr Sabry in which we studied a miniature version of SML and used the style of operational semantics that I'm about to write about. Amr didn't give a name to this style, so I'm calling it the ECD abstract machine.

Why do I like the ECD machine? The ECD machine works a lot like a debugger. A debugger session has three components: a view of the source code for the currently executing procedure with the current position marked, a list of in-scope variables and their values, and a stack of procedure calls. The ECD machine has the same three components.

Historical aside: the ECD machine is closely related to the SECD virtual machine created by Peter Landin. The ECD machine drops the operand Stack and instead uses evaluation contexts.

In the following I'm going to write down what an ECD machine looks like for the lambda calculus. The grammar for the lambda calculus is given below (using the keyword "fun" instead of "lambda").

  e ::= id | fun id => e | e e

Note that function application is just two expressions next to each other, where the first is the function and the second is the argument. The id terminal is for identifiers (variable names). The only kind of value (the result of running the program) in the lambda calculus is a closure, which is the result of evaluating a function (a lambda). A closure is just a tuple containing a lambda and an environment:

  v ::= <fun id => e, env>

An environment (env) is a function from identifiers to values. Yes, this is a bit circular!
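To make the grammar concrete, here is one hypothetical way to encode these expressions, closures, and environments in Python. The tagged-tuple encoding and the constructor names are my own, not from the course.

```python
# One possible (hypothetical) encoding of the grammar as tagged tuples:
#   ('var', x)        identifiers
#   ('lam', x, body)  fun x => body
#   ('app', f, a)     application: f next to its argument a
# The only values are closures, a lambda paired with an environment;
# environments are dicts from identifiers to values -- note the
# circularity between values and environments.

def var(x):            return ('var', x)
def lam(x, body):      return ('lam', x, body)
def app(f, a):         return ('app', f, a)
def closure(l, env):   return ('closure', l, env)

# The term (fun x => x) (fun y => y):
example = app(lam('x', var('x')), lam('y', var('y')))
print(example)
```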

Unfortunately, the lambda calculus looks rather different from your typical imperative programming language, so this particular ECD machine may be more difficult to understand for a reader not familiar with the lambda calculus or functional programming.

First, a word about how to represent source code with a mark on the current position. Because we're dealing with an expression-oriented language, the current position is not a line number but instead a sub-expression. So the current position can be visualized as a circle drawn around the next sub-expression to be evaluated. The traditional way to represent this is with two pieces: the first piece is a data structure called an evaluation context that represents the source code outside the circle. The second piece is just the sub-expression inside the circle. The following is the grammar for evaluation contexts for the call-by-value version of the lambda calculus.

  E ::= [] | E e | v E

The [] is the hole in the context, i.e., the location of the circle. The function fill takes an evaluation context and an expression and returns the result of plugging the expression into the hole and then rebuilding the rest of the program. In the following we use lowercase e's for expressions and uppercase E's for evaluation contexts. We use the notation E[e] as shorthand for fill(E, e).
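Here is a hypothetical Python sketch of evaluation contexts and fill, using a tagged-tuple encoding of expressions and contexts that I am assuming for illustration.

```python
# Contexts mirror the grammar  E ::= [] | E e | v E  as tagged tuples:
#   ('hole',)        the circle itself
#   ('appL', E, e)   still evaluating the function position of e1 e2
#   ('appR', v, E)   function is already a value v; evaluating the argument

def fill(E, e):
    """Plug expression e into the hole of context E and rebuild the
    surrounding program; fill(E, e) is the E[e] notation."""
    if E[0] == 'hole':
        return e
    if E[0] == 'appL':
        return ('app', fill(E[1], e), E[2])
    return ('app', E[1], fill(E[2], e))

# The program x (fun y => y) with the circle drawn around x:
E = ('appL', ('hole',), ('lam', 'y', ('var', 'y')))
print(fill(E, ('var', 'x')))  # -> ('app', ('var', 'x'), ('lam', 'y', ('var', 'y')))
```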

Next, let's describe the ECD abstract machine. As stated above, the ECD has three components. The first is an Environment, the second is the Control, which we will represent with an expression of the lambda calculus, and the third component, the strangely named Dump, is the call stack. The following are the reduction rules for the ECD abstract machine. The variable x ranges over variables, s over stacks, and r over environments; r(x:=v) is the environment r extended to map x to v. Each reduction rule has a name given in parentheses on the right-hand side.

  (r, E[x], s) --> (r, E[r(x)], s)                               (VAR)
  (r, E[fun x => e], s) --> (r, E[<fun x => e, r>], s)           (LAM)
  (r, E[<fun x => e, r'> v], s) --> (r'(x:=v), e, (E, r)::s)     (APP)
  (r, v, (E, r')::s) --> (r', E[v], s)                           (RET)

The VAR rule handles the case of evaluating a variable by looking it up in the environment. The LAM rule evaluates a lambda into a closure, capturing the current environment in the second part of the closure. The APP rule starts a function call, pushing the surrounding context and environment onto the stack and evaluating the function body in the closure's environment extended with the argument, whereas the RET rule finishes a function call, plugging the resulting value back into the saved context. Each element of the call stack is a tuple containing an evaluation context and an environment.

Let's finish with an example:
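As a concrete stand-in, here is a self-contained (and hypothetical) Python sketch of the machine tracing the evaluation of (fun x => x) (fun y => y); the tuple encoding and the helper names are assumptions of mine, not the original formulation.

```python
# A sketch of the ECD machine. States are (environment, control, dump):
#   Expressions:  ('var', x) | ('lam', x, body) | ('app', e1, e2)
#   Values:       ('closure', lam, env)   where env is a dict
#   Contexts:     ('hole',) | ('appL', E, e) | ('appR', v, E)
#   Dump:         a tuple of (context, environment) pairs

def is_value(e):
    return e[0] == 'closure'

def fill(E, e):
    if E[0] == 'hole':
        return e
    if E[0] == 'appL':
        return ('app', fill(E[1], e), E[2])
    return ('app', E[1], fill(E[2], e))

def decompose(e):
    """Split e into (context, next sub-expression to evaluate)."""
    if e[0] == 'app':
        f, a = e[1], e[2]
        if not is_value(f):
            E, r = decompose(f)
            return ('appL', E, a), r
        if not is_value(a):
            E, r = decompose(a)
            return ('appR', f, E), r
    return ('hole',), e  # a variable, a lambda, or two values applied

def step(state):
    env, c, dump = state
    if is_value(c):                                      # (RET)
        (E, env2), rest = dump[0], dump[1:]
        return (env2, fill(E, c), rest)
    E, redex = decompose(c)
    if redex[0] == 'var':                                # (VAR)
        return (env, fill(E, env[redex[1]]), dump)
    if redex[0] == 'lam':                                # (LAM)
        return (env, fill(E, ('closure', redex, env)), dump)
    _, (_, lam, env2), v = redex                         # (APP)
    new_env = dict(env2)
    new_env[lam[1]] = v                                  # bind the parameter
    return (new_env, lam[2], ((E, env),) + dump)

def run(e):
    state = ({}, e, ())
    while not (is_value(state[1]) and state[2] == ()):
        state = step(state)
        print(state[1])                                  # trace the control
    return state[1]

# Evaluate (fun x => x) (fun y => y); the answer is the closure for fun y => y.
example = ('app', ('lam', 'x', ('var', 'x')), ('lam', 'y', ('var', 'y')))
result = run(example)
```

Running it prints each successive control expression: the two lambdas become closures, APP pushes the empty context and binds x, VAR looks x up, and RET pops the dump to yield the final closure.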

A parting question. Is the ECD machine space efficient with regards to tail-recursive functions? If not, how would you modify it to be space efficient?

Friday, December 04, 2009

Greatest Common Divisor

Euclid's algorithm for computing the greatest common divisor of two integers is beautiful because it is extremely simple and also captures an interesting property of linear equations. The equation ax+by = c has an integer solution if and only if gcd(a,b) divides c, where gcd is Euclid's algorithm written below. Recall that x divides y means there exists some n such that xn = y.
gcd(a,b) =
  if a == 0 then b
  else if b == 0 then a
  else if b < a then gcd(a - b, b)
  else gcd(a, b - a)
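The definition above translates directly into Python; the has_solution helper below is a hypothetical brute-force check of the claim about ax + by = c, searching a small window of integers.

```python
def gcd(a, b):
    """Euclid's algorithm by repeated subtraction, as defined above."""
    if a == 0:
        return b
    elif b == 0:
        return a
    elif b < a:
        return gcd(a - b, b)
    else:
        return gcd(a, b - a)

def has_solution(a, b, c, bound=20):
    """Brute-force search for integers x, y with a*x + b*y == c
    (a made-up helper, only to illustrate the divisibility claim)."""
    return any(a * x + b * y == c
               for x in range(-bound, bound + 1)
               for y in range(-bound, bound + 1))

print(gcd(6, 4))               # 2
print(has_solution(6, 4, 10))  # True: gcd(6,4) = 2 divides 10
print(has_solution(6, 4, 9))   # False: 2 does not divide 9
```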
Proving that Euclid's algorithm really works is a good exercise in applying strong induction. We are going to prove that gcd(a,b) is the greatest common divisor of a and b. To apply strong induction, we need to pick a number to do induction on. The numbers a and b are obvious candidates, but neither does the job. Consider the two recursive branches of the "if" expression in gcd. If we choose to do induction on a, then the "else" branch will cause us trouble because we won't be able to apply the induction hypothesis for gcd(a, b - a). If we choose to do induction on b, then we'll have the same kind of trouble in the "then" branch. We need some number that gets smaller in both branches. It turns out that a + b is such a number.

Theorem (Correctness of gcd).
gcd(a,b) is the greatest common divisor of a and b.
We proceed by strong induction on a + b. When trying to prove something about a function like gcd, it often helps to structure your proof in a way that mimics the definition of the function. That is, we'll do case analysis in the proof in a way that matches the cases in the definition of gcd.
Case a = 0:
In this case gcd(a,b) = b. We know that b divides 0 and b divides b. Also, for any other divisor d of a and b, it is trivially true that d divides b. Thus, gcd(a,b) is the greatest common divisor of a and b.
Case not (a = 0) and b = 0:
The reasoning is the mirror image of the previous case and left for the reader.
Case not (a = 0) and not (b = 0):
Without loss of generality, assume that b < a. Then gcd(a,b) = gcd(a - b, b). Note that (a - b) + b < a + b. So by the induction hypothesis we know that gcd(a - b, b) is the greatest common divisor of a - b and b, and therefore so is gcd(a,b), since the two are equal. Because gcd(a,b) divides both a - b and b, it also divides their sum a = (a - b) + b, so gcd(a,b) is a common divisor of a and b. To finish we need to show it is the greatest. Assume d is an arbitrary common divisor of a and b. Then d divides a - b, and because gcd(a,b) is the greatest common divisor of a - b and b, we can conclude that d divides gcd(a,b). We have therefore proved that gcd(a,b) is the greatest common divisor of a and b.

A proof of this theorem in Isabelle can be found here.

Monday, November 30, 2009

Strong Induction

Induction, in one form or another, is the main tool that computer scientists use to prove properties of the systems that they build. Induction has many forms, some of which are much more suitable to certain situations than others, so it's a good idea to know the many different forms.

Curiously enough, most of the different forms of induction boil down to good old mathematical induction. So in some sense, all one really needs to learn is mathematical induction. Nevertheless, the "boiling down" of the different forms to mathematical induction is not completely trivial, so it still makes sense to learn the others.

Recall that mathematical induction goes as follows:

If you want to prove something about all the natural numbers, forall n, P(n), it suffices to prove that
  1. P(0)
  2. forall k, P(k) implies P(k+1)
Mathematical induction is directly applicable in many situations, but it also falls down in some cases. For example, suppose you want to prove some property about binary trees and try to do induction on the height of the tree. You'd like to use the induction hypothesis to prove the property in question for the sub-trees. However, the height of each sub-tree is not necessarily one less than the height of the current tree (the height could be even less). Thus, you wouldn't be able to apply the induction hypothesis to each sub-tree.
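A tiny Python illustration of the problem (the encoding of trees as nested pairs is an assumption of mine): a tree of height 3 can have a subtree of height 1, so an induction hypothesis available only at height minus one would miss it.

```python
def height(t):
    """Height of a binary tree encoded as None (a leaf) or (left, right)."""
    if t is None:
        return 0
    return 1 + max(height(t[0]), height(t[1]))

# A lopsided tree of height 3 whose left subtree has height only 1 --
# two less than the whole tree, not one less.
lopsided = ((None, None), ((None, None), (None, None)))
print(height(lopsided))     # 3
print(height(lopsided[0]))  # 1
```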

In this situation, strong induction (a.k.a. complete induction or course-of-values induction) is a much better fit (and structural induction is an even better fit, but I'll wait to talk about that). Recall that strong induction goes as follows:

To prove a property about all natural numbers, forall n, P(n), it suffices to prove that for any k, if P(m) holds for every m < k, then P(k).

So the nice thing about strong induction is that you get to assume that the property is true of all the natural numbers less than k, instead of just k - 1 as is the case in mathematical induction.

Even though strong induction is much easier to apply in many situations than mathematical induction, strong induction boils down to mathematical induction. Here's the proof of strong induction by use of mathematical induction.

As with many proofs, the strong induction principle cannot be proved directly by induction; instead, a slightly stronger version can be proved. This seems unintuitive! Why would something stronger be easier to prove? The reason is that when proving something by induction, in the induction step you get to assume what you're trying to prove (for n-1), so the stronger the property, the more horsepower you have to get through the induction step. In this case, instead of proving forall k, P(k) we'll prove that forall j, forall k < j, P(k). It's easy to see that the latter implies the former: just pick j = k+1 and you have P(k).

Lemma. If (forall n, (forall k < n, P(k)) implies P(n)), then (forall j, forall k < j, P(k)).
The proof is by mathematical induction on j.
Base case (j=0):
There are no k < 0, so this case is vacuously true.
Induction step:
The induction hypothesis is: if (forall n, (forall k < n, P(k)) implies P(n)), then (forall k < j, P(k)). We need to show the corresponding property for j + 1. We proceed by assuming that
forall n, (forall k < n, P(k)) implies P(n) (call this fact H)
and need to prove that forall k < j+1, P(k). We then pick an arbitrary k less than j+1 and need to show that P(k). Note that by the induction hypothesis combined with H, we know that
forall m < j, P(m) (call this fact A).
Now, because k < j+1 we have two cases to consider:
Case k < j: We can apply fact A to conclude that P(k). Note that without the modification to strengthen this lemma, we would have been stuck here.
Case k = j: From fact A and H, we have P(j). But since j=k we can conclude that P(k).

Here is a version of the above proof written in a machine-checkable language (Isabelle).