17.8 Application: Control Flow Graphs

In this section, we’ll give a brief introduction to one of the applications of graphs to modelling computer programs. Back in Chapter 16, you learned about abstract syntax trees, which we can use to represent program code. Now, we’re going to look at how to represent not just program code, but also the different possible execution paths through a program. This type of modelling is an extremely powerful tool, and is used by both code checking tools like PythonTA as well as compilers for various programming languages to generate efficient machine instructions from program code.

What is a control flow graph?

Intuitively, a control flow graph is a representation of the different blocks of code in a Python program, and the different paths that the Python interpreter can take through the code. To get a clearer sense of what this means, let’s introduce one foundational definition.

A basic block is a sequence of non-compound statements and expressions in a program’s code that are guaranteed to execute together, one after the other.

Here are some examples and non-examples of basic blocks.

# A single statement is a basic block.
x = 1

# A sequence of multiple statements and function calls is a basic block.
x = 5
y = x + 2
z = f(x, y)
print(x + y + z)

# A basic block can end with a return or raise statement.
x = 5
y = x + 2
return f(x, y)

# But a sequence of statements with a return/raise in the middle is
# NOT a basic block, since the statements after the return/raise aren't
# going to execute.
x = 5
return x
y = x + 2  # Will never execute!

# An if statement is not a basic block, since it is a compound statement.
# The statements it contains aren't guaranteed to execute one after the other.
if x > 5:
    y = 3
else:
    y = 4

Typically we treat basic blocks as being maximal, i.e., as large as possible. So if we have a sequence of assignment statements (x = 5, y = x + 2, etc.), we treat them as one big block rather than consisting of multiple single-statement blocks.

Now let’s look at that if statement example in more detail. We can divide it up into three basic blocks: one for the condition (x > 5), then one for the if branch (y = 3) and one for the else branch (y = 4). When we first learned about if statements in Section 3.4, we drew simple diagrams to show the possible ways the Python interpreter could take through our code. We can now formalize this idea, and extend it to other kinds of control flow statements like loop.

A control-flow graph (CFG) of a program is a graph \(G = (V, E)\) where:

\(V\) is the set of all (maximal) basic blocks in the program code, plus one special elements representing the end of a program. We typically restrict \(V\) to only include reachable basic blocks, i.e., code that can actually be executed at some point by the Python interpreter.
\(E\) is the set of edges, where:
- There is an edge from block \(b_1\) to block \(b_2\) if and only if the code in \(b_2\) can be executed immediately after the code in \(b_1\).
- There is an edge from block \(b\) to the special end block if and only if the program can stop immediately after executing the code in block \(b\). This occurs if there is no code written after \(b\), or if \(b\) ends in a return or raise statement.
Unlike the graphs we’ve seen so far, a control flow graph’s edges are directed, meaning order matters! Formally, we can represent each edge as a tuple rather than a set, where \((b_1, b_2)\) means there’s an edge from \(b_1\) to \(b_2\). This is different from the tuple \((b_2, b_1)\)—it is possible (and common) for an edge \((b_1, b_2)\) to exist, but not \((b_2, b_1)\).

Because our edges are now directed, we will draw them using arrows rather than simple lines. Here are two examples of control flow graph diagrams.

Program code	Control flow graph
`x = 0 print(x)`
`x = 0 print(x) if x > 5: print('x is big') else: print('x is small')`

One subtlety with our second example is that the if condition x > 5 doesn’t appear in its own basic block! Instead, it’s merged with the previous block, since it is always executed immediately after the print(x) call. This illustrates what we mean by treating basic blocks as maximal: while we could have used a separate block for the x > 5, it would have been redundant to do so.

While loops

Next, let’s look at how a while loop is represented using a control flow graph. Here’s a simple example to start with:

Program code	Control flow graph
`x = 0 while x < 10: print(x) x += 1 y = x + 3`

This diagram is a bit more compleicated than the one before. The while loop condition (x < 10) is its own basic block, and has two different blocks that can execute after it: the body of the while loop and the statement after the loop (y = x + 3). After the body of the while loop executes, the while loop condition executes again, which is why there’s an edge from the basic block representing the loop body back to the loop condition.

As we learned when deadling with abstract syntax tree, program code is naturally recursive: inside the body of the while loop, we aren’t just limited to simple statements. We can have compound statements like if statements, or evey other while loops. Here’s a second example program, and corresponding control flow graph:

Program code	Control flow graph
`x = 0 while x < 10: if x > 5: print(x) else: print('x less than 6') x += 1`

The subgraph contained in the red region is a subgraph representing the loop body. If the while loop condition evaluates to True, then the program flow goes “into” this subgraph. Notice that the “end” of the body is represented by the basic block x += 1 in the subgraph; after this point is reached, the while loop condition is executed again.

Return statements

Finally, let’s consider how return statements are represented in our control flow graphs. As we know, when a return statement is executed, the function body immediately terminates. This means that when a basic block contains a return statement, that block must have a single edge leaving it, that goes to the special vertex representing the end of the program. Here’s an example of a control flow graph for a function:

Program code	Control flow graph
`def f(x: int) -> None: if x > 10: return else: y = x + 1 print(x * y)`

Control flow graphs and PythonTA

Now, control flow graphs are not merely of theoretical interest! Like abstract syntax trees, they are used as a form of static analysis when compiling program code or analysing it for errors. To wrap up this section, we’ll see how PythonTA itself uses control flow graphs to perform two different checks on your code.

Tools like PythonTA that perform static analysis on python source code rely heavily on control-flow graphs for some of its checks. Let’s look at two checks PythonTA does, and understand how it uses control-flow graphs to reason about the code.

One iteration check

Here is a common error that students often make when implementing a search through a list: Even though we would normally implement this using a for loop, we’re using a while loop here to make the translation to control flow graphs easier.

def has_even(numbers: list[int]) -> bool:
    """Return whether numbers has an even number.
    """
    i = 0
    while i < len(lst):
        if lst[i] % 2 == 0:
            return True
        else:
            return False

        i += 1

Because the if statement inside the while loop has a return statement in both the if and else branches, this loop will only every run for one iteration, despite the i += 1 at the bottom of the loop body. This means the return value will be based only on the first value in the list, rather than checking all items (in case an even number appears later in the list). There’s actually another bug, when the list is empty. We’re ignoring this bug here—one is enough!

Let’s see how this problem manifests itself in a control flow graph. Here’s the control flow graph for the above function:

Control flow diagram of has_even.

If you compare this to our control flow graph of the while loop example from above, something should jump out at you: none the basic blocks for the loop body connect back to the loop condition, and instead end at the “end block”. And this is how we know that there’s only going to be one iteration of the loop!

In other words, we can reframe the question of whether a loop will always have a single iteration into a question about the corresponding control flow graph: is the basic block of the loop condition part of a cycle consisting of itself and the basic blocks representing the loop’s body? If we can find such a cycle, the loop can have more than one iteration (in at least some cases); if we can’t, then the loop is guaranteed to only execute for one iteration. This is actually the algorithm that PythonTA implements for its “one iteration” checker!

Possibly undefined variable check

The second check we’ll study detects variables that many be used before they are defined. Consider, for example, the following Python code:

x = 0
if is_special(x):
    x = 5
else:
    y = 5
print(x + y)

At the last line print(x + y), the variable y might not be defined, if is_special(x) returns True and the if branch executes. Intuitively, a variable is possibly undefined when it is possible to execute the program and reach that variable being evaluated before it is assigned.

Let’s see how we can represent this property using control flow graphs. Here’s a control flow graph for the above code.

Control flow graph for possibly undefined variable example, with the path from the start node to the print(x + y) node highlighted.

Our intuitive notion of “reaching a variable before it is assigned” is reflected in the control flow graph as a path from the starting basic block to the basic block calling print(x + y) that doesn’t contain any assignment statements to y. So in general, PythonTA checks for possibly undefined variables by searching the possible paths through the program’s control flow graph, looking for the existence of paths where a variable is used without an assignment statement for that variable occurring earlier in the path. Awesome!