In this series I will explain what control flow is, how it is represented for program analysis, and the various nuances for practical implementations that target Java bytecode.

Basics

Control flow is a term used to describe the order in which the statements or instructions in an imperative program (e.g. C, Java) will execute, kind of like seeing a stack-trace when an exception is thrown in Java, except we are usually more concerned with analysing these paths statically (i.e. at compile time) rather than after the statements have actually executed.

Here is an example of a simple linear Java program, i.e. the instructions are executed in the order they are presented in:

public static void formula(int x, int y) {
    int z = 2*x + x*y;
    System.out.println(z);
}

Note that to compute the value of z, 2*x is computed, then x*y, then these subexpressions are added together and the result is stored in z.

The corresponding bytecode for this method makes this clear:

iconst_2                                  // push 2
iload 0                                   // push x
imul                                      // compute 2*x
iload 0                                   // push x
iload 1                                   // pusy y
imul                                      // compute x*y
iadd                                      // add 2*x + x*y
istore 2                                  // store the result in z
getstatic System.out::PrintStream         // get the System.out object
iload 2                                   // push z
invokevirtual PrintStream println::((I)V) // print z
return                                    // Terminate 

Jumps

Programs are not linear: they jump to and from other functions, perform operations repeatedly in loops and often have some mechanism for catching exceptions or signalling when they happen.

public static int abs(int x) {
    if(x < 0) {
        return -x;
    } else {
        return x;
    }
}

Here we have an example of conditional logic, i.e. depending on the value of x, we take a different branch. In bytecode it looks like this:

L1:
  iload 0 // Push x
  ifge L3 // If x >= 0 then jump to L3, else continue to L2
L2:
  iload 0 // Push x (here x < 0)
  ineg    // Compute -x
  ireturn // Return -x
L3:
  iload 0 // Push x (here x >= 0)
  ireturn // Return x

Here we have certain instructions annotated with labels. Labels are just useful ways to annotate points in the code(program points) and are not executed themselves. Each instruction can be annotated with a label but here we don’t need the ability to reference every individual instruction.

In the actual class file and on the processor the relative instruction offsets are used, i.e. if we want to jump to an instruction that is 2 instructions after the current instruction, then we use that number to jump instead of referencing a label.

You can also see that the instructions between labels resemble the blocks/branches in the original Java.

The important instruction here is on the third line, ifge L3. Instead of testing x < 0, we check the inverse x >=0 and if the inverse is true, then we jump to L3. If the inverse is not true, we continue to the next instruction starting at L2.

Loops

Using the two constructs we’ve learnt about, we can now build all sort of looping patterns.

Semantically for loops and while loops are the same. The only difference is really the source code syntax. For example, the following two pieces of code are the same:

public static int multiply10For(int x) {
    int result = 0;
    for(int i=0; i < 10; i++) {
        result += x;
    }
    return result;
}
public static int multiply10While(int x) {
    int result = 0;
    int i = 0;
    while(i < 10) {
        result += x;
        i++;
    }
    return result;
}

You may be able to see the similaries. The three main parts of the loops here are initialisation of i, the main body of the loop and a predicate that guards whether the body should be executed.

Astute readers may point out that the for loop also has an incremetor i++ and while this is not directly part of the while loop syntax, it is semantically part of the body of both loops. Therefore we can transform between each of these forms.

These types of loops have a general form when translated to bytecode:

Start:
  // Initialisation here
Head:
  if the predicate is true, goto Body, else goto Exit
Body:
  // Loop body here
  goto Head
Exit:
  // Terminate

or equivalently

Start:
  // Initialisation here
  goto Head
Body:
  // Loop body here
Head:
  if the predicate is true, goto Body, else goto Exit
Exit:
  // Terminate

The actual multiply10 methods are both compiled to the exactly same bytecode:

Start:
  iconst_0        // Push 0
  istore 1        // result = 0
  iconst_0        // Push 0
  istore 2        // i = 0
  goto Head       // Unconditional jump to Head
Body:
  iload 1         // Load result
  iload 0         // Load x
  iadd            // result + x
  istore 1        // result = result + x
  iinc 2 1        // i++
Head:
  iload 2         // Push i
  bipush 10       // Push 10
  if_icmplt Body  // If i < 10 then goto Body, else goto Exit
Exit:
  iload 1         // Push result
  ireturn         // Return result

Do While loops

Do while loops are very similar to while loops, except the body is executed once, regardless of the guard. Therefore they execute the body first, then check the predicate:

Start:
  // Initialisation here
  // goto Head -- No goto here, execute the body first time around
Body:
  // Loop body here
Head:
  // Guard/predicate is exactly the same
  if the predicate is true, goto Body, else goto Exit
Exit:
  // Terminate

As an aside, the semantic translation between while loops and do-while loops isn’t as simple as while/for loops even though the bytecode looks almost identical.

Basic Blocks

I mentioned before that it would be fine to give every single instruction it’s own unique preceding label without changing the meaning of the program. Obviously this is not very useful as it makes it harder to see where code can jump to (as you can only jump to points that have a label).

Therefore we want to try and split or chunk the code of a method into as few parts as possible and only insert the labels where they have to be. A basic block is a sequence of instructions that execute one instruction at a time and have a single flow altering instruction at the end.

Here are some examples:

  iconst_5
  istore 0
  bipush 10
  istore 1
  return
new IllegalArgumentException
dup
aload 0
invokespecial IllegalArgumentException <init>::((String)V)
athrow

Flow altering instructions are any that return from a method, goto, conditional jumps, jsr/ret(jump/return subroutine) or athrow (throw an exception). We will not consider method invocations as jumps here as it complicates things and we can get away with modelling blocks regardless.

An example of an invalid basic block is:

iload 0
ifge L3
iload 0
ineg

As the ifge L3 instruction is flow altering and doesn’t occur at the end of the block.

iload 0
iload 0
ifge L3
iflt L4

Would also be invalid as there are multiple flow altering instructions in the block.

Control Flow Graphs

Having introduced these ideas, the final part of this post will be an introduction the Control Flow Graph(CFG). A CFG is a model or representation of the code in a single method in terms of basic blocks(BB’s) and the relationships between them. It is not a graph that displays values but a graph that consists of nodes and edges.

This CFG shows the abs method from earlier. I moved the ireturn instruction from both of the middle blocks to the end (like refactoring) but the code does the same thing. We can see that the edges in this graph convey a sense of “may flow to”.

Some terminology:

  • The head/entry block is the first block that starts executing when a method is called. There can only be one (shown in green here)
  • There can be multiple tail/exit blocks (red) which must end with either a return or throw instruction (assuming the method terminates at all)
  • B is a successor of A means that block B may be executed directly after block A is finished - these are just the edges in the graph
  • A is a predecessor of B means that A comes directly before B.

This is the for/while loop example. We can see that there is a loop between the head and the body blocks, however, the head block has the decision to either execute the body again or go to the exit. In this example, the head block is a successor of the entry block, and the head and body blocks are successors and predecessors of each other!

Conclusion

This post introduced the concepts of control flow and built up to the control flow graph model of program representation. This model is very common and enables a whole host of analyses and transformations that a compiler or optimisation tool can leverage.

In subsequent posts I will describe some uses of CFG’s, extensions to this model when considering the full Java bytecode specification and how we can extend it further to include the effects of method invocations. Stay tuned!