Parsing and evaluating mathematical expressions in Swift / Part 2: Building a syntax tree

Ever needed to interpret mathematical expressions with variables, like a.field1 + (a.field2 - b.field1) * 2, in Swift? I did. This series of blog posts will walk you through my solution. This is part 2 of the series:

  1. Tokenization
  2. Building a syntax tree
  3. Evaluating the syntax tree

Parsing a list into a syntax tree

A refresher: we're looking at this expression: a.field1 + (a.field2 - b.field1) * 2. We've now chopped it up into a list of tokens.

What you want next: something that lets you interpret it easily. One option would be a stack, like in Forth or Reverse Polish notation. That would mean transforming the expression to into a Swift array like this:

[.name(["a", "field1"]), .name(["a", "field2"]), .name(["b", "field1"]), .subtraction, .number(2), .multiplication, .addition]

That would allow you interpret it with a stack machine, popping operands and operators off the stack one by one.

Another option, and the option I'm going for, is a syntax tree. It's a tree you can walk to evaluate your expression. Make a tree where each operator is a node and its operands subnodes, then when evaluating first evaluate the subnodes before evaluating the node itself. A diagram of the tree would look like this:

      ┌───│ + │───┐       
      ▼   └───┘   ▼       
┌──────────┐    ┌───┐     
│ a.field1 │ ┌──│ * │──┐  
└──────────┘ │  └───┘  │  
             ▼         ▼  
           ┌───┐     ┌───┐
      ┌────│ - │────┐│ 2 │
      │    └───┘    │└───┘
      ▼             ▼     
┌──────────┐  ┌──────────┐
│ a.field2 │  │ b.field1 │
└──────────┘  └──────────┘

To model that tree in Swift, you'll want a few types. Indirect enums are great for tree nodes:

enum ExpressionOperator {
    case addition
    case subtraction
    case multiplication
    case division

enum ExpressionTreeAtom {
    case number(Double)
    case name([String])

indirect enum ExpressionTreeNode {
    case atom(ExpressionTreeAtom)
    case operation(ExpressionOperator, ExpressionTreeNode, ExpressionTreeNode)

To get from the token list to a data structure built with those types, the thing we have to figure out is operator precedence. There are a bunch of algorithms for solving that. I picked Eli Bendersky's article on precedence climbing as my guide and was happy with the result. The idea is to define precedence levels (* is solved before +) and associativity (in what order is a list of equal-precedence operators solved) for the operators we support, then recursively handle the tokens.

Let's start by defining precedences and associativities:

enum Associativity: Equatable {
    case left
    case right

struct OperatorInfo {
    let precedence: Int
    let associativity: Associativity

extension ExpressionOperator {
    var operatorInfo: OperatorInfo {
        switch self {
        case .addition: return OperatorInfo(precedence: 1, associativity: .left)
        case .subtraction: return OperatorInfo(precedence: 1, associativity: .left)
        case .multiplication: return OperatorInfo(precedence: 2, associativity: .left)
        case .division: return OperatorInfo(precedence: 2, associativity: .left)

With those definitions ready, we're ready for the algorithm itself. It's two mutually recursive functions, walking an ArraySlice of the tokens with the space tokens filtered out. computeAtom handles individual atoms, computeExpr compound expressions. When computeAtom runs into an parentheses it calls computeExpr, and computeExpr calls computeAtom to get a value for the first operand and itself to solve the second one. It's a pretty direct translation of Bendersky's example to Swift, with the difference that he computed values while walking the token list, whereas my code creates the syntax tree for evaluation later.

The precedence values define how expressions are divided into subexpressions, and so control how the recursive computeExpr calls consume parts of the expression.

func computeAtom(tokens: inout ArraySlice<ExpressionToken>) throws -> ExpressionTreeNode {
    guard let tok = tokens.first else {
        throw ExpressionParseError.unexpectedEndOfExpression
    switch tok {
    case .openParen:
        tokens = tokens.dropFirst()
        let subNode = try computeExpr(tokens: &tokens, minimumPrecedence: 1)
        guard case .closeParen = tokens.first else {
            throw ExpressionParseError.unbalancedParentheses
        tokens = tokens.dropFirst()
        return subNode
    case .addition,
        guard let op = ExpressionOperator(token: tok) else {
            preconditionFailure("Failed to convert token \(tok) to Operator")
        throw ExpressionParseError.unexpectedOperator(op)
    case .closeParen:
        throw ExpressionParseError.unbalancedParentheses
    case let .number(num):
        tokens = tokens.dropFirst()
        return .atom(.number(num))
    case let .name(name):
        tokens = tokens.dropFirst()
        return .atom(.name(name))
    case .space:
        preconditionFailure("Unexpected space token")

private func computeExpr(
    tokens: inout ArraySlice<ExpressionToken>,
    minimumPrecedence: Int
) throws -> ExpressionTreeNode {
    var node1 = try computeAtom(tokens: &tokens)
    while let tok = tokens.first {
        guard let op = ExpressionOperator(token: tok) else { break }
        let opInfo = op.operatorInfo
        guard opInfo.precedence >= minimumPrecedence else { break }
        let nextMinimumPrecedence = opInfo.precedence + (opInfo.associativity == .left ? 1 : 0)
        tokens = tokens.dropFirst()
        let node2 = try computeExpr(tokens: &tokens, minimumPrecedence: nextMinimumPrecedence)
        node1 = .operation(op, node1, node2)
    return node1

We can kick off the computation by filtering out spaces and converting our tokens array into an ArraySlice.

func parseTokenizedExpression(expr: TokenizedExpression) throws -> ExpressionTreeNode {
    var tokens = expr.tokens.filter { $0 != .space }[...]
    return try computeExpr(tokens: &tokens, minimumPrecedence: 1)

That'll give us this tree:

    .atom(.name(["a", "field1"])),
            .atom(.name(["a", "field1"])),
            .atom(.name(["b", "field2"]))

Whew. We have now tokenized our string and generated a syntax tree. Only one thing left to do in part 3: evaluating the expression so we can get a value out of it.