Writing a Programming Language 101

Writing your own programming language is incredibly daunting. But if you start by writing a language that evaluates very simple functions it opens your mind to the possibility of writing an entire language and gives you a better understanding of what that would entail. So let’s start there. Let’s write a programming language that adds and multiplies numbers.

What’s in a Programming Language?

1. Parsers: Functions that read input and then tell the program what to do with it
2. Data Structure: The parser puts our input into an abstract syntax which our program can then evaluate
3. Nodes: Each node on the tree represents a value or expression which can then be evaluated to give us a value

Parser

In my opinion, the parser is where the magic happens. Writing a parser tells us how input will be interpreted. This is where you really determine the syntax of the programming language. For instance, in Ruby you can write 2 + 2 or 2+ 2 or (2+2) or ( 2 + 2 ) and they will all equal 4. That’s because there is a Ruby parser that doesn’t care about whitespace here and wants you to have a value on the left and on the right of the ‘+’ sign.

We are going to be very strict with our programming language in order to simplify our code. Before I share the operations that our language can perform, let’s call our code MarkLang (in honor of Mark who taught me how to write a language).

MarkLang Syntax
Addition: open paren + plus sign + whitespace + number + whitespace + number …. + close paren
Valid Examples: (+ 3 3) or (+ 6 2 10 56) or (+ 1 1 0 5 8 9)

Multiplication: open paren + asterisk + whitespace + number + whitespace + number …. + close paren
Valid Examples: (* 4 5) or (* 6 2) or (* 7 7 8 10 12)

Data Structure

The parser reads our input and turns it into structured data. The structure it forms is an abstract syntax tree, or a tree representation of the source code written in a programming language. It creates nodes with branches, like a tree.

Nodes

The parser turns input into nodes. For example, (+ 3 3) will be turned into three nodes. The Plus Node will have two branches off of it with a NumberNode (value = 3) on each end. For another example, let’s think about (+ 3 (+ 5 5)). This input will be turned in a data structure with five nodes. At the top there will be a PlusNode with two branches off it, a Number Node (value = 3) and another PlusNode. That second Plus Node will also have two brances with NumberNode (value = 5) at each end.

Every nodes needs to have an eval function so we can treat all nodes the same from the outside. Then each type of node’s eval method will do something different when we call it. This sounds like a case of polymorphism.

Okay, here are the codes: NumberNode (which just returns the number value), PlusNode(which adds all of the NumberNodes it’s given) and MultiplyNode (which multiplies all of the NumbderNodes it’s given).

class NumberNode
    def initialize(value)
        @value = value
    end
    
    def eval
        @value
    end
end

class PlusNode
    def initialize(nodes)
        @nodes = nodes
    end
    
    def eval
        result = 0
        @nodes.each do |node|
            result += node.eval
        end
        result
    end
end

class MultiplyNode
    def initialize(nodes)
        @nodes = nodes
    end
    
    def eval
        result = 1
        @nodes.each do |node|
            result *= node.eval
        end
        result
    end
end

Next, let’s write the Parser. Our parser will take in a string code like “(+ 2 2)” and parse it. We then go through the parse_expression method recursively to parse the expressions in parentheses and the numbers. We use a @position instance variable to keep our place in the code while we parse different parts of it.

class Parser
  attr_accessor :position, :code
  
  def initialize(code)
    @position = 0
    @code = code
  end
  
  def parse_expression
    if is_digit? code[@position]
      parse_number_node
    elsif code[@position] == '('
      parse_parenthesized_expression
    end
  end
  
  def parse_number_node
    number = ""
    while is_digit?(code[@position])
      number += code[@position]
      @position += 1
    end
    NumberNode.new number.to_i
  end
  
  def parse_parenthesized_expression
    if code[position + 1] == "*"
      parse_multiply_node
    elsif code[position + 1] == "+"
      parse_plus_node
    elsif code[(position + 1)..(position + 2)] == "if"
      parse_if_node
    end
  end
  
  def skip_whitespace
    until code[@position] != ' '
      @position += 1
    end
  end
  
  def parse_arguments
    @position += 3
    node_array = []
    while code[position] != ")"
      skip_whitespace
      node_array << parse_expression
    end
    @position += 1
    node_array
  end
  
  def parse_multiply_node
    MultiplyNode.new(parse_arguments)
  end
  
  def parse_plus_node
    PlusNode.new(parse_arguments)
  end
  
  def is_digit?(char)
    ('0'..'9').include? char
  end
end

Now we have a simple parser with three types of nodes. We just wrote the MarkLang programming language that can add and multiple numbers. Let's try it.

code = "(* 2 2 2)"
parser = Parser.new(code)
result = parser.parse_expression
puts result.eval  #=> 8 

Ta-da!

Leave a Reply

Your email address will not be published. Required fields are marked *