How to Create a Programming Language using Python?

In this article, we will show you how to create your own programming language using SLY(Sly Lex Yacc) and Python. It is important to note that this is not a beginners' tutorial, and you need to be familiar with the prerequisites given below.

Requirements

Compiler design knowledge is rough.
Knowledge of lexical analysis, parsing, and other compiler design aspects.
Regular expression understanding.
Knowledge of the Python programming language.

Getting Started

Install the Python version of SLY. SLY simplifies our process by lexing and parsing documents.

Geek alert! Learn Python programming basics with the Python Programming Foundation Course.

The Python DS Course will help you improve your Data Structures concepts for your interview. To get started with your Machine Learning journey, join the Machine Learning - Basic Level Course

pip install sly

Building a Lexer

Compilers begin by converting the character streams (the high level program) into token streams. This is done through a process called lexical analysis. However, SLY simplifies the process

First let’s import all the necessary modules.

Python3

"from sly import Lexer"

Now let's create a class BasicLexer that extends the Lexer class from SLY. Let's build a simple arithmetic compiler. We will need some basic tokens like NAME, NUMBER, and STRING. In any programming language, there will be a space between two characters. Hence, ignore literals are created. Then we also create the basic literals like ‘=’, ‘+’ etc., NAME tokens are basically names of variables, which can be defined by the regular expression [a-zA-Z_][a-zA-Z0-9_]*. STRING tokens are string values and are bounded by quotation marks(” “). This can be defined by the regular expression \”.*?\”.

If we find a digit/s, we should assign it to the token NUMBER and the number should be stored as an integer. As this is a basic programmable script, let's just use integers, however, feel free to extend it to decimals, longs, etc. We can also provide comments. Whenever we find “//”, we ignore whatever that comes next in that line. We do the same thing with new line character. Thus, we have build a basic lexer that converts the character stream to token stream.

Python3

"class BasicLexer(Lexer):

tokens = { NAME, NUMBER, STRING }

ignore = '\t '

literals = { '=', '+', '-', '/',

'*', '(', ')', ',', ';'}

# Define tokens as regular expressions

# (stored as raw strings)

NAME = r'[a-zA-Z_][a-zA-Z0-9_]*'

STRING = r'\".*?\"'

# Number token

@_(r'\d+')

def NUMBER(self, t):

# convert it into a python integer

t.value = int(t.value)

return t

# Comment token

@_(r'//.*')

def COMMENT(self, t):

pass

# Newline token(used only for showing

# errors in new line)

@_(r'\n+')

def newline(self, t):

self.lineno = t.value.count('\n')"

Building a Parser

First let’s import all the necessary modules.

Python3

"from sly import Parser"

Let's now build a BasicParser class that extends Lexer. The BasicLexer passes the token stream to a variable tokens. In most programming languages, the precedence is defined. For example, in the program below, most of the parsing is very simple. No data is passed when there is none. Essentially you can press enter on your keyboard(without typing in anything) and go to the next line. Next, your language should comprehend assignments using the “=”. This is handled in line 18 of the program below. The same thing can be done when assigned to a string.

"Python3

class BasicParser(Parser):

#tokens are passed from lexer to parser

tokens = BasicLexer.tokens

precedence = (

('left', '+', '-'),

('left', '*', '/'),

('right', 'UMINUS'),

)

def __init__(self):

self.env = { }

@_('')

def statement(self, p):

pass

@_('var_assign')

def statement(self, p):

return p.var_assign

@_('NAME "=" expr')

def var_assign(self, p):

return ('var_assign', p.NAME, p.expr)

@_('NAME "=" STRING')

def var_assign(self, p):

return ('var_assign', p.NAME, p.STRING)

@_('expr')

def statement(self, p):

return (p.expr)

@_('expr "+" expr')

def expr(self, p):

return ('add', p.expr0, p.expr1)

@_('expr "-" expr')

def expr(self, p):

return ('sub', p.expr0, p.expr1)

@_('expr "*" expr')

def expr(self, p):

return ('mul', p.expr0, p.expr1)

@_('expr "/" expr')

def expr(self, p):

return ('div', p.expr0, p.expr1)

@_('"-" expr %prec UMINUS')

def expr(self, p):

return p.expr

@_('NAME')

def expr(self, p):

return ('var', p.NAME)

@_('NUMBER')

def expr(self, p):

return ('num', p.NUMBER)"

By using expressions, the parser can also parse arithmetic operations. Consider the following example. Line by line, all of them are turned into token streams and parsed line by line. Therefore, according to the program above, a = 10 resembles line 22. Same for b =20. a + b resembles line 34, which returns a parse tree (‘add’, (‘var’, ‘a’), (‘var’, ‘b’)).

"GFG Language > a = 10

GFG Language > b = 20

GFG Language > a + b

30"

Now we have converted the token streams to a parse tree. Next step is to interpret it.

Execution

Interpreting is a simple process. Hierarchically evaluating arithmetic operations is basically what we do by looking at the tree and walking through it. The entire tree is evaluated recursively until the answer is obtained. Let's consider, for instance, five plus seven plus four. A lexer first tokenizes this character stream into a token stream. The token stream is then parsed to form a parse tree. The parse tree essentially returns (‘add’, (‘add’, (‘num’, 5), (‘num’, 7)), (‘num’, 4)). (see image below)

The interpreter is going to add 5 and 7 first and then recursively call walkTree and add 4 to the result of addition of 5 and 7. Thus, we are going to get 16. The below code does the same process. Python3

class BasicExecute:

"def __init__(self, tree, env):

self.env = env

result = self.walkTree(tree)

if result is not None and isinstance(result, int):

print(result)

if isinstance(result, str) and result[0] == '"':

print(result)

def walkTree(self, node):

if isinstance(node, int):

return node

if isinstance(node, str):

return node

if node is None:

return None

if node[0] == 'program':

if node[1] == None:

self.walkTree(node[2])

else:

self.walkTree(node[1])

self.walkTree(node[2])

if node[0] == 'num':

return node[1]

if node[0] == 'str':

return node[1]

if node[0] == 'add':

return self.walkTree(node[1]) + self.walkTree(node[2])

elif node[0] == 'sub':

return self.walkTree(node[1]) - self.walkTree(node[2])

elif node[0] == 'mul':

return self.walkTree(node[1]) * self.walkTree(node[2])

elif node[0] == 'div':

return self.walkTree(node[1]) / self.walkTree(node[2])

if node[0] == 'var_assign':

self.env[node[1]] = self.walkTree(node[2])

return node[1]

if node[0] == 'var':

try:

return self.env[node[1]]

except LookupError:

print("Undefined variable '"+node[1]+"' found!")

return 0"

Displaying the Output

In order to display the interpreter's output, we need to write some codes. In the code, the lexer should be called first, then the parser, then the interpreter, and finally the output should be retrieved. After that, the output is displayed.

"Python3

if __name__ == '__main__':

lexer = BasicLexer()

parser = BasicParser()

print('GFG Language')

env = {}

while True:

try:

text = input('GFG Language > ')

except EOFError:

break

if text:

tree = parser.parse(lexer.tokenize(text))

BasicExecute(tree, env)It is necessary to know that we haven’t handled any errors. So SLY is going to display error messages whenever you do something that isn't specified by your rules.

Execute the program you have written using,

"python you_program_name.py"

Footnotes

Our interpreter is very basic. Of course, it can be enhanced in a variety of ways. Conditionals and loops can be added. The design can be modular or object oriented. Among the features that can be added to the same are module integration, method definitions, and parameterization of methods.

How to Create a Programming Language using Python?

Requirements

Getting Started

Building a Lexer

Execution

Displaying the Output

Footnotes

You May Also Like!