Lexing

Regular Expressions

See regex notes

Parsing

I personally do not enjoy front end issues as much as back end. And yet a front to back compiler course starts there. Kind of off putting.

BNF Context Free Grammars Parsing Expression Grammar LL LALR

How do you get good error messages

sy brand paper on how compilter diagnostics could be imporved

Grammars

https://en.wikipedia.org/wiki/Formal_grammar String rewrite rules. Distinguishes between terminal and non terminal

Context free means the rules only have raw non terminals on left hand side

https://en.wikipedia.org/wiki/Linear_grammar

Algorithms

List of algorithms - parsing Recursive Descent Earley parser Pratt Parser LL Parser LR Parser packrat allstar

Parser Combinators

String -> [(a,String)] A non deterministic search that consumes a prefix of the input. parsec

Alternative persecptive - each production is string position -> [nextposition]

Just making a parse tree acceptor vs something that outputs something.

Leermakers 1993- The functional treatement of parsing Memoizing parser combinators tabling

Hammer binary format parser combinators in C

Parser Generators

Flex yacc/bison antlr Menhir error handling the new way Sam’s coq stuff https://github.com/slasser/AllStar https://github.com/slasser/CoStar

Semantic Actions

menhir manual http://gallium.inria.fr/~fpottier/menhir/manual.html LR(1) parser menhir offers a coq mode?

Generating LR syntax error messages from examples

Shift Reduce Parsing

See Appell’s book https://en.wikipedia.org/wiki/Shift-reduce_parser

shift takes token off stream and puts onto stack reduce takes top of atxckc

Shift reduce conflicts

Hand Rolled Parsers

So the story goes, really good parsers with good error messaging are hand rolled. What is the structure. What are the good techniques

Recursive Descent

Recusrive Ascent

https://en.wikipedia.org/wiki/Recursive_ascent_parser

recursive acent parsing

Treesitter graph tee sitter souffle tree-sitter

Sexp

S-expressions are a good example. They are kind of the core simple parse, easy enough to do basic ones by hand.

test = """ (hi there 
    (my guy     
((how's () it going ()))))"""

stack = [[]]
depth = 0
tok = []
for n, c in enumerate(test):
    if c == "(":
        depth += 1
        stack.append([])
    elif c == ")":
        depth -= 1
        if depth < 0:
            raise Exception(f"Unbalanced paren at char {n}")
        else:
            e = stack.pop()
            stack[-1].append(e)
    elif c in " \t\n":
        if len(tok) > 0:
            stack[-1].append("".join(tok))
            tok = []
    else:
        tok.append(c)
if depth != 0:
    raise Exception("unclosed paren")
print(stack[0][0])

# recursive descent parser
def parse(n,s):
    ns = len(s)
    sexp = []
    tok = []
    while n < ns:
        c = s[n]
        n += 1
        if c == "(":
            n,x = parse(n,s)
            sexp.append(x)
        elif c == ")":
            return n, sexp
        elif c in " \t\n":
            if len(tok) > 0:
                sexp.append("".join(tok))
                tok = []
        else:
            tok.append(c)
    return n,sexp

print(parse(0,test))

https://rosettacode.org/wiki/S-expressions#Python use regex. We will want to parse numbers and strings.

Flex Bison

https://begriffs.com/posts/2021-11-28-practical-parsing.html Crazy stuff.

lex creates yylex which lexes from a file handler Options can be made to not use globals

the parser calls yylex, which returns if it wants t give bakc a token The tokens correspond to define integers

global variables in the prelude.

Antlr

upper case are lexer rules, lower case are parse rules

grammars can import other grammars

Here is an antlr4 grammar of sexp

/*
The MIT License

Copyright (c) 2008 Robert Stehwien

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.

*/

/*
Port to Antlr4 by Tom Everett
*/
grammar sexpression;

sexpr
   : item* EOF
   ;

item
   : atom
   | list_
   | LPAREN item DOT item RPAREN
   ;

list_
   : LPAREN item* RPAREN
   ;

atom
   : STRING
   | SYMBOL
   | NUMBER
   | DOT
   ;

STRING
   : '"' ('\\' . | ~ ('\\' | '"'))* '"'
   ;

WHITESPACE
   : (' ' | '\n' | '\t' | '\r')+ -> skip
   ;

NUMBER
   : ('+' | '-')? (DIGIT)+ ('.' (DIGIT)+)?
   ;

SYMBOL
   : SYMBOL_START (SYMBOL_START | DIGIT)*
   ;

LPAREN
   : '('
   ;

RPAREN
   : ')'
   ;

DOT
   : '.'
   ;

fragment SYMBOL_START
   : ('a' .. 'z')
   | ('A' .. 'Z')
   | '+'
   | '-'
   | '*'
   | '/'
   | '.'
   ;

fragment DIGIT
   : ('0' .. '9')
   ;

Misc

grammar zoo

awesome binary parsing