Peg.js and custom grammars

https://git.io/vH1iV

@sudodoki

Sprint-42

Formerly RailsReactor

kottans.org

Task @ RailsReactor

Let user type SQL-like queries. <COLUMN> <KEYWORD> <ARGS>

Hm, and how do they actually do it?

JS-devs & CS

Cool old books

Talk overview

  • theory: common ground & vocabulary
  • PEG.js: not a silver bullet, but it works

Part I: theory

Working with text

  • structured – Parsing
  • unstructured – NLP

ALGOL 60

  • Spec comes out in 1960
  • Ned Irons publishes ALGOL parser, based on formal notation

String -> Tokens -> AST / Parse Tree

"ID = 3"

[<COLUMN, ID> <KEYWORD, EQL> <ARG, 3>]

Parsing application

  • languages / dsl / data format / custom protocol
  • data mining / parsing
  • reading language specs

Formal Grammar

  • T – terminals
  • N – non-terminals
  • S – root
  • R – production rules

BNF

Backus–Naur form

Letter = 'A' / 'B' / … / 'Z'
Digit = [0-9]
Integer = Digit Integer / Digit

Same thing, different production rules

  • <Name> = [A-Z] [a-z]+
  • <Name> = [A-Z] <Letters>
  • <Name> = [A-Za-z] <Name>

Tokens – product of lexical analysis

<name, attribute>

  • <COLUMN, "Id">
  • <KEYWORD, "=">
  • <ARG, 1>

Terminals <=> Tokens

Let's talk only context-free grammars

Building parse trees

  • Bottom-up
  • Top-down
  • [Universal]

Sample grammar

Start = Term
Digit = [0-9]
Sign = '-' / '+'
Term = Digit Sign Term / Digit

Sample tree

2 + 3 - 4

Bottom-up

Top-down

Vanilla JS Recursive descent~ish

let result = [];
let curIndex = 0;
const descend = (input) => {
    term(input);
    return result;
}
const term = (input) => {
 result.push(['term']);
 digit(input);
 if (input[curIndex + 1]) {
   sign(input);
   term(input);
 }
}
const sign = (input) => {
 const curChar = input[curIndex];
 if (isSign(curChar)) {
  result.push(['sign', curChar])
  curIndex += 1;
 } else {
  throw new 
   SyntaxError(`${curChar||'∅'}\
instead of sign`); } }
const isSign = (c) =>
  ['+', '-'].includes(c)
const isDigit = (c) =>
  '0123456789'
    .split('')
    .includes(c);
const parsed = JSON.stringify(descend("2+3-4"))
console.log('parsed: ', parsed)
// parsed:  [["term"],["digit","2"]
// ["sign","+"],["term"],["digit","3"]
// ["sign","-"],["term"],["digit","4"]]

Backtracking

vs

Lookahead

Parser approaches

Other aspects

Part II: practice

PEG.js as representative of PEG

by @dmajda

When to build a parser

  • when simple regexp is not good enough
  • keeping some context / etc
  • loose syntax / casing

Peg.js + chokidar-cli

npm i pegjs chokidar-cli
// in package.json scripts
chokidar 'parens.pegjs' 
  --initial 
  --silent 
  -c 'pegjs -o
    parens-parser.js
    parens.pegjs'

https://pegjs.org/online

Aka 'don't bother with installation'

Basic parsing expressions

  • Terms / Non-terms
  • "literals" / 'literals'
  • . - any char
  • [a-z] [0-9] [\u00C0-\u10FFFF] char ranges
  • * / + / ? / ()

Simple parens grammar

start = group
lpar = '('
rpar = ')'
nothing = ''
group = 
    lpar group rpar
    / group group
    / ''

Eliminating left recursion

Going from this…

group = 
    lpar group rpar
    / group group
    / ''

to this

group = lpar rest
  / lpar group rpar
  / ''
rest = ')' group
  / ')'

Run action

start = FLOAT
SIGN = "-" / "+"
DIGIT = [0-9]
INTEGER = 
  d:DIGIT int:INTEGER
  { /* other side effect */
    return d + int }
  / d:DIGIT 
  { return d }

Helpers

JS code at top of grammar file

{
const notNull = (thing) =>
  thing != null
}

Options / context

parser.parse(string, options);

Predicate

FLOAT = sign:SIGN?
  whole:INTEGER? "."?
  fraction:INTEGER? &{
    return notNull(whole)
    || notNull(fraction) 
  }
  { return { sign,
             whole,
             fraction } }

Thinking about your grammar

  • case sensitivity
  • loose syntax
  • distinguishing lexemes
  • AST

Avoiding ambiguity

  • in terms of building parse tree
  • in terms of deciding token's type

Two cases

"ID" >= (20)
ID less than 20

How to tackle Precedence

// level2 * /
// level1 + -
start = level1
level1 =
    level2 ([+-] level2)+
    / level2
level2 =
    level0 ([*/] level0)+
    / level0
level0 = number / "(" level1 ")"
number = [0-9]+

Coming up with AST

  • Sequential
  • Non-sequential

Sample AST

{
column: { name, location, type }
keyword: { value, location }
argument: [ { value, location }]
}

Non-sequential

{
 complete: true,
 sequence: [
  { column: { name, location, type }},
  { whitespace: true },
  { keyword: { value, location, type }},
  { whitespace: true },
  { argument: [{ value, location }]}
 ]
}

Location

(start / end)

  • offset
  • line
  • column

Validation

  • Syntax error
  • Semantic error

Parsing-on-the-go

ID =

is a valid expression

Possible solutions?

  • ditching peg.js
  • using predicates and complete / non-complete attrs
  • swallowing parse errors and trying to figure out how lethal it is
  • incorporating 'incomplete' grammar
progressiveExpression =
  column:column whitespace keyword:keyword whitespace* argument:inprogArgumentList
  { return { complete: false, sequence: [makeColumn(column), makeWhitespace(),
             makeKeyword(keyword), makeWhitespace(), makeArgument(argument) ] }}
  / column:column whitespace keyword:keyword whitespace+
  { return { complete: false, sequence: [makeColumn(column), makeWhitespace(), 
             makeKeyword(keyword), makeWhitespace()] }}
  … 3 redacted
  / column:column { return { complete: false, sequence: [makeColumn(column)] }}
  / inprogColumn  

Syntax errors and messages

inprogEntity '' = …

Prompting is basically

  • manipulating AST
  • stringifying

Ignoring case

equalOp = ['equal']i 's'i?

What would I redo?

Alternatives

Further reading

Questions?

Book thingy