Peg.js and custom grammars

https://git.io/vH1iV

@sudodoki

Sprint-42

Formerly RailsReactor

kottans.org

Task @ RailsReactor

Let user type SQL-like queries. <COLUMN> <KEYWORD> <ARGS>

Hm, and how do they actually do it?

JS-devs & CS

Cool old books

Talk overview

theory: common ground & vocabulary
PEG.js: not a silver bullet, but it works

Part I: theory

Working with text

structured – Parsing
unstructured – NLP

ALGOL 60

Spec comes out in 1960
Ned Irons publishes ALGOL parser, based on formal notation

String -> Tokens -> AST / Parse Tree

"ID = 3"

↓

`[<COLUMN, ID> <KEYWORD, EQL> <ARG, 3>]`

↓

Parsing application

languages / dsl / data format / custom protocol
data mining / parsing
reading language specs

Formal Grammar

T – terminals
N – non-terminals
S – root
R – production rules

BNF

Backus–Naur form

Letter = 'A' / 'B' / … / 'Z'
Digit = [0-9]
Integer = Digit Integer / Digit

Same thing, different production rules

<Name> = [A-Z] [a-z]+
<Name> = [A-Z] <Letters>
<Name> = [A-Za-z] <Name>
…

Tokens – product of lexical analysis

`<name, attribute>`

<COLUMN, "Id">
<KEYWORD, "=">
<ARG, 1>

Terminals <=> Tokens

Let's talk only context-free grammars

Building parse trees

Bottom-up
Top-down
[Universal]

Sample grammar

Start = Term
Digit = [0-9]
Sign = '-' / '+'
Term = Digit Sign Term / Digit

Sample tree

`2 + 3 - 4`

Bottom-up

Top-down

Vanilla JS Recursive descent~ish

let result = [];
let curIndex = 0;
const descend = (input) => {
    term(input);
    return result;
}

const term = (input) => {
 result.push(['term']);
 digit(input);
 if (input[curIndex + 1]) {
   sign(input);
   term(input);
 }
}

const sign = (input) => {
 const curChar = input[curIndex];
 if (isSign(curChar)) {
  result.push(['sign', curChar])
  curIndex += 1;
 } else {
  throw new 
   SyntaxError(`${curChar||'∅'}\
instead of sign`); } }

const isSign = (c) =>
  ['+', '-'].includes(c)
const isDigit = (c) =>
  '0123456789'
    .split('')
    .includes(c);

const parsed = JSON.stringify(descend("2+3-4"))
console.log('parsed: ', parsed)
// parsed:  [["term"],["digit","2"]
// ["sign","+"],["term"],["digit","3"]
// ["sign","-"],["term"],["digit","4"]]

Backtracking

vs

Lookahead

Parser approaches

Other aspects

LL / LR / LALR / SLR
SAX
PEG & linear time => http://bford.info/packrat/
Formal grammars and Automatons

Part II: practice

PEG.js as representative of PEG

by @dmajda

When to build a parser

when simple regexp is not good enough
keeping some context / etc
loose syntax / casing

Peg.js + chokidar-cli

npm i pegjs chokidar-cli
// in package.json scripts
chokidar 'parens.pegjs' 
  --initial 
  --silent 
  -c 'pegjs -o
    parens-parser.js
    parens.pegjs'

https://pegjs.org/online

Aka 'don't bother with installation'

Basic parsing expressions

Terms / Non-terms
"literals" / 'literals'
. - any char
[a-z] [0-9] [\u00C0-\u10FFFF] char ranges
* / + / ? / ()

Simple parens grammar

start = group
lpar = '('
rpar = ')'
nothing = ''
group = 
    lpar group rpar
    / group group
    / ''

Eliminating left recursion

Going from this…

group = 
    lpar group rpar
    / group group
    / ''

to this

group = lpar rest
  / lpar group rpar
  / ''
rest = ')' group
  / ')'

Run action

start = FLOAT
SIGN = "-" / "+"
DIGIT = [0-9]
INTEGER = 
  d:DIGIT int:INTEGER
  { /* other side effect */
    return d + int }
  / d:DIGIT 
  { return d }

Helpers

JS code at top of grammar file

{
const notNull = (thing) =>
  thing != null
}

Options / context

parser.parse(string, options);

Predicate

FLOAT = sign:SIGN?
  whole:INTEGER? "."?
  fraction:INTEGER? &{
    return notNull(whole)
    || notNull(fraction) 
  }
  { return { sign,
             whole,
             fraction } }

Thinking about your grammar

case sensitivity
loose syntax
distinguishing lexemes
AST

Avoiding ambiguity

in terms of building parse tree
in terms of deciding token's type

Two cases

"ID" >= (20)
ID less than 20

How to tackle Precedence

// level2 * /
// level1 + -
start = level1
level1 =
    level2 ([+-] level2)+
    / level2
level2 =
    level0 ([*/] level0)+
    / level0
level0 = number / "(" level1 ")"
number = [0-9]+

Coming up with AST

Sequential
Non-sequential

Sample AST

{
column: { name, location, type }
keyword: { value, location }
argument: [ { value, location }]
}

Non-sequential

{
 complete: true,
 sequence: [
  { column: { name, location, type }},
  { whitespace: true },
  { keyword: { value, location, type }},
  { whitespace: true },
  { argument: [{ value, location }]}
 ]
}

Location

(start / end)

offset
line
column

Validation

Syntax error
Semantic error

Parsing-on-the-go

`ID =`

is a valid expression

Possible solutions?

ditching peg.js
using predicates and complete / non-complete attrs
swallowing parse errors and trying to figure out how lethal it is
incorporating 'incomplete' grammar

progressiveExpression =
  column:column whitespace keyword:keyword whitespace* argument:inprogArgumentList
  { return { complete: false, sequence: [makeColumn(column), makeWhitespace(),
             makeKeyword(keyword), makeWhitespace(), makeArgument(argument) ] }}
  / column:column whitespace keyword:keyword whitespace+
  { return { complete: false, sequence: [makeColumn(column), makeWhitespace(), 
             makeKeyword(keyword), makeWhitespace()] }}
  … 3 redacted
  / column:column { return { complete: false, sequence: [makeColumn(column)] }}
  / inprogColumn

Syntax errors and messages

inprogEntity '' = …

Prompting is basically

manipulating AST
stringifying

Ignoring case

equalOp = ['equal']i 's'i?

Peg.js and custom grammars

https://git.io/vH1iV

@sudodoki

Sprint-42

Formerly RailsReactor

kottans.org

Task @ RailsReactor

Hm, and how do they actually do it?

JS-devs & CS

Cool old books

Talk overview

Part I: theory

Working with text

ALGOL 60

String -> Tokens -> AST / Parse Tree

"ID = 3"

[<COLUMN, ID> <KEYWORD, EQL> <ARG, 3>]

Parsing application

Formal Grammar

BNF

Backus–Naur form

Same thing, different production rules

Tokens – product of lexical analysis

<name, attribute>

Terminals <=> Tokens

Let's talk only context-free grammars

Building parse trees

Sample grammar

Sample tree

2 + 3 - 4

Bottom-up

Top-down

Vanilla JS Recursive descent~ish

Backtracking

vs

Lookahead

Parser approaches

Other aspects

Part II: practice

PEG.js as representative of PEG

by @dmajda

When to build a parser

Peg.js + chokidar-cli

https://pegjs.org/online

Aka 'don't bother with installation'

Basic parsing expressions

Simple parens grammar

Eliminating left recursion

Going from this…

to this

Run action

Helpers

JS code at top of grammar file

Options / context

Predicate

Thinking about your grammar

Avoiding ambiguity

Two cases

How to tackle Precedence

Coming up with AST

Sample AST

Non-sequential

Location

(start / end)

Validation

Parsing-on-the-go

ID =

Possible solutions?

Syntax errors and messages

Prompting is basically

Ignoring case

What would I redo?

Alternatives

Further reading

Questions?

Book thingy

sudodoki

@sudodoki

sudodoki

smd.deluzion@gmail.com

`[<COLUMN, ID> <KEYWORD, EQL> <ARG, 3>]`

`<name, attribute>`

`2 + 3 - 4`

`ID =`