Parser From Scratch

Why Parser

improve enginerring efficiency..

markdown issue to fix

Tokenizer, Lexiconer, and Parser

這三者有什麼不同?我們直接用例子說明。

Ex1: Input “I love the dog!”

最簡單的 tokenizer (e.g. Tensorflow Keras 附的 tokenizer) 基本只負責斷 token 的工作,不負責 token 的 attribute => [“I”, “love”, “the”, “dog”, “!”].

Lexiconer 則負責 token + attribute, 不負責判斷是否合乎 syntax (句法或文法) => [(“I”, subject), (“love”, verb), (“the”, article), (“dog”, object), (“!”, punctuation)]

Parser 則有各種 rules for syntax, 判斷 Lexiconer 的 attribute 順序是否合乎 syntax, 例如 => S-V-O OK. 但如果是 “I the dog love!” 則 NOT OK.

Ex2: Input: 21 * (32 + 4) + ab

Tokenizer => [“21”, “*”, “(“, “32”, “+”, “4”, “)”, “+”, “ab”]

Lexiconer => [(“2”, int), (“*”, op), (“(“, para), (“3”, int), (“+”, op), (“4”, int), (“)”, para), (“+”, op), (“ab”, var)]

Reference

一般會用 abstract syntax tree (AST) 來表示這個結果。

image-20220203141940144

實務上很多 Tokenizer 已經包含 Lexiconer 的工作。好處是效率高,在 tokenize 順便決定 attribute. 壞處是這種 tokenizer 和 application 基本綁定。例如 markdown language 的 token attributes 和 javascript language 的 token attribute 顯然不同,也無法重覆使用。

Tokenizer:

Input: (unstructured or structured) string/text

Method: regular expression

Output: current tokens which includes

  • Type: NUMBER, STRING
  • Value: 43, “hello”
  • and pointer of next token

Input: Token from to

Parser:

Input: “structured” string, i.e. following predefined syntax

Method: tokenizer + … + validation

Output: AST: abstract syntax tree. => validate if the string follow the pre-defined syntax

AST 可以參考 AST explorer 的 output. 以 markdown lanaguage 為例。

Either use interactive tree format 如下

image-20220201155720924

或是 JSON 格式輸出如下。JSON 格式只適合給 program, 應該不是給人看的。

image-20220201155827066

Parser Step 1: Tokenizer

Input: (unstructured or structured) string of the program

Method: regular expression

Output: current tokens which includes

  • Type: NUMBER, STRING
  • Value: 43, “hello”
  • and pointer of next token

Parser Step 2: Statement List

基本把 input string 切分為 multiple statements, 利用 delimiter 如 ;, enter, etc.

Interpreter (line-base)

前面所說的 tokenizer, lexiconer, parser 都是整個 text 為 input for processing.

從 char processing (tokenizer), 到 token processing (lexicon, parser) 都是如此。沒有 line 的觀念。

另一個 dimension 是從 interactive 的角度來看 parser.

如果是整個 text, e.g. from a file, 顯然不是 interactive.

Interactive 一般是以 line 為單位。

User 打完一行,馬上就會有反應。

所以 interpreter 基本是 line processing.

乍看之下只要把 tokenizer, parser 的 input 變成一行 input 就可以。很多應用這樣即可。

但實際應用可能更複雜,因為有些 operation 可能要多行完成。

Useful Parser

Online AST explorer

Pros

  • 支持多種語言 (Markdown, Python, etc.)
  • 支持 interactive mode (for human) and JSON (for machine)

Cons

  • Only output AST (in JSON format)
  • Cannot be modified?

Python package: mistune (new version 2.0.2; old version 0.8.4)

Pros

  • Output renderer: AST (in XML format) or HTML
  • Can modified the python problem to expand, e.g. support MathJax

Cons

  • Only markdown
  • No interactive mode, hard to debug

Mistune Example (https://spec.commonmark.org/dingus/)

image-20220201225008495

HTML output:

image-20220201225144157

AST output (in XML format)

image-20220201225457362

Parser When and How

Parser 最主要的用途是和 syntax 相關的應用:

  1. Syntax validation and formatting (beautified)
  2. Statistics: 例如多少 headlines, lists, formula, etc.
  3. Translation: e.g. from markdown to html

Tokenizer 最主要的用途是:

  1. find some keyword using regular expression and make modification based on the keyword, for example
    • Remove comment starting with ##
    • Replace wording such as 10.5pF to 10.5e-12 F

一個常見的問題是 multi-lines block processing,

  • 例如 remove comment between /* …. */
  • 或是 nested keywords such as { … { … } … } or begin … begin … end … end

似乎可以用 parser 處理,或是比較好的 tokenizer, 是全局的 string processing, 而非 line by line processing!

而且要可以處理 nested keywords.

其實. parser 的 tokenzier 基本就可以 handle 這個問題。

Mistune: Python’s “Markdown Parser”

Basic Usage

Mistune 的 render output 包含:HTML and AST

基本用法很簡單。就是把 markdown 轉換成 HTML 或是 AST.

v2.0.2: How to Use Mistune — Mistune 2.0.2 documentation

1
2
3
4
5
# v2.0.2
import mistune
text = 'I am using **mistune markdown parser**'
markdown = mistune.create_markdown(renderer = html|ast)
markdown(text)

v0.8.4: Mistune — mistune 0.8.4 documentation

1
2
3
4
5
# v0.8.4
import mistune
text = 'I am using **mistune markdown parser**'
markdown = mistune.Markdown()
markdown(text)

Customized Renderer

客制化可以從 customized renderer 著手。default renderer 是 mistune.Renderer. 有兩類修改。

  1. 使用 mistune built-in renderer function API, 見 appendix.
  2. 如果沒有在 built-in renderer function API, 則更深的客制化。

基本的 markdown 被 cover 在 inline 和 block level. 複雜一點的如 table, footnote, strikethrough 則可用 plugin (also built-in v2.0.2).

但是 math block, figure caption 等更複雜則沒有被 cover. 需要自己處理。

我們用兩個 built-in renderer function 的例子。

Ex1: Markdown to HTML: with code block highlight

Markdown 本身沒有顔色。在引用 code block 全部是黑色。當轉換成 html 希望有顔色 highlight, 如下圖 (e.g. C code):

image-20220205224242429

此時可以調用 mistune 的 block_code(self, code, language=None), 以及利用 pygments 完成。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
import mistune
from pygments import highlight
from pygments.lexers import get_lexer_by_name
from pygments.formatters import html

class HighlightRenderer(mistune.HTMLRenderer):
# class HighlightRenderer(mistune.Renderer):  # v0.8.4
    def block_code(self, code, lang=None):
        if lang:
            lexer = get_lexer_by_name(lang, stripall=True)
            formatter = html.HtmlFormatter()
            return highlight(code, lexer, formatter)
        return '<pre><code>' + mistune.escape(code) + '</code></pre>'

renderer=HighlightRenderer()
markdown = mistune.create_markdown(renderer=renderer)
# markdown = mistune.Markdown(renderer=renderer)  # v0.8.4

print(markdown('```python\nassert 1 == 1\n```'))
Ex2: Add figure caption with inline_html

Ex1: Add figure caption at inline html image <img

Appendix

Mistune 0.8.4 version

Mistune 2.0.2 version

  V0.8.4 v2.0.2
Markdown to HTML Mistune.Markdown() Mistune.create_markdown()
Customized Renderer mistune.Renderer mistune.HTMLRenderer
Built-in Renderer function API v0.8.x API v2.0.x
Plug-in   strikethrough, table, footnote

V2.0.2 built-in renderer function API

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
# inline level v2.0.2
text(self, text)
link(self, link, text=None, title=None)
image(self, src, alt="", title=None)
emphasis(self, text)
strong(self, text)
codespan(self, text)
linebreak(self)
newline(self)
inline_html(self, html)     # 單行的 html, e.g. <img .... />

# block level
paragraph(self, text)
heading(self, text, level)
thematic_break(self)
block_text(self, text)
block_code(self, code, lang=None)    # 就是 code block: ```lang xxx ``` 
block_quote(self, text)              # 就是 block quote:  > xxx 
block_html(self, html)
block_error(self, html)
list(self, text, ordered, level, start=None)
list_item(self, text, level)

# provided by strikethrough plugin
strikethrough(self, text)

# provide by table plugin
table(self, text)
table_head(self, text)
table_body(self, text)
table_row(self, text)
table_cell(self, text, align=None, is_head=False)

# provided by footnotes plugin
footnote_ref(self, key, index)
footnotes(self, text)
footnote_item(self, text, key, index)

# Finalize rendered content (define output)
finalize(self, data)

V0.8.4 built-in renderer function API

block_code(self, code, language=None)         # 就是 code block: ```lang xxx ```
block_quote(self, text)                       # 就是 block quote:  > xxx 
block_html(self, html)
header(self, text, level, raw=None)
hrule()
list(self, body, ordered=True)
list_item(self, text)
paragraph(self, text)
table(self, header, body)
table_row(self, content)
table_cell(self, content, **flags)


# Span level method names:

autolink(self, link, is_email=False)
codespan(self, text)
double_emphasis(self, text)
emphasis(self, text)
image(self, src, title, alt_text)
linebreak()
newline()
link(self, link, title, content)
strikethrough(self, text)
text(self, text)
inline_html(self, text)                      # 單行的 html, e.g. <img .... />

##

Example:

Ex1: Modified the math block between $$ … $$

Reference

Building a Parser from scratch from Dimitry Soshnikov

https://www.youtube.com/watch?v=0ZDPvdp2uFk&ab_channel=DmitrySoshnikov