Flutterby™! : Additional writing languagesparsers

Additional writing languagesparsers

2025-03-11 00:25:02.261769+01 by Dan Lyke 5 comments

Additional writing languages/parsers question: I've seen two modern texts which do lexing and parsing as separate tasks. This makes some things easier, but for at least one of my applications makes complexity harder to manage. I'd love to read someone exploring the "why" on that.

[ related topics: Writing ]

comments in ascending chronological order (reverse):

#Comment Re: Additional writing languagesparsers made: 2025-03-11 18:56:07.906134+01 by: Mars Saxman

It's a performance thing. Lexing produces tokens, which are defined by regular expressions, whereas parsing yields syntax trees, which are defined by context-free grammars, which can describe recursive productions: not possible in a regular expression. So the lexer can be implemented as a finite state machine, but the parser must be more complex.

You might be interested in parser combinators... there's not necessarily any distinction between lexing and parsing with that approach.

#Comment Re: Additional writing languagesparsers made: 2025-03-12 04:09:17.363242+01 by: Dan Lyke

Interesting. I'm now thinking through what I even think I mean by having them separate and together.

The system I wrote for work has a lexer that generates an array of tokens (or not, if there are invalid specifications, like malformed strings), and then a recursive descent parser that operates on that array.

It's been useful for migrating from if ([cmd hasPrefix:@"command "]) to have the intermediate lexer, but as I'm writing the same concept in Rust, everything in the parse tree is just a node that does something, contains matchers for tokens (which generally incorporate the lexer functionality), or contain lists and conditionals of other nodes (why, yes, I do think in C++ objects, even though in Rust they're indexes into arrays held by the parser Arena...).

As I've been playing with taking my expression parser and turning it into a language, I've been thinking about how to manage more ... static? ... symbols, like function calls (currently done by defining a node that matches the function name, an open paren token, each of its parameters joined by comma tokens, and a close paren), and dynamic ones, like temporary variable scoping.

Are those things resolved in the parse tree (which makes autocompletion easier), or do we just parse to "this matched an identifier, open paren, some number of comma separated arguments, close paren, let's store that and try to resolve it later to to the function call"?

Which is related to the "Is everything an expression, with the operators doing the resolution, or do I encode typing in the parse tree, so that I know that a comparison operator returns a boolean, and ya can't just use the "+" operator to append that to a string without a type coercion?" question.

It'd be better if I had a job mandate to write this rather than it being something I'm squeezing in because I think we need it, but nobody else sees a business need...

#Comment Re: Additional writing languagesparsers made: 2025-03-12 04:57:11.778694+01 by: Mars Saxman

Generally the more sophisticated you want your compiler's behavior to be, the more likely you are to defer resolving names and assigning types. Lots of small transformations are easier to reason about than fewer large ones. In the DSL I worked on most recently, we lex, parse, and generate an abstract syntax tree (basically a parse tree minus trivial details - you don't store parens and commas, you just store an array of whatever's inside the parens and commas, etc.). Then we resolve identifiers within local scope, then we do type analysis, then we simplify control structures, etc. Many little passes, reducing a little at a time, until ultimately we emit some executable code.

So, I'd say, sure, make the data structure you're using for a parse tree *capable* of having type annotations, but that doesn't mean you need to compute them right away. You can just describe the structure of the language, and then later you can start working out what the types are, and then you can write further passes which cascade them down, and that way every piece of what you're doing is smaller and easier to reason about.

(am nerd, love this topic)

#Comment Re: Additional writing languagesparsers made: 2025-03-14 00:08:00.774712+01 by: Dan Lyke

Thanks. I'd been headed towards "let the parse tree determine everything", but obviously that starts to get pretty complex when we start looking at defining structures in partial scopes and whatnot...

And pushing the decisions on type stuff further down the path means various bits of the parse tree are more reusable.

#Comment Re: Additional writing languagesparsers made: 2025-03-16 04:41:01.815692+01 by: concept14

I'm glad I haven't had to worry about this topic in 30 years. Back in the 90s I had to deal with a system written by another team at my company. They did not have separate lexing and parsing phases; in fact they did not do any real lexing at all. This meant there were odd undocumented restrictions like you can't use variable names starting with AND or OR.

Add your own comment:

(If anyone ever actually uses Webmention/indie-action to post here, please email me)

Format with:

(You should probably use "Text" mode: URLs will be mostly recognized and linked, _underscore quoted_ text is looked up in a glossary, _underscore quoted_ (http://xyz.pdq) becomes a link, without the link in the parenthesis it becomes a <cite> tag. All <cite>ed text will point to the Flutterby knowledge base. Two enters (ie: a blank line) gets you a new paragraph, special treatment for paragraphs that are manually indented or start with "#" (as in "#include" or "#!/usr/bin/perl"), "/* " or ">" (as in a quoted message) or look like lists, or within a paragraph you can use a number of HTML tags:

p, img, br, hr, a, sub, sup, tt, i, b, h1, h2, h3, h4, h5, h6, cite, em, strong, code, samp, kbd, pre, blockquote, address, ol, dl, ul, dt, dd, li, dir, menu, table, tr, td, th

Comment policy

We will not edit your comments. However, we may delete your comments, or cause them to be hidden behind another link, if we feel they detract from the conversation. Commercial plugs are fine, if they are relevant to the conversation, and if you don't try to pretend to be a consumer. Annoying endorsements will be deleted if you're lucky, if you're not a whole bunch of people smarter and more articulate than you will ridicule you, and we will leave such ridicule in place.

Flutterby™ is a trademark claimed by

Dan Lyke

for the web publications at www.flutterby.com and www.flutterby.net.