Issues With The Lojban Formal Grammar

It is my opinion, and that of others in the Lojban community, that the available grammars for Lojban are not sufficiently formalized. Please note that this page makes essentially no reference to the YACC grammar because of its unreadability; the YACC and BNF grammars are assumed to be equivalent.

Points we are concerned about:

I, for one, feel that it is extremely misleading to say that Lojban is formally parseable while this state of affairs exists, so I'm trying to fix it.

Project Files

In order of relevance; older stuff is farther down.

My Approach

I've decided CFGs (and hence BNF in any form, let alone yacc) are simply not the right formalism for Lojban. See "Old Approach" at the bottom of this page for an explanation as to why. I thought we were stuck with them, though, and that the only other option for clean, elegant formalism was a full-on context-sensitive grammar (shudder).

I was wrong. I recently found Parsing Expression Grammars, which seems perfect for what Lojban needs. Note that that's "((Parsing Expression) Grammars)", not "(Parsing (Expression Grammars))". I am currently working on a PEG for Lojban. The inital version, which is just an automated conversion of the BNF with some morphological information added, already parses most of Lojban!

Please note that while I am not aiming for "bug-for-bug" compatilibilty with either the current official parser or the grammar definition in grammar.300, I am trying to make sure that differences only occur in areas covered by the preprocessor section of grammar.300, which was very much limited by what YACC was able to handle, rather than how the Reference Grammar said the language worked or what a listener or speaker would expect.

Methodology

I've tried to do as much as I can programatically, because makes it easier to convince people that what I produce is equivalent to the original grammar. The BNF is converted to ABNF, some very simple productions are added, and the result is converted to PEG.

Unfortunately, a fair amount of by-hand work is then required. For one thing, the PEG grammar requires writing productions to lexically break up the input. For another, PEG grammars are sensitive to the order of elements, preferring earlier options, and the BNF has several places where taking the earliest option that matches is guaranteed to fail later.

Actual testing is done by converting the PEG grammar to a form suitable for use by a PEG parser generator (of which I am aware of two: Pappy, which generates Haskell parsers, and Rats!, which generates Java parsers). I've been using Rats!, due to having problems with Pappy.

Improvements

Limitations

By-Hand Changes Made To The PEG Grammar

This section enumerates the changes that were made to the PEG grammar starting from the automatically generated version.

If an entry looks like "* [number][letter] -- [stuff]", the number and letter are a reference to a section of the pre-processing guide in grammar.300. It means that that section is intended to implement (or help implement) that rule.

Old Approach

For a while I was trying to adjust the BNF grammar (after conversion to ABNF) to do The Right Thing with respect to elidable terminators, because that's the hard part. I have since come to the conclusion that elidable terminators can merely be made optional if longest-match disambiguation is used, but that puts us in the realm of specifying the parser again, which is what I was trying to avoid.

I still think that one could probably get BNF to do The Right Thing with respect to elidable terminators, but it would be Very, VERY hard. I would be surprised if it could be done without expanding the grammar by a factor of 20 or so. No, that's not an exaggeration or a joke.

Update 20 Sep 2005: I am no longer so sure that it's possible, but I don't have any good reason; just a change in my gut feeling. I also think a 20 times size increase is probably conservative.

To give you a sense of what I mean, consider fixing 'kei'. This requires having the grammar descending from a NU clause to eat all brivla it sees until the next kei. Because BNF is inherently ambiguous, forcing this requires that every place where two brivla could occur next to each other be re-written`to only form two separate selbri when there is a kei between them, but only inside a NU clause. If this is possible in BNF/CFGs, and I'm not totally certain it is, it requires nearly doubling the size of the grammar because you have to have everything under 'subsentence' copied into a "[foo]_during_NU" form, or whatever.

When you're done with that, try another big elidable terminator, like 'ku'. This will require the same thing, but the ku additions to the grammar and the nu additions to the grammar must work nested, in either order. That's two more complete sets, not including the 'ku' or 'kei' sets. You now have a grammar on the order of four times the original size, and you've fixed only two elidable terminators.

Good luck; let me know when you're done.