Issues With The Lojban Formal Grammar
It is my opinion, and that of others in the Lojban community,
that the available grammars for Lojban are not sufficiently
formalized. Please note that this page makes essentially no
reference to the YACC grammar because of its unreadability; the YACC
and BNF grammars are assumed to be equivalent.
Points we are concerned about:
- The BNF used is non-standard. In places it is very
non-standard. This hinders attempts to formally analyze
it.
- The morphology is not formalized at all. All of the
terminal productions are hand-waved and, in practice, handled by
code that, thus far, everyone who has produced a program that
parses Lojban has written separately.
- The elidable terminators are not formalized at all. They,
too, have been handled by code that, thus far, everyone who has
produced a program that parses Lojban has written separately.
I, for one, feel that it is extremely misleading to say
that Lojban is formally parseable while this state of affairs
exists, so I'm trying to fix it.
Project Files
In order of relevance; older stuff is farther down.
- camxes, the Rats!-based PEG parser itself, as a Java JAR file. Please do not ask
me for help on running it; I'm very bad with Java.
- The by-hand modified PEG grammar
for camxes, which is mostly my work. See the "Changes Made To
The PEG Grammar" section for what I've done to it over the
automatically generated version. The current version of the
morphology to go with this, which is mostly xorxes' work, can be
found at
the BPFK morphology page.
Note that the morphology does not have to be in a separate file
and, in fact, the two files are merged before processing, but
all of the things in the morphology file are done first to make
the grammar itself easier to read. There's a bit of interface
code between the main grammar and the morphology; some of it is
in the main grammar itself, and some is in the morphology header file.
- An old version of the by-hand modified PEG morphology
- The folder for the Rats! parser
generated from the PEG above. Please do not ask me how to make
it run; I am very, very bad with Java. The Howto.cook file explains how to build
it; anything not referenced there is probably irrelevant. The
command line I use to run it from that directory is
"/usr/local/java/bin/java -Xss64m -jar lojban_peg_parser.jar",
with the sentence to parse passed on standard in. I suggest
piping it to a pager such as "less" or "more" as it is currently
set in debug mode and hence produces a lot of output.
Help making the Java portions less stupid would be
appreciated!
- I have recently updated the PEG
to Rats! converter to have it produce a Rats! grammar that
automatically generates a (very ugly) parse tree. The Perl code
is even uglier than the parse tree. It doesn't deal well with
anything that could be considered lexing; all such productions
should either have "NORATS" at the start of the line or be after
the line "; --- NORATS ---" in the grammar.
- A set of test sentences for
testing changes to the PEG grammar (by comparison to the
official parser and jbofihe). Contains (in order) a bunch of
test sentences I used for testing pre-processor tokens, all the
example sentences from the refgram, all of Alice (one paragraph
per line) and all the IRC logs as of the end of March 2004.
Total is 34 thousand lines. Lines marked with "-- GOOD" are
those that I have carefully examined and determined to be valid
Lojban; those marked with "-- BAD" are invalid Lojban. These
two are generally only used when one or more parsers gets it
wrong.
- The automatically generated
PEG version. This version is programmatically generated from
the expanded ABNF version.
- The Perl program that did the BNF
to ABNF conversion.
- An expanded ABNF version. The
only differences between this and the file above are the
addition of productions for the various selma'o and a few
example brivla and cmene.
- An ABNF version of the grammar file.
All changes were made programmatically, and should make no
actual difference.
- The Perl program that did the ABNF
to PEG conversion.
- The original bnf.300 grammar file.
- A discussion between me and John Cowan
about the elidable terminators issue. As of 10 Feb 2004, no
real conclusions were reached.
- A link to the ABNF standard,
AKA RFC 2234. ABNF is very widely used in various RFC
documents.
My Approach
I've decided CFGs (and hence BNF in any form, let alone YACC) are
simply not the right formalism for Lojban. See "Old Approach" at the
bottom of this page for an explanation as to why. I thought we were
stuck with them, though, and that the only other option for clean,
elegant formalism was a full-on context-sensitive grammar
(shudder).
I was wrong. I recently found Parsing
Expression Grammars, which seem perfect for what Lojban needs.
Note that that's "((Parsing Expression) Grammars)", not "(Parsing
(Expression Grammars))". I am currently working on a PEG for
Lojban. The initial version, which is just an automated conversion
of the BNF with some morphological information added, already parses
most of Lojban!
Please note that while I am not aiming for "bug-for-bug"
compatibility with either the current official parser or
the grammar definition in grammar.300, I am trying to make sure that
differences only occur in areas covered by the preprocessor section
of grammar.300, which was very much limited by what YACC was able to
handle, rather than how the Reference Grammar said the language
worked or what a listener or speaker would expect.
Methodology
I've tried to do as much as I can programmatically, because that makes
it easier to convince people that what I produce is equivalent to
the original grammar. The BNF is converted to ABNF, some very
simple productions are added, and the result is converted to
PEG.
Unfortunately, a fair amount of by-hand work is then required.
For one thing, the PEG grammar requires writing productions to
lexically break up the input. For another, PEG grammars are
sensitive to the order of elements, preferring earlier options, and
the BNF has several places where taking the earliest option that
matches is guaranteed to fail later.
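To make the ordering problem concrete, here is a minimal Python sketch of PEG-style ordered choice. It is not part of camxes or of the converter; the helper names are made up for illustration, and the two orderings are the ones quoted for the FA selma'o in the by-hand changes section.

```python
def ordered_choice(alternatives, text):
    """Return the first alternative that is a prefix of text, PEG-style.

    Unlike a CFG alternation, later alternatives are never tried once an
    earlier one has matched.
    """
    for alt in alternatives:
        if text.startswith(alt):
            return alt
    return None


def matches_whole_word(alternatives, word):
    """True if the alternative the choice commits to consumes the whole word."""
    return ordered_choice(alternatives, word) == word


# Order as in the automatically generated FA production.
FA_GENERATED = ["fa", "fe", "fi", "fo", "fu", "fai", "fi'a"]
# Order after the by-hand fix: longer alternatives precede their prefixes.
FA_REORDERED = ["fai", "fa", "fe", "fo", "fu", "fi'a", "fi"]

if __name__ == "__main__":
    for word in ["fa", "fai", "fi", "fi'a"]:
        print(word,
              matches_whole_word(FA_GENERATED, word),
              matches_whole_word(FA_REORDERED, word))
```

With the generated order, "fai" and "fi'a" are cut short at "fa" and "fi"; with the re-ordered list all four words are consumed whole.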
Actual testing is done by converting the PEG grammar to a form
suitable for use by a PEG parser generator (I am aware of two:
Pappy, which generates Haskell parsers, and Rats!, which
generates Java parsers). I've been using Rats!, due to having
problems with Pappy.
Improvements
- 'si' handling is quite different in the way it interacts
with "lo'u" and "zo". Current rule: 'si' is ignored in
"lo'u...le'u", but a string of 'si' *afterwards* is honored one
word per 'si', including 'si' and 'zo', because neither has any
special power in the lo'u clause. As an interesting side
effect, "lo'u mi le'u si lo'u mi le'u" works. It means "lo'u mi
lo'u mi le'u".
- BU handling actually works. It doesn't in the official parser as of 29
Mar 2004 ("bu bu broda" passes, and "ky bu bu broda" fails).
- '.y.' is completely ignored; the only way it can interact with
the rest of the grammar is as "zo .y." or as ".y. bu". It can
actually have more than one y in a row (i.e. ".yyyyy."), because
I at least often use it that way on IRC.
- Multiple BAhE in a row are allowed. It is assumed this will
be used for emphasis. Note that the current official parser
also accepts this, even though it should not according to
grammar.300.
- "su" can be backed out of with "si", allowing a speaker to
save themselves from a potentially crushing mistake.
- "!" and "?" are treated as white space. Probably other
things should be added to that list (q and w?).
- Groups of "si" and everything up to a "sa" are both erased
at the beginning of a string. This may or may not be
justifiable according to grammar.300; no-one's really
sure. This means that sentences like "si si si" and "sa"
are legal, as well as sentences like "le broda sa .i mi cusku".
- SA and SI now interact in a more obvious fashion. For
example, "le broda brode brodi .y. sa le si la broda brode
brodi" is equivalent to "la broda brode brodi". Just using "sa"
would not work because "le" and "la" are in different
selma'o.
- Interactions between ZOI, SI, and SA are much
richer. The goal is to achieve something more like what a user
would 'expect', given the basic definitions of those words.
Details:
- The first SI after the close of a ZOI clause erases the
closing delimiter, allowing one to add to the protected
text. "zoi gy weeble gy si bob gy" is equivalent to "zoi gy
weeble bob gy".
- Two consecutive SI after the close of a
ZOI erase the non-Lojban text itself; while it would
theoretically be possible to have consecutive SI after the
close of a ZOI erase individual words inside the ZOI
protected text, this is a bad idea because (for example)
breaking up a bird call into words makes very little
sense.
So, for example, "zoi gy da da da gy si si de gy" is
equivalent to "zoi gy de gy".
- The interaction of these two features leads to a
somewhat strange, but very minor, side effect: It is
impossible to add to the protected text inside a
zoi clause (i.e. using a single SI after the closing
delimiter) any text that starts with "si" (unless it then
goes on to be something that looks like a Lojban brivla or
cmene), because it will be interpreted as two SI, causing
erasure of the entire protected text.
- Three consecutive SI after the close of a ZOI erase
everything but the ZOI itself, so that, for example, "zoi gy
da da da gy si si si dy weeble dy" is equivalent to "zoi dy
weeble dy".
- Four consecutive SI after the close of a ZOI erase the
entire ZOI clause, including the ZOI.
- Similarly, after ZO+word, a single SI deletes the word,
causing the next word to be caught by the ZO, but two SI
delete both the word and ZO.
- Because of the SA and SI interaction enhancements, the
fast way to delete an accidental ZOI is to close the
delimiter and say "sa zoi si", and then continue on. For
example, "broda zoi gy da da da da gy sa zoi si da" is
equivalent to "broda da".
- Multiple zei are handled in a different order;
historically, "broda zei zei broda" was "(broda zei zei) broda"
and "zei zei broda" was invalid. In my parser, it's "broda (zei
zei broda)", and "zei zei broda" is "zei type-of broda". This
was accidental at first, but it was pointed out that with the
old way it was essentially impossible to say "zei type-of
lujvo", which this fixes.
- None of si, sa, su, y, or zei are allowed as zoi
delimiters, since delimiters are not scarce so it doesn't make
sense to block the more useful interpretation.
- Multiple sa in a row delete back to further previous
instances of that selma'o. For example, "le le broda cu brode
sa sa le brodi" is the same as "le brodi".
- Y is ignored anywhere except in front of BU.
- As a very special case, ZOI SA clauses can accept arbitrary
strings, to handle things like "zoi foo booz foo co si si WEEB!
foo dysa zoi bar baz bar" (which, in case you're wondering is
equivalent to "zoi bar baz bar").
- Allows things like "byfy doi mark cu broda", whereas before
a boi would have been required after "byfy".
- Allows "free" in many more places.
- Allows things like ".i fi'o broda bo mi klama".
- Allows "lo broda joi lo broda" without a ku before the joi.
Limitations
- Error reporting is essentially non-existent. This may not
be fixable.
By-Hand Changes Made To The PEG Grammar
This section enumerates the changes that were made to the PEG
grammar starting from the automatically generated version.
If an entry looks like "* [number][letter] -- [stuff]", the
number and letter are a reference to a section of the pre-processing
guide in grammar.300. It means that that section is intended to
implement (or help implement) that rule.
- Fixing up of things like (NAI+)? into NAI*, and removing extraneous (...).
- Change of = to <-
- 3 -- Fixed the 'text' production to ignore everything after fa'o
- Left-factored rp-expression (this is a syntactic change only).
NOTE: I'm not completely certain I did this correctly,
so please take a look if you're into this sort of thing. The
old form:
rp-expression <- rp-operand rp-operand operator
rp-operand <- operand / rp-expression
New form:
rp-expression <- (operand / operand rp-operand operator) rp-operand operator
rp-operand <- operand / rp-expression
- Re-ordered some selma'o productions due to how PEGs work. For
example:
FA <- "fa" / "fe" / "fi" / "fo" / "fu" / "fai" / "fi'a"
won't work because the 'fa' will match first, even if the word is actually
'fai'. Same with 'fi' and "fi'a". So it was re-ordered to:
FA <- "fai" / "fa" / "fe" / "fo" / "fu" / "fi'a" / "fi"
- Added productions 'Spacing' and 'Spaces', for handling
whitespace.
- 4c -- Added 'post-cmavo', which goes after every cmavo string in every
selma'o. post-cmavo only accepts a string if it is not followed
by a member of BU. It also requires the string to be followed
by 'post-cmavo-spacing', which was also added.
post-cmavo-spacing allows an optional trailing '.' and either
some spaces or another cmavo. This is probably an approximation,
and will likely need to be reviewed when stricter morphology is
included.
- Added some basic morphology constructions, as follows. Please
note that these are preliminary and known to be imperfect. For
example, "la fo''''o" is perfectly acceptable to these
preliminary rules.
- consonant and vowel
- other-letter, which is ['y,.]
- lojban-letter, which is any of the above
- cmene-letter, which is lojban-letter plus upper-case
versions
- CMAVO (which contained just a list of all selma'o) was moved
to 'known-cmavo'. CMAVO became either known-cmavo or a
consonant or '.', followed by a vowel, followed by any
number of single-quote vowel pairs or vowels or both,
followed by cmavo-spacing.
- CMENE is now an optional ".", one or more cmene-letters, a
consonant, and spacing.
- BRIVLA is now basically a tester for consonant in the first
5 characters and ending with a vowel.
- 4e -- Added (UI NAI?)+ to the end of 'spaces'. This allows any word,
basically, to be followed by any number of UI or UI-NAI pairs.
- 2a, not working -- removed the ZOI production from sumti-6, as
this can't be made to work without a pre-processor.
- 2b -- In the ABNF version I changed 'any-word' to be a brivla or
a cmene or a cmavo. This, along with the already extant "ZO
any-word" rule, handles zo.
- 2c -- added lohu-tail to handle lo'u...[first le'u]
- 2e -- Added 'si-clause' to 'spaces', which takes any nesting
"word si" pairs. Also tweaked ZO and LOhU so that they refuse
to process SI in this way. This required making a copy of
'any-word' that wouldn't handle SI clauses, and a fair bit of
tweaking of LOhU. Also added si-clause as an option to the
beginning of text. 'si' seems to be working with lo'u.
-- incomplete; "zo si si mi".
- 4c -- Re-ordered sumti-6 to start with ZO, LOhU, and LU productions, in
that order. This fixed BU interaction with LOhU.
- 4c -- Re-ordered fragment to put 'terms' out in front, to fix ".abu"
(and probably other BU problems)
- 4c -- Re-ordered sumti-6 to put BU just before LU.
- 2b,2e -- Added a SI clause to zo, allowing things like "zo si .y. si
fi" to do the thing a human would expect.
- 4e -- Added Y to all the spaces functions, so that the example above
*actually* works. Created "absorb-indicators" to do this. Scattered "Y*"
liberally throughout the grammar, so it's ignored basically
everywhere.
- 4c -- Made "indicator" not work before BU. Allowed Y
without BU at the beginning of text as a free token.
- Added '.' to the spaces functions, so it's treated just like
' '. Much easier. Removed the leading '.' from relevant cmavo.
This necessitated changes to the CMAVO and CMENE productions.
- 4e -- Added a special case to allow "zo y" and "y bu" to
work ("y" is ignored everywhere else).
- 4e -- added "NAI CAI?", DAhO, FUhO and FUhE to
absorb-indicators.
- 4a,others? -- Added a second option to 'sentence', which
contains only 'bridi-tail', so ZEI will work properly in cases
where the first word could be a sumti.
- 4a -- moved ZEI productions to the front of
tanru-unit-2.
- 4b -- Added 'pre-cmavo' to all selma'o except BAhE, SI and
BU. Put BAhE in pre-cmavo. Added a special case for "BAhE
BU".
- 2g -- Added "su-clause" to the beginning of text, to handle
starting SU clauses.
- 2g -- Added "su-clause" after NIhO and TUhE; LU and TO
contain 'text' already.
- 2f -- Added "[selma'o]-sa-clause" to *every* selma'o, along
with "[selma'o]-no-SA-handling", and any-word-no-SA-handling and
friends.
- Reordered indicators & free in text to have "indicators
free+" be first.
- Added 'text-1' and 'paragraphs?' to text-1 to match the
YACC grammar (bug in the BNF).
- Added a second clause to statement-2 and
statement-3 so that sentences with statement clauses would be
preferred.
- Fixed a parenthesis error in sumti-6.
- Made the morphology a bit more sane.
- Reordered space-interval to make the longest option come
first.
- Added an option to text without the CMENE eater, so that
"bab zei bab", for example, works at the start of text, instead
of having the first 'bab' eaten by 'text' itself.
- More morphology fixes.
- Fixed up reverse polish notation again; seems to actually work
now.
- Re-ordered vocative to handle "coi doi".
- Re-ordered 'text' massively to have joiks not
necessarily get eaten. Also made the 'paragraphs' in
text-1 not optional, but made text-1 optional in 'text' (which
was the same thing).
- Added "!gek" to "term" to allow things like "mi pu gi
[stuff] gi [stuff]" to not try to treat the second term as the
start of a tensed sumti.
- Added "!(stag? BO) !(stag? KE)" to bridi-tail-1 to have it
not eat giheks at every possible opportunity.
- Added "!MOI" to quantifier to allow "pamoi" and such to
work.
- Fixed BRIVLA to not end in 'y' and CMENE to not stop just
before a 'y'.
- Added a check to not match cmavo if they are *immediately*
followed by a string of non-spaces ending in a consonant.
- Explained to the two occurrences of lerfu-string that aren't
followed by MOI that if they are followed by MOI they
shouldn't match.
- Fixed a translation bug; at some point the optionality got
dropped from the first "NUhU free*" in termset.
- Fixed a bug in text-1: instead of allowing ijek [text-1]
xor paragraphs, it required both.
- Allowed ! and ? as spaces.
- Prioritized sumti-tail to not have a quantifier, so "le pa
roi broda" would work.
- Added su-clause to NIhO cases that didn't have it.
- Added a special case to allow 'su' to be the first word in a
text block.
- Expanded text and text-1 to more accurately match
grammar.300, and to have better preferential behaviour.
- Re-ordered simple-tense-modal to prefer ((time space? /
space time?) CAhA).
- Tried to clean up the morphology, once again, by moving
morphology checks to before words.
- Minor, cosmetic changes to make things work better with
peg2rats.pl
- Implementation of 'si' and 'sa' at the beginning of
strings.
- Fixed zo so that zo + any-word is itself a possible outcome
of any-word, so that "zo irk zei broda" works, for example.
- ZOI handling added using semantic predicates and parser
actions (and a few changes to peg2rats.pl).
- Added EOF and not cases to "text". There might be a better
way of handling the 12 (!) productions that "text" now has, but I
don't know what it is. Broke out parts of "text" as
sub-productions as part of this.
- Implemented multiple "sa" handling, such that each extra sa
takes things back to an earlier instance of the following selma'o.
As a side effect, there can be any number of sa at the beginning
of a string.
- Cleaned up space handling a bit.
- Re-ordered "fragment" to maximally prefer prenexes.
- Various tweaking of magic word handling. BU handling is now
in line with BPFK decisions as of 16 Jun 2004.
- Various re-ordering, re-naming and addition of tags to help
with automated implementation of various parser features.
- Fixed a bug where "lu na jo li'u" would be read as "lu
fragment( na ) [elided li'u] joik-jek [error]". This only
affected "NA JA" inside a quoting structure that referred back
to "text".
- Fixed a bug where "la cmen zei [anything]" would fail due to
"la cmen" being preferred.
- Moved indicators around so they would be visible without -b.
- Fixed a bug with ba'e and indicators.
- Changed (BOI free*)? in sumti-6 to BOI? free*. This allows
things like "byfy doi mark cu broda", whereas before a boi would
have been required after "byfy". In the case of conflict, "MAI"
wins.
- Added "!BU" after BRIVLA, so the bare sentence "slaka bu"
would work.
- Stopped ZEI from absorbing indicators.
- Fixed a bug in lerfu-string + MAI handling where "le me
bypyfyky moi" would break into "(le me bypyfy) (ky moi)", rather
than lerfu-string being greedy (thus causing failure).
- Changed (BOI free*)? to BOI? free* in the other few places it
occurred.
- Fixed FUhE to not absorb indicators (doesn't work too well
otherwise).
- Added indicator absorption to bu-clause.
- In tanru-unit-2, changed
"ME free* sumti (MEhU free*)? (MOI free*)?"
to
"ME free* (sumti / lerfu-string) (MEhU free*)? (MOI free*)?"
because that seems to have been the intention, and it was broken
by me scattering !MOI in various lerfu-string situations.
- Changed all things like (TERM free*)? to TERM? free*.
This should allow free in many more places. It seems to have
made no changes WRT the test suite, but is still a somewhat
experimental change. This change took place in RCS version
1.26. Note that it has now been thoroughly tested by comparing
the tree output of the two versions for everything in
test-sentences.txt; no changes were found except for in
sentences exploiting the new stag functionality (see below).
- Also made stag equivalent to tag, which allows things like
".i fi'o broda bo mi klama". This is rather less experimental;
we understand pretty well what it does. Same change number.
- Fixed up a very subtle interaction bug between normal ZOI
and ZOI in a si clause.
- Separated morphology into a separate file,
lojban_morphology.peg.
- Many changes to cause Magic Words handling to conform to the
current BPFK proposal (as of 30 Nov 2004).
- Added !selbri-1 after tag in term to stop "mi bai klama"
from being read as "mi bai ku klama". -- NOT SUFFICIENT; breaks
"mi broda lo nu brodo da ca brode".
- Discovered unnecessary complexity in statement-2;
streamlined to match the BNF after no reason for it could be
found.
- Turned all (ek / joik) into joik-ek.
- Broke mex-forethought and fore-operands out of mex-2 for
clearer parse trees.
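Several of the entries above ("!gek", "!MOI", "!BU") rely on the PEG not-predicate, which checks what comes next without consuming it. A minimal Python sketch of the idea follows, using the "!MOI" case on quantifier. It assumes the input has already been split into cmavo tokens, the word sets are small illustrative subsets, and the function names are invented for the example.

```python
NUMBERS = {"pa", "re", "ci"}   # a few PA cmavo, for illustration only
MOI = {"moi", "mei"}           # a subset of selma'o MOI


def not_followed_by(tokens, pos, forbidden):
    """PEG not-predicate: succeeds, consuming nothing, unless the next
    token is one of the forbidden ones."""
    return pos >= len(tokens) or tokens[pos] not in forbidden


def quantifier(tokens, pos):
    """Toy version of: quantifier <- number+ !MOI

    Returns the position after the match, or None on failure.
    """
    end = pos
    while end < len(tokens) and tokens[end] in NUMBERS:
        end += 1
    if end == pos:
        return None  # no number at all
    if not not_followed_by(tokens, end, MOI):
        return None  # leave the digits for a "number MOI" production
    return end


if __name__ == "__main__":
    print(quantifier(["pa", "moi"], 0))
    print(quantifier(["pa", "broda"], 0))
```

Because the predicate makes quantifier fail on "pa moi", the parser is free to try whichever later alternative handles the number together with its MOI.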
Old Approach
For a while I was trying to adjust the BNF grammar (after
conversion to ABNF) to do The Right Thing with respect to elidable
terminators, because that's the hard part. I have since come to the
conclusion that elidable terminators can merely be made optional if
longest-match disambiguation is used, but that puts us in the realm
of specifying the parser again, which is what I was trying to
avoid.
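As a toy illustration of "make the terminator optional and match greedily", here is a small Python recursive-descent sketch. It is a deliberate simplification and not real Lojban grammar: a NU clause here just contains a run of units rather than a full bridi, and parse_units is a hypothetical name.

```python
def parse_units(tokens, pos=0):
    """Toy grammar, parsed greedily:

        units <- unit+
        unit  <- "nu" units "kei"? / word

    Returns (tree, next_position). An open "nu" swallows every following
    unit until an explicit "kei" or the end of input.
    """
    units = []
    while pos < len(tokens) and tokens[pos] != "kei":
        if tokens[pos] == "nu":
            inner, pos = parse_units(tokens, pos + 1)
            if pos < len(tokens) and tokens[pos] == "kei":
                pos += 1  # explicit terminator present; consume it
            units.append(["nu", inner])
        else:
            units.append(tokens[pos])
            pos += 1
    return units, pos


if __name__ == "__main__":
    print(parse_units("nu broda brode".split()))
    print(parse_units("nu broda kei brode".split()))
```

Without "kei" both brivla end up inside the NU clause; with it, "brode" is parsed outside. Note that the result depends on the parser always preferring to keep consuming, which is exactly the "specifying the parser" concern raised above.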
I still think that one could probably get BNF to do The Right
Thing with respect to elidable terminators, but it would be Very,
VERY hard. I would be surprised if it could be done
without expanding the grammar by a factor of 20 or so. No, that's
not an exaggeration or a joke.
Update 20 Sep 2005: I am no longer so sure that it's possible,
but I don't have any good reason; just a change in my gut feeling. I
also think a 20 times size increase is probably conservative.
To give you a sense of what I mean, consider fixing 'kei'. This
requires having the grammar descending from a NU clause to eat all
brivla it sees until the next kei. Because BNF is inherently
ambiguous, forcing this requires that every place where two brivla
could occur next to each other be re-written to only form two
separate selbri when there is a kei between them, but only inside a
NU clause. If this is possible in BNF/CFGs, and I'm not totally
certain it is, it requires nearly doubling the size of the grammar
because you have to have everything under 'subsentence' copied into
a "[foo]_during_NU" form, or whatever.
When you're done with that, try another big elidable terminator,
like 'ku'. This will require the same thing, but the ku additions
to the grammar and the nu additions to the grammar must work nested,
in either order. That's two more complete sets, not including the
'ku' or 'kei' sets. You now have a grammar on the order of four
times the original size, and you've fixed only two elidable
terminators.
Good luck; let me know when you're done.