An ANTLR Book
A book about ANTLR is something that I have long lusted after. ANTLR is something that I keep planning to learn, but the learning curve is too steep to do this informally. Looks like I am getting my wish: The Definitive ANTLR Reference is now available as a beta book.
ANTLR is a compiler compiler, which makes it a great tool for writing languages (Boo's parser is written in ANTLR, for instance). And although I would generally use Boo for DSLs, it is important to me to understand the parser as well.
Comments
That's great news - I've had a few goes at getting to grips with ANTLR, but as you say, it's not something you can just pick up casually...
Isn't what you need just an understanding of EBNF and how to write an LR(n) parser? Or, if you don't need LR(n), just use LL(n)?
It fits on a single sheet of paper :)
Frans, probably.
I don't know them.
EBNF is the notation for the syntax:
Nonterminal:: Terminal|Nonterminal
etc.
It's pretty simple so once you grasp that it's easy to produce a syntax. See:
http://en.wikipedia.org/wiki/Extended_Backus_Naur_Form
LR(n) is a parsing method where you use a stack, and if a syntax rule is fulfilled by the items on the stack, you replace that set of items with the non-terminal of that rule.
E.g. when you have:
Statement :: ( "Foo" ) ;
and you have this on the stack:
(
"Foo"
)
;
you can 'reduce' these 4 elements to 'Statement' and the whole process starts over. LR(n) parsers use tables, so you have a generic parser engine and a set of tables which are specific to a language. You also provide a set of routines, one routine for every rule. So when a reduce takes place (like in my example), the rule handler for that rule gets the reduced elements and thus can, for example, emit something based on the tokens passed in.
The 'n' is the number of items to look ahead before deciding what to do. This can resolve ambiguous elements in your language. Most of the time n=1.
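To make the reduce step concrete, here is a minimal sketch in Python for the single Statement rule above. It is only an illustration under simplifying assumptions: it matches rule bodies directly against the top of the stack instead of using generated tables, it does no lookahead, and the names RULES, on_reduce, try_reduce and shift_reduce are made up for this example.

```python
# Hypothetical one-rule grammar taken from the example:
#   Statement :: ( "Foo" ) ;
RULES = [("Statement", ["(", '"Foo"', ")", ";"])]


def on_reduce(name, items):
    """Rule handler: receives the items that were reduced.

    A real handler could emit code here; this one just builds a tuple.
    """
    return (name, items)


def try_reduce(stack):
    """Replace the top of the stack with a non-terminal if a rule matches."""
    for name, rhs in RULES:
        n = len(rhs)
        if len(stack) >= n and stack[-n:] == rhs:
            items = stack[-n:]
            del stack[-n:]
            stack.append(on_reduce(name, items))
            return True
    return False


def shift_reduce(tokens):
    """Shift each token onto the stack, reducing whenever a rule is fulfilled."""
    stack = []
    for token in tokens:
        stack.append(token)       # shift
        while try_reduce(stack):  # reduce as long as something matches
            pass
    return stack


if __name__ == "__main__":
    print(shift_reduce(["(", '"Foo"', ")", ";"]))
```

A real LR(n) parser decides between shifting and reducing by consulting its tables and the lookahead token; the direct comparison against the rule body here only stands in for that decision.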
LL(n) is a different parsing method. There you simply scan a token and call the handler for that token; the handler then checks what the next token is. If it is an expected token, the handler continues; otherwise it gives up and control returns to the main routine.
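A hand-written LL(1) parser for the same Statement rule could look roughly like the sketch below, in Python. The class and method names (Parser, ParseError, peek, expect, statement, program) are assumptions made for this illustration and are not taken from any existing parser.

```python
class ParseError(Exception):
    """Raised when the next token is not the one the handler expects."""


class Parser:
    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.pos = 0

    def peek(self):
        """Look at the next token without consuming it (the '1' in LL(1))."""
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, expected):
        """Consume the next token if it is the expected one, else give up."""
        token = self.peek()
        if token != expected:
            raise ParseError(f"expected {expected!r}, got {token!r}")
        self.pos += 1
        return token

    def statement(self):
        """Handler for: Statement :: ( "Foo" ) ;"""
        self.expect("(")
        word = self.expect('"Foo"')
        self.expect(")")
        self.expect(";")
        return ("Statement", word)

    def program(self):
        """Main routine: dispatch on the lookahead token until input ends."""
        statements = []
        while self.peek() == "(":
            statements.append(self.statement())
        if self.peek() is not None:
            raise ParseError(f"unexpected token {self.peek()!r}")
        return statements


if __name__ == "__main__":
    print(Parser(["(", '"Foo"', ")", ";"]).program())
```

Each rule gets its own method, and the lookahead token decides which method the main routine calls, so the structure of the code mirrors the structure of the syntax.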
ANTLR generates the parsing environment for you, although despite what the name might suggest it produces LL parsers (LL(k), and LL(*) in version 3) rather than LR(n) tables. You provide the EBNF syntax and the handlers: the syntax is used to generate the parser, and the handlers you provide are combined with the generated code and the ANTLR runtime to form the complete parsing code.
Not every syntax is parsable by an LR(n) parser. For example, UBB forum syntax is hard to do without conflicts because the language is ambiguous. You are then better off writing an LL(n) parser, which is straightforward but more work, as you can't generate it from the syntax most of the time.
Nice to see more interest in DSLs. I think a combination of DSLs inside a single program is the future, and in that light I find it stupid that MS didn't create a DSL-aware environment in .NET 3.5, where LINQ is a DSL inside another language like C#. It would then be possible to create other DSLs as well, which could be embedded inside C# code.
Oh well... :)
Btw, if you want an example of an LL(1) parser, my LL(1) parser for UBB in C# can be found in the HnD source code:
http://www.llblgen.com/HnD
Frans,
Thanks for the detailed explanation.
I am interested in parsing because I feel it is something that I am missing.
About DSLs, I certainly agree that we are going to see a lot more of those in the future.
Although I think that languages such as Boo, which are open to extension but already handle a lot of the complexity for you, are the way to go.