drh,

Thanks for the lengthy reply. tokenize.c has answered a lot of questions. Also, the case conventions are helping me understand lemon better. I'm going to try to pull a late evening to get something functional, or my boss will force me to write my parser without a parser generator. I'm going to try the flex tool because the stream that I have to parse is not overly complex and I'm trying to sell lemon as a way to enhance productivity and code flexibility - something tells me that hand coding a tokenizer is just going to prove to my boss that I'm creating more work for myself.

Wish me luck!

RW

Ron Wilson, Senior Engineer, MPR Associates, 518.831.7546

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, December 18, 2007 3:45 PM
To: [EMAIL PROTECTED]
Subject: Re: [sqlite] a few lemon questions

"Wilson, Ron" <[EMAIL PROTECTED]> wrote:
> I'm bringing myself up to speed on lemon.  I have a few questions.  Take
> this example code:
>
> %type multiselect_op {int}
> multiselect_op(A) ::= UNION(OP).             {A = @OP;}
> multiselect_op(A) ::= UNION ALL.             {A = TK_ALL;}
> multiselect_op(A) ::= EXCEPT|INTERSECT(OP).  {A = @OP;}
>
> 1.  What does the '@' symbol mean?  At first glance I thought it meant,
> 'give me the literal string,' but 'A' is an integer, so that doesn't
> work.  How is {A = @OP;} different from {A = OP;}?

The @-thing is a hack we put in for an embedded device manufacturer
who was short on memory and wanted to make SQLite smaller.

Within an action, "@<token>" is replaced by the numeric code for
<token>.  So

    multi(A) ::= UNION(OP).   {A = @OP;}

means exactly the same thing as:

    multi(A) ::= UNION(OP).   {A = TK_UNION;}

Why bother, you ask?  First off, notice that it does make a
difference for the rule:

    multi(A) ::= EXCEPT|INTERSECT(OP).  {A = @OP;}

In this rule, @OP can have two different values depending on which
of the two tokens matched.  Without the @<token> construct, we would
have had to make two separate rules:

    multi(A) ::= EXCEPT.     {A = TK_EXCEPT;}
    multi(A) ::= INTERSECT.  {A = TK_INTERSECT;}

The pair of rules is equivalent to the single rule, but the single
rule takes up less space.

The reason for using {A = @OP;} as the action for UNION is that the
action is character-by-character identical to the action for the
EXCEPT|INTERSECT rule.  And Lemon has a feature where it coalesces
identical actions, thus saving additional code space.

So to answer your question, the @-thing is all just an optimization
to help make SQLite smaller.  You can easily omit it if you find it
confusing.

> 2.  If TK_ALL is a token, what are the other all-caps literals?  I assume
> they are literal text, i.e. 'UNION' is a keyword in sql.  However, SQL
> is not case sensitive, so explain how case is handled with these
> literals.  I don't think sqlite upcases all text sent to the parser, so
> there must be some rule that controls the case sensitivity.  Or maybe
> these are also tokens?

The grammar knows about terminals (tokens) and non-terminals.  In
Lemon, every identifier that starts with an upper-case letter is a
terminal.  Every identifier that starts with a lower-case letter is
a non-terminal.  By convention, we put all identifiers in a single
case so that terminals are all upper-case and non-terminals are all
lower-case.  But lemon really only looks at the first character.

Also by convention, we make the name of each terminal the same as
the corresponding keyword in whatever language we are parsing.  That
just aids comprehension for human readers.  Lemon doesn't care.

Lemon goes through and assigns ascending integers to all of your
terminals.  It is the lexer's job to associate input keywords with
the integers that lemon expects to see.  TK_ALL, TK_UNION, TK_EXCEPT
and so forth are just place-holders for integers - a fact that you
can see by looking at the header file that lemon generates.

> 3.  Lemon prefixes all tokens with TK_ (or whatever you define
> %token_prefix).  But there are some other literals that I can find no
> definition for, e.g. SEMI, COMMA, etc.  I think my confusion here is
> related to my previous question and my next question.

Part of the build process for SQLite adds a few additional integer
token codes that lemon doesn't know about.  There is an AWK script
that does this after lemon runs.

> 4.  I don't see any documentation on using a lexer (e.g. flex) with lemon.
> There are some helps on the internet, but the obvious omission leads me
> to believe I don't need a lexer.  Does lemon take on some of the
> functionality of a lexer?  If I need to define tokens as regular
> expressions can I do that in lemon or do I need a lexer?
>

The tokenizer (a.k.a. lexer) for SQLite is hand coded.  See the
source file "tokenize.c" for the implementation.

I have always found that it is *much* easier to hand-code a
tokenizer than to try to wrestle lex or flex into doing what I want.
A hand-coded tokenizer is also generally much faster (sometimes
orders of magnitude faster) than anything I've ever seen lex or flex
generate.

I know that most people spend at least 50% of the time in their
undergraduate compiler-construction course talking about using
regular expressions to do lexical analysis.  My contention is that
this time is wasted.  Hand coding a lexer is not hard.  See
tokenize.c for an example of how to do it.

--
D. Richard Hipp <[EMAIL PROTECTED]>

-----------------------------------------------------------------------------
To unsubscribe, send email to [EMAIL PROTECTED]
-----------------------------------------------------------------------------
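
For readers following the thread, here is a minimal sketch of the kind of hand-coded tokenizer the reply describes. It is not SQLite's tokenize.c. The TK_ values, the keyword table, and the function names get_token() and tokenize_all() are all hypothetical and chosen for illustration; in a real project the token codes would come from the header file that lemon generates. The sketch shows the lexer mapping keywords to integer codes without regard to case, and supplying codes such as TK_SEMI and TK_COMMA for punctuation.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token codes.  In a real build these would come from the
** header that lemon generates, using whatever %token_prefix you chose. */
enum {
  TK_SPACE = 1, TK_ID, TK_INTEGER, TK_SEMI, TK_COMMA,
  TK_UNION, TK_ALL, TK_EXCEPT, TK_INTERSECT, TK_ILLEGAL
};

/* Keyword table: the lexer, not lemon, ties spellings to integer codes. */
static const struct { const char *zName; int code; } aKeyword[] = {
  { "UNION", TK_UNION },   { "ALL", TK_ALL },
  { "EXCEPT", TK_EXCEPT }, { "INTERSECT", TK_INTERSECT },
};

/* Return 1 if the n bytes at z spell keyword zKw, ignoring case. */
static int keyword_match(const char *z, size_t n, const char *zKw){
  size_t i;
  if( strlen(zKw)!=n ) return 0;
  for(i=0; i<n; i++){
    if( toupper((unsigned char)z[i])!=zKw[i] ) return 0;
  }
  return 1;
}

/* Return the length of the token that starts at z and store its code
** in *pType.  z must not be an empty string. */
static size_t get_token(const char *z, int *pType){
  size_t i = 1;
  unsigned char c = (unsigned char)z[0];
  if( isspace(c) ){
    while( isspace((unsigned char)z[i]) ) i++;
    *pType = TK_SPACE;
  }else if( c==';' ){
    *pType = TK_SEMI;
  }else if( c==',' ){
    *pType = TK_COMMA;
  }else if( isdigit(c) ){
    while( isdigit((unsigned char)z[i]) ) i++;
    *pType = TK_INTEGER;
  }else if( isalpha(c) || c=='_' ){
    size_t k;
    while( isalnum((unsigned char)z[i]) || z[i]=='_' ) i++;
    *pType = TK_ID;  /* an identifier unless it matches a keyword */
    for(k=0; k<sizeof(aKeyword)/sizeof(aKeyword[0]); k++){
      if( keyword_match(z, i, aKeyword[k].zName) ){
        *pType = aKeyword[k].code;
        break;
      }
    }
  }else{
    *pType = TK_ILLEGAL;
  }
  return i;
}

/* Collect up to nMax non-space token codes from z into aOut and return
** how many were stored.  A real driver would hand each code to the
** lemon-generated parser at the marked point instead of storing it. */
static size_t tokenize_all(const char *z, int *aOut, size_t nMax){
  size_t n = 0;
  while( *z && n<nMax ){
    int type;
    z += get_token(z, &type);
    if( type!=TK_SPACE ) aOut[n++] = type;  /* parser call would go here */
  }
  return n;
}
```

Under these assumptions, calling tokenize_all() on "a union ALL b;" stores the codes for ID, UNION, ALL, ID and SEMI. Case is folded only while comparing against the keyword table, so the input text itself never has to be upcased.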

