"Wilson, Ron" <[EMAIL PROTECTED]> wrote:
> I'm bringing myself up to speed on lemon.  I have a few questions.  Take
> this example code:
> 
> %type multiselect_op {int}
> multiselect_op(A) ::= UNION(OP).             {A = @OP;}
> multiselect_op(A) ::= UNION ALL.             {A = TK_ALL;}
> multiselect_op(A) ::= EXCEPT|INTERSECT(OP).  {A = @OP;}
> 
> 1. What does the '@' symbol mean?  At first glance I thought it meant,
> 'give me the literal string,' but 'A' is an integer, so that doesn't
> work.  How is {A = @OP;} different from {A = OP;}?

The @-thing is a hack we put in for an embedded device manufacturer
who was short on memory and wanted to make SQLite smaller.  Within 
an action, "@<token>" is replaced by the numeric code for <token>.
So

    multi(A) ::= UNION(OP).  {A = @OP;}

means exactly the same thing as:

    multi(A) ::= UNION(OP).  {A = TK_UNION;}

Why bother, you ask?  First off, notice that it does make a difference
for the rule:

    multi(A) ::= EXCEPT|INTERSECT(OP).  {A = @OP;}

In this rule, @OP can have two different values depending on which
of the two tokens matched.  Without the @<token> construct, we would
have had to make two separate rules:

    multi(A) ::= EXCEPT.    {A = TK_EXCEPT;}
    multi(A) ::= INTERSECT. {A = TK_INTERSECT;}

The two rules are equivalent, but the single rule takes up less space.

The reason for using {A = @OP;} as the action for UNION is that the
action is character-by-character identical to the action for the
EXCEPT|INTERSECT rule.  And Lemon has a feature where it coalesces
identical actions, thus saving additional code space.

So to answer your question, the @-thing is all just an optimization
to help make SQLite smaller.  You can easily omit it if you find it
confusing.

> 2. If TK_ALL is a token, what are the other all-caps literals?  I assume
> they are literal text, i.e. 'UNION' is a keyword in sql.  However, SQL
> is not case sensitive, so explain how case is handled with these
> literals.  I don't think sqlite upcases all text sent to the parser, so
> there must be some rule that controls the case sensitivity.  Or maybe
> these are also tokens?

The grammar knows about terminals (tokens) and non-terminals.  In
Lemon, every identifier that starts with an upper-case letter is a
terminal.  Every identifier that starts with a lower-case letter
is a non-terminal.  By convention, we put all identifiers in a single
case so that terminals are all upper-case and non-terminals are all
lower-case.  But Lemon really only looks at the first character.

Also by convention, we make the names of terminals the same as
the corresponding keywords in whatever language we are parsing.
That just aids comprehension for human readers.  Lemon doesn't care.
Lemon goes through and assigns ascending integers to all of your
terminals.  It is the lexer's job to associate input keywords with
the integers that lemon expects to see.  TK_ALL, TK_UNION, TK_EXCEPT
and so forth are just place-holders for integers - a fact that you
can see by looking at the header file that lemon generates.

> 3. Lemon prefixes all tokens with TK_ (or whatever you define
> %token_prefix).  But there are some other literals that I can find no
> definition for, e.g. SEMI, COMMA, etc.  I think my confusion here is
> related to my previous question and my next question.

Part of the build process for SQLite adds a few additional integer
token codes that lemon doesn't know about.  There is an AWK script
that does this after lemon runs.

> 4. I don't see any documentation on using a lexer (e.g. flex) with lemon.
> There are some helps on the internet, but the obvious omission leads me
> to believe I don't need a lexer.  Does lemon take on some of the
> functionality of a lexer?  If I need to define tokens as regular
> expressions can I do that in lemon or do I need a lexer?
> 

The tokenizer (a.k.a. lexer) for SQLite is hand coded.  See the
source file "tokenizer.c" for the implementation.

I have always found that it is *much* easier to hand-code a tokenizer
than to try to wrestle lex or flex into doing what I want.  A
hand-coded tokenizer is also generally much faster (sometimes 
orders of magnitude faster) than anything I've ever seen lex or
flex generate.  I know that most people spend at least half of
their undergraduate compiler-construction course talking about
using regular expressions to do lexical analysis.  My contention
is that this time is wasted.  Hand-coding a lexer is not hard.
See tokenizer.c for an example of how to do it.

--
D. Richard Hipp <[EMAIL PROTECTED]>

