Re: Many ANTLR Tokens

2020-04-10 Thread David Mollitor
Hello Gang,

I've been investigating this issue.

This should no longer be an issue with ANTLR4 (ANTLR3 stopped seeing
development circa 2014).  However, ANTLR4 is not fully backwards compatible
with ANTLR3.  In particular, ANTLR4 changes how it approaches "rewrite
rules" operations.  Hive's ANTLR3 grammar uses these operations heavily, so
it is quite a lift to get this upgrade done.  Not to mention, as I work on
fixing some of these things, we may want to backport to Hive 3.x branches.

https://issues.apache.org/jira/browse/HIVE-23177


I also looked at possibly writing a tool that will break up the Java file
that ANTLR3 produces into smaller pieces, but this would require that I
create another Maven module in Hive just for this purpose.  It would be a
custom Maven plugin that reads in the generated source code and then chops
it up a bit to make the compiler happy.  This is possible, but adds quite a
bit of overhead to the project (yet another Maven module to manage).
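
To make the idea concrete, here is a rough sketch of what the chopping step
could look like.  It assumes the token names have already been extracted
from the generated file (that parsing step is not shown), and it re-emits
them as several small helper methods.  Java applies its 64 KB bytecode
limit per method, so each helper gets its own budget.  The class name, the
method names and the chunk size are all hypothetical; none of this exists
in Hive or ANTLR today.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; not part of ANTLR or Hive.
public class TokenNamesSplitter {

    /** Escapes backslashes and double quotes for use inside a Java string literal. */
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /**
     * Emits Java source that declares the tokenNames array and fills it from
     * several small helper methods instead of one giant array initializer.
     */
    static String emitChunked(List<String> names, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        StringBuilder field = new StringBuilder();
        StringBuilder helpers = new StringBuilder();
        field.append("public static final String[] tokenNames = new String[")
             .append(names.size()).append("];\n");
        field.append("static {\n");
        int chunk = 0;
        for (int start = 0; start < names.size(); start += chunkSize, chunk++) {
            int end = Math.min(start + chunkSize, names.size());
            String method = "fillTokenNames" + chunk;
            // One call per chunk in the static block.
            field.append("  ").append(method).append("(tokenNames);\n");
            // One helper method per chunk, assigning its slice of the array.
            helpers.append("private static void ").append(method)
                   .append("(String[] a) {\n");
            for (int i = start; i < end; i++) {
                helpers.append("  a[").append(i).append("] = \"")
                       .append(escape(names.get(i))).append("\";\n");
            }
            helpers.append("}\n");
        }
        field.append("}\n");
        return field.append(helpers).toString();
    }

    public static void main(String[] args) {
        List<String> sample =
            Arrays.asList("<invalid>", "KW_TRUNCATE", "TOK_TRUNCATETABLE");
        System.out.print(emitChunked(sample, 2));
    }
}
```

The real work in a Maven plugin would be locating and replacing the
initializer inside HiveParser.java, which this sketch deliberately leaves
out.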


We can also just remove the duplicate token names.  I understand that
having both grants flexibility, but SQL is a pretty tight standard at this
point and I don't see Hive leveraging this in any meaningful way.  This
would be the path of least resistance.

Thoughts?

Thanks.

On Thu, Apr 9, 2020 at 6:36 PM David Mollitor  wrote:

> Hello Gang,
>
> I am investigating HIVE-23172 and I am having a problem addressing this
> because I am getting the following error from compiling the grammar:
>
> hive-parser: Compilation failure
> [ERROR]
> /home/apache/hive/hive/parser/target/generated-sources/antlr3/org/apache/hadoop/hive/ql/parse/HiveParser.java:[40,38]
> code too large
>
> I traced it down to the fact that there are too many tokens defined.  In
> HiveParser.java, it has the following:
>
>  public static final String[] tokenNames = new String[] { ... };
>
> That list is so long, it's breaking Java compilation.  Someone else came
> across this a while ago: HIVE-15577.
>
> I observed that the parser defines two tokens for most elements, for
> example:
>
> KW_TRUNCATE / TOK_TRUNCATETABLE
>
> What is the value of having both?  Can we consolidate this down to one and
> conserve some space?  I would propose just using  TOK_TRUNCATE and get rid
> of the KW version.
>
> Does anyone have any insight into why things are set up the way they are?
>


Many ANTLR Tokens

2020-04-09 Thread David Mollitor
Hello Gang,

I am investigating HIVE-23172 and I am having a problem addressing this
because I am getting the following error from compiling the grammar:

hive-parser: Compilation failure
[ERROR]
/home/apache/hive/hive/parser/target/generated-sources/antlr3/org/apache/hadoop/hive/ql/parse/HiveParser.java:[40,38]
code too large

I traced it down to the fact that there are too many tokens defined.  In
HiveParser.java, it has the following:

 public static final String[] tokenNames = new String[] { ... };

That list is so long, it's breaking Java compilation.  Someone else came
across this a while ago: HIVE-15577.

I observed that the parser defines two tokens for most elements, for example:

KW_TRUNCATE / TOK_TRUNCATETABLE

What is the value of having both?  Can we consolidate this down to one and
conserve some space?  I would propose just using  TOK_TRUNCATE and get rid
of the KW version.
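
To get a feel for how many names consolidation might remove, something
like the following could be run over the generated tokenNames list.  This
is only a sketch: the class and method names are hypothetical, and the
matching is a crude prefix test (a KW_ name counts if some TOK_ name starts
with the same stem), so it will overcount short keywords.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper; not part of ANTLR or Hive.
public class TokenPairCount {

    /** Returns the KW_ names whose stem is a prefix of some TOK_ name's stem. */
    static List<String> keywordsWithTokCounterpart(List<String> tokenNames) {
        // Stems of all TOK_ names, e.g. TOK_TRUNCATETABLE -> TRUNCATETABLE.
        List<String> tokStems = tokenNames.stream()
            .filter(n -> n.startsWith("TOK_"))
            .map(n -> n.substring("TOK_".length()))
            .collect(Collectors.toList());
        return tokenNames.stream()
            .filter(n -> n.startsWith("KW_"))
            .filter(n -> {
                String stem = n.substring("KW_".length());
                return tokStems.stream().anyMatch(t -> t.startsWith(stem));
            })
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "KW_TRUNCATE", "TOK_TRUNCATETABLE", "KW_SELECT", "Identifier");
        System.out.println(keywordsWithTokCounterpart(sample));
    }
}
```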

Does anyone have any insight into why things are set up the way they are?