Afternoon all,

We have successfully built a Java tool which translates a Hive query into a syntax tree, and then turns this back into a Hive query equivalent to the input.
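For context, our rebuild step renders text purely from the tree, so anything the parser drops never reappears. One mitigation we have been sketching is to splice slices of the original query text back in wherever a node can report its source character offsets. A toy, self-contained sketch (the offsets below are hand-computed for the example, not taken from Hive's API):

```java
public class SourceSplice {
    // Emit the verbatim source slice for a node's [start, stop] character
    // range instead of regenerating it from the tree, so optional tokens
    // the parser discarded survive the round trip.
    static String emit(String source, int start, int stop) {
        return source.substring(start, stop + 1);
    }

    public static void main(String[] args) {
        String query = "select a from (select a from b) as c";
        // Hand-computed character range of the alias clause: 32..35.
        System.out.println(emit(query, 32, 35)); // prints: as c
    }
}
```

This only helps for subtrees whose source range is known, but it keeps the output lexically faithful there.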
But we have found that for a query such as select a from ( select a from b ) as c the Hive parser strips out the 'as' when building the tree. When the query string is rebuilt, the output is 'select a from (select a from b) c'; although this is technically valid, it is not equivalent to the input query. Is there any way we can fix this issue? Could we change the parser in some way to stop the 'as' from being stripped out? Any ideas would be greatly appreciated!

Thanks,
Jay

From: Elliot West <tea...@gmail.com>
Date: 19 March 2018 at 21:33
Subject: Re: HQL parser internals
To: user@hive.apache.org

Hello again,

We're now testing our system against a corpus of Hive SQL statements in an effort to quickly highlight edge cases, limitations, etc. We're finding that org.apache.hadoop.hive.ql.parse.ParseUtils stumbles on variables such as ${hiveconf:varname}. Are variable substitutions handled prior to parsing, or within the parser itself? If in a pre-processing stage, are there any code or utility classes within Hive that we can use as a reference, or to provide this functionality?

Cheers,

Elliot.

On 19 February 2018 at 11:10, Elliot West <tea...@gmail.com> wrote:

Thank you all for your rapid responses; some really useful information and pointers in there. We'll keep the list updated with our progress.

On 18 February 2018 at 19:00, Dharmesh Kakadia <dhkaka...@gmail.com> wrote:

+1 for using ParseDriver for this. I have also used it to intercept and augment the query AST. I would also echo others' sentiment that it's quite ugly. It would be great if we could refactor/standardize this; that would make integrating other systems a lot easier.
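To illustrate the intercept-and-augment pattern without dragging in Hive dependencies, here is a toy sketch. The Node type is invented for the example; with Hive you would walk the real org.apache.hadoop.hive.ql.parse.ASTNode in the same depth-first way via its children.

```java
import java.util.ArrayList;
import java.util.List;

public class Augment {
    // Toy tree node standing in for Hive's ASTNode in this sketch.
    static final class Node {
        String text;
        final List<Node> children = new ArrayList<>();
        Node(String text) { this.text = text; }
    }

    // Depth-first rewrite: qualify every unqualified table reference
    // (marked here with a made-up "table:" prefix) with a database name.
    static void qualifyTables(Node n, String db) {
        if (n.text.startsWith("table:") && !n.text.contains(".")) {
            n.text = "table:" + db + "." + n.text.substring("table:".length());
        }
        for (Node c : n.children) qualifyTables(c, db);
    }

    public static void main(String[] args) {
        Node root = new Node("query");
        root.children.add(new Node("table:b"));
        qualifyTables(root, "prod");
        System.out.println(root.children.get(0).text); // prints: table:prod.b
    }
}
```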
Thanks,
Dharmesh

On Sat, Feb 17, 2018 at 12:07 AM, Furcy Pin <pin.fu...@gmail.com> wrote:

Hi Elliot,

Actually, I have done quite similar work on custom Hive parsing; you should have a look at my project: https://github.com/flaminem/flamy

The Hive parsing related code is here:
https://github.com/flaminem/flamy/tree/master/src/main/scala/com/flaminem/flamy/parsing/hive

A good starting point to see how to parse queries is here:
https://github.com/flaminem/flamy/blob/master/src/main/scala/com/flaminem/flamy/parsing/hive/PopulateParserInfo.scala#L492

Basically, all you need is a org.apache.hadoop.hive.ql.parse.ParseDriver:

val pd: ParseDriver = new ParseDriver
val tree: ASTNode = pd.parse(query, hiveContext)

You then get the ASTNode, which you can freely traverse and change. I must say, though, that it is quite ugly to manipulate, and the Presto parser seems to be much better designed (but it does not accept the same syntax, unfortunately); I recommend looking at it for better design ideas.

If you want to enrich your Hive syntax like I did (I wanted to be able to parse ${VARS} in queries), you will not be able to use the HiveParser without some workaround. What I did was replace these ${VARS} with strings "${VARS}" that the HiveParser would agree to parse, and that I could recognize afterwards.

Also, if you are familiar with Scala, I recommend using it; it helps a lot. For instance, I have this class that transforms an AST back into a string query:
https://github.com/flaminem/flamy/blob/master/src/main/scala/com/flaminem/flamy/parsing/hive/ast/SQLFormatter.scala

I could never have written something that good-looking in Java.

Finally, this method helps a lot in understanding how the AST works:
https://github.com/flaminem/flamy/blob/master/src/main/scala/com/flaminem/flamy/parsing/hive/HiveParserUtils.scala#L593

Make sure to write tons of unit tests too; you'll need them.
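Since your tool is in Java, the quoting trick described above looks roughly like the following. This is a simplified sketch: the regex is naive, and the quoted result only parses in positions where a string literal is legal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VarQuoting {
    // Naive placeholder pattern: ${anything-without-a-closing-brace}.
    private static final Pattern VAR = Pattern.compile("\\$\\{[^}]+\\}");

    // Wrap each ${VAR} in single quotes so the parser sees an ordinary
    // string literal; the quoted placeholders can then be recognized in
    // the resulting AST and substituted later.
    static String quoteVars(String query) {
        Matcher m = VAR.matcher(query);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // quoteReplacement: the match itself contains '$', which would
            // otherwise be read as a group reference.
            m.appendReplacement(sb, Matcher.quoteReplacement("'" + m.group() + "'"));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(quoteVars("SELECT a FROM t WHERE x = ${hiveconf:val}"));
        // prints: SELECT a FROM t WHERE x = '${hiveconf:val}'
    }
}
```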
Hope this helps,

Furcy

On 16 February 2018 at 21:20, Gopal Vijayaraghavan <gop...@apache.org> wrote:

> However, ideally we wish to manipulate the original query as delivered by the
> user (or as close to it as possible), and we're finding that the tree has
> been modified significantly by the time it hits the hook

That's CBO. It takes the query -> AST -> Calcite tree -> AST -> hook; the bushy join conversion is already done by the time the hook gets called. We need a parser hook to intercept ahead of CBO, not a semantic analyzer hook.

> Additionally we wish to track ASTNodes back to the character sequences in the
> source HQL that were their origin (where sensible), and ultimately hope to be
> able to regenerate the query text from the AST.

I started work on a Hive unparser a while back based on this class, but it's a world of verbose coding:

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/translator/ASTConverter.java#L850

If you're doing active work on this, I'd like to help, because I need the AST -> query path to debug CBO.

> The use case, if you are interested, is a mutation testing framework for HQL.
> The testing of mutants is operational, but now we need to report on
> survivors, hence the need to track back from specific query elements to
> character sequences in the original query string.

This sounds a lot like the fuzzing random-query-gen used at Cloudera to make Impala and Hive bug-for-bug compatible:

https://cwiki.apache.org/confluence/download/attachments/27362054/Random%20Query%20Gen-%20Hive%20Meetup.pptx

Cheers,
Gopal