Afternoon all,

We have successfully built a Java tool which translates a Hive query into a syntax tree, and then turns this back into a Hive query equivalent to the input.
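For context, our rebuild step renders text purely from the tree, so anything the parser drops never reappears. One mitigation we have been sketching is to splice slices of the original query text back in wherever a node can report its source character offsets. A toy, self-contained sketch (the offsets below are hand-computed for the example, not taken from Hive's API):

```java
public class SourceSplice {
    // Emit the verbatim source slice for a node's [start, stop] character
    // range instead of regenerating it from the tree, so optional tokens
    // the parser discarded survive the round trip.
    static String emit(String source, int start, int stop) {
        return source.substring(start, stop + 1);
    }

    public static void main(String[] args) {
        String query = "select a from (select a from b) as c";
        // Hand-computed character range of the alias clause: 32..35.
        System.out.println(emit(query, 32, 35)); // prints: as c
    }
}
```

This only helps for subtrees whose source range is known, but it keeps the output lexically faithful there.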
But we have found that for a query such as select a from ( select a from b ) as c the Hive parser strips out the 'as' when building the tree. When the query string is rebuilt, the output is 'select a from (select a from b) c'; although this is technically valid, it is not equivalent to the input query. Is there any way we can fix this issue? Could we change the parser in some way to stop the 'as' from being stripped out? Any ideas would be greatly appreciated!

Thanks,
Jay

From: Elliot West <tea...@gmail.com>
Date: 19 March 2018 at 21:33
Subject: Re: HQL parser internals
To: user@hive.apache.org

Hello again,

We're now testing our system against a corpus of Hive SQL statements in an effort to quickly highlight edge cases, limitations, etc. We're finding that org.apache.hadoop.hive.ql.parse.ParseUtils stumbles on variables such as ${hiveconf:varname}. Are variable substitutions handled prior to parsing, or within the parser itself? If in a pre-processing stage, are there any code or utility classes within Hive that we can use as a reference, or to provide this functionality?

Cheers,

Elliot.

On 19 February 2018 at 11:10, Elliot West <tea...@gmail.com> wrote:

Thank you all for your rapid responses; some really useful information and pointers in there. We'll keep the list updated with our progress.

On 18 February 2018 at 19:00, Dharmesh Kakadia <dhkaka...@gmail.com> wrote:

+1 for using ParseDriver for this. I have also used it to intercept and augment the query AST. I would also echo others' sentiment that it's quite ugly. It would be great if we could refactor/standardize this; that would make integrating other systems a lot easier.
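To illustrate the intercept-and-augment pattern without dragging in Hive dependencies, here is a toy sketch. The Node type is invented for the example; with Hive you would walk the real org.apache.hadoop.hive.ql.parse.ASTNode in the same depth-first way via its children.

```java
import java.util.ArrayList;
import java.util.List;

public class Augment {
    // Toy tree node standing in for Hive's ASTNode in this sketch.
    static final class Node {
        String text;
        final List<Node> children = new ArrayList<>();
        Node(String text) { this.text = text; }
    }

    // Depth-first rewrite: qualify every unqualified table reference
    // (marked here with a made-up "table:" prefix) with a database name.
    static void qualifyTables(Node n, String db) {
        if (n.text.startsWith("table:") && !n.text.contains(".")) {
            n.text = "table:" + db + "." + n.text.substring("table:".length());
        }
        for (Node c : n.children) qualifyTables(c, db);
    }

    public static void main(String[] args) {
        Node root = new Node("query");
        root.children.add(new Node("table:b"));
        qualifyTables(root, "prod");
        System.out.println(root.children.get(0).text); // prints: table:prod.b
    }
}
```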
Thanks,
Dharmesh

On Sat, Feb 17, 2018 at 12:07 AM, Furcy Pin <pin.fu...@gmail.com> wrote:

Hi Elliot,

Actually, I have done quite similar work on custom Hive parsing; you should have a look at my project: https://github.com/flaminem/flamy

The Hive parsing related code is here:
https://github.com/flaminem/flamy/tree/master/src/main/scala/com/flaminem/flamy/parsing/hive

A good starting point to see how to parse queries is here:
https://github.com/flaminem/flamy/blob/master/src/main/scala/com/flaminem/flamy/parsing/hive/PopulateParserInfo.scala#L492

Basically, all you need is a org.apache.hadoop.hive.ql.parse.ParseDriver:

val pd: ParseDriver = new ParseDriver
val tree: ASTNode = pd.parse(query, hiveContext)

You then get the ASTNode, which you can freely traverse and change. I must say, though, that it is quite ugly to manipulate, and the Presto parser seems to be much better designed (but it does not accept the same syntax, unfortunately); I recommend looking at it for better design ideas.

If you want to enrich your Hive syntax like I did (I wanted to be able to parse ${VARS} in queries), you will not be able to use the HiveParser without some workaround. What I did was replace these ${VARS} with strings "${VARS}" that the HiveParser would agree to parse, and that I could recognize afterwards.

Also, if you are familiar with Scala, I recommend using it; it helps a lot. For instance, I have this class that transforms an AST back into a string query:
https://github.com/flaminem/flamy/blob/master/src/main/scala/com/flaminem/flamy/parsing/hive/ast/SQLFormatter.scala

I could never have written something that good-looking in Java.

Finally, this method helps a lot in understanding how the AST works:
https://github.com/flaminem/flamy/blob/master/src/main/scala/com/flaminem/flamy/parsing/hive/HiveParserUtils.scala#L593

Make sure to write tons of unit tests too; you'll need them.
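Since your tool is in Java, the quoting trick described above looks roughly like the following. This is a simplified sketch: the regex is naive, and the quoted result only parses in positions where a string literal is legal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VarQuoting {
    // Naive placeholder pattern: ${anything-without-a-closing-brace}.
    private static final Pattern VAR = Pattern.compile("\\$\\{[^}]+\\}");

    // Wrap each ${VAR} in single quotes so the parser sees an ordinary
    // string literal; the quoted placeholders can then be recognized in
    // the resulting AST and substituted later.
    static String quoteVars(String query) {
        Matcher m = VAR.matcher(query);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // quoteReplacement: the match itself contains '$', which would
            // otherwise be read as a group reference.
            m.appendReplacement(sb, Matcher.quoteReplacement("'" + m.group() + "'"));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(quoteVars("SELECT a FROM t WHERE x = ${hiveconf:val}"));
        // prints: SELECT a FROM t WHERE x = '${hiveconf:val}'
    }
}
```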
Hope this helps,

Furcy

On 16 February 2018 at 21:20, Gopal Vijayaraghavan <gop...@apache.org> wrote:

> However, ideally we wish to manipulate the original query as delivered by the
> user (or as close to it as possible), and we're finding that the tree has
> been modified significantly by the time it hits the hook

That's CBO. It takes the query -> AST -> Calcite tree -> AST -> hook; the bushy join conversion is already done by the time the hook gets called. We need a parser hook to intercept ahead of CBO, not a semantic analyzer hook.

> Additionally we wish to track ASTNodes back to the character sequences in the
> source HQL that were their origin (where sensible), and ultimately hope to be
> able to regenerate the query text from the AST.

I started work on a Hive unparser a while back based on this class, but it's a world of verbose coding:

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/translator/ASTConverter.java#L850

If you're doing active work on this, I'd like to help, because I need the AST -> query path to debug CBO.

> The use case, if you are interested, is a mutation testing framework for HQL.
> The testing of mutants is operational, but now we need to report on
> survivors, hence the need to track back from specific query elements to
> character sequences in the original query string.

This sounds a lot like the fuzzing random-query-gen used at Cloudera to make Impala and Hive bug-for-bug compatible:

https://cwiki.apache.org/confluence/download/attachments/27362054/Random%20Query%20Gen-%20Hive%20Meetup.pptx

Cheers,
Gopal