[jira] [Commented] (HIVE-7384) Research into reduce-side join [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14105301#comment-14105301 ] Lianhui Wang commented on HIVE-7384:

I think current Spark already supports hashing by join_col and sorting by {join_col, tag}: in Spark, the map side's shuffle writer partitions by Key.hashCode and sorts by Key, and in Hive the HiveKey class already defines hashCode. So Spark can hash by HiveKey.hashCode and sort by HiveKey's bytes.

Research into reduce-side join [Spark Branch]
---------------------------------------------
Key: HIVE-7384
URL: https://issues.apache.org/jira/browse/HIVE-7384
Project: Hive
Issue Type: Sub-task
Components: Spark
Reporter: Xuefu Zhang
Assignee: Szehon Ho
Attachments: Hive on Spark Reduce Side Join.docx, sales_items.txt, sales_products.txt, sales_stores.txt

Hive's join operator is very sophisticated, especially for reduce-side join. While we expect that other types of join, such as map-side join and SMB map-side join, will work out of the box with our design, there may be some complication in reduce-side join, which extensively utilizes key tags and shuffle behavior. Our design principle prefers making the Hive implementation work out of the box as well, which might require new functionality from Spark. The task is to research this area, identifying requirements for the Spark community and the work to be done on Hive to make reduce-side join work. A design doc might be needed for this. For more information, please refer to the overall design doc on the wiki.

--
This message was sent by Atlassian JIRA (v6.2#6252)
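The hash-by-join-key, sort-by-(key, tag) shuffle described in the comment can be sketched with a toy model (illustration only; all names here are invented, not Hive or Spark code): rows from each input carry a tag, partitions are chosen from the join key's hash alone, and each partition is sorted by (key, tag) so a reducer sees all rows for a key together, smaller-tagged input first.

```python
from collections import defaultdict

def shuffle(records, num_reducers):
    """Partition by hash(join_key) only, then sort each partition by (key, tag).

    records: iterable of (key, tag, row). Tag 0/1 marks which join input a
    row came from, mimicking Hive's reduce-side join key tagging.
    """
    partitions = defaultdict(list)
    for key, tag, row in records:
        partitions[hash(key) % num_reducers].append((key, tag, row))
    for part in partitions.values():
        part.sort(key=lambda r: (r[0], r[1]))  # sort by key, then tag
    return partitions

def reduce_join(partition):
    """Join rows sharing a key; tag-0 rows arrive before tag-1 rows."""
    out, buffered, current = [], [], object()
    for key, tag, row in partition:
        if key != current:
            buffered, current = [], key
        if tag == 0:
            buffered.append(row)                    # buffer the first input
        else:
            out.extend((l, row) for l in buffered)  # stream the second input
    return out

left = [(k, 0, ("L", k)) for k in [1, 2, 2, 3]]
right = [(k, 1, ("R", k)) for k in [2, 3, 4]]
parts = shuffle(left + right, num_reducers=2)
joined = [pair for p in parts.values() for pair in reduce_join(p)]
```

Because partitioning ignores the tag, both inputs' rows for one key land on the same reducer, which is the property the comment argues Spark already provides.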
[jira] [Commented] (HIVE-7384) Research into reduce-side join [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106343#comment-14106343 ] Lianhui Wang commented on HIVE-7384:

@Szehon Ho yes, I read the OrderedRDDFunctions code and discovered that sortByKey actually does a range partition. We need to replace the range partition with a hash partition, so Spark may need to provide a new interface, for example partitionSortByKey.

@Brock Noland the code in 1) means that when Hive samples data and there is more than one reducer, it does a total-order sort. A join does not sample data, so it does not need a total-order sort. 2) I think we really need auto-parallelism. Before, I talked about it with Reynold Xin; Spark needs to support re-partitioning map output data the same way Tez does.
[jira] [Commented] (HIVE-7384) Research into reduce-side join [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106407#comment-14106407 ] Lianhui Wang commented on HIVE-7384:

I think those thoughts are the same as the ideas you mentioned before, like HIVE-7158: auto-calculating the number of reducers based on some input from Hive (upper/lower bound).
[jira] [Commented] (HIVE-3430) group by followed by join with the same key should be optimized
[ https://issues.apache.org/jira/browse/HIVE-3430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13717883#comment-13717883 ] Lianhui Wang commented on HIVE-3430:

Yin Huai, very nice work!

group by followed by join with the same key should be optimized
---------------------------------------------------------------
Key: HIVE-3430
URL: https://issues.apache.org/jira/browse/HIVE-3430
Project: Hive
Issue Type: Improvement
Components: Query Processor
Affects Versions: 0.10.0
Reporter: Namit Jain

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HIVE-4506) use one map reduce to join multiple small tables
[ https://issues.apache.org/jira/browse/HIVE-4506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13650371#comment-13650371 ] Lianhui Wang commented on HIVE-4506:

Fern, can you provide your SQL? If the tables use the same column in the join clause, one MR job is used. Example:

explain
SELECT /*+ MAPJOIN(src2,src3) */ src1.key, src3.value
FROM src src1
JOIN src src2 ON (src1.key = src2.key)
JOIN src src3 ON (src1.key = src3.key);

use one map reduce to join multiple small tables
------------------------------------------------
Key: HIVE-4506
URL: https://issues.apache.org/jira/browse/HIVE-4506
Project: Hive
Issue Type: Wish
Affects Versions: 0.10.0
Reporter: Fern
Priority: Minor

I know we can use map-side join for small tables. By my test, if I use HQL like this:

select /*+ mapjoin(b,c) */ ... from a left join b on ... left join c on ...

where b and c are both small tables, I expect the join to be done in one map-reduce job using map-side join. Actually, it generates two map-reduce jobs in sequence. Sorry, currently I am just a user of Hive and have not dug into the code, so this is what I expect, but I have no idea how to improve it now.
[jira] [Commented] (HIVE-4506) use one map reduce to join multiple small tables
[ https://issues.apache.org/jira/browse/HIVE-4506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13650380#comment-13650380 ] Lianhui Wang commented on HIVE-4506:

If the tables join on different columns, HIVE-3784 resolved joining one big table with multiple small tables.
[jira] [Commented] (HIVE-4429) Nested ORDER BY produces incorrect result
[ https://issues.apache.org/jira/browse/HIVE-4429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13643453#comment-13643453 ] Lianhui Wang commented on HIVE-4429:

Hi Mihir Kulkarni, I ran the first SQL of your cases, but on my Hive 0.9 it produces the correct result, as follows:

30.0	1.0
20.0	1.0
10.0	1.0
30.0	2.0
20.0	2.0
10.0	2.0
30.0	3.0
20.0	3.0
10.0	3.0
60.0	4.0
50.0	4.0
40.0	4.0
60.0	5.0
50.0	5.0
40.0	5.0
60.0	6.0
50.0	6.0
40.0	6.0

So can you tell me which version you used?

Nested ORDER BY produces incorrect result
-----------------------------------------
Key: HIVE-4429
URL: https://issues.apache.org/jira/browse/HIVE-4429
Project: Hive
Issue Type: Bug
Components: Query Processor, SQL, UDF
Affects Versions: 0.9.0
Environment: Red Hat Linux VM with Hive 0.9 and Hadoop 2.0
Reporter: Mihir Kulkarni
Priority: Critical
Attachments: Hive_Command_Script.txt, HiveQuery.txt, Test_Data.txt

Nested ORDER BY clause doesn't honor the outer one in a specific case. The query below produces a result which honors only the inner ORDER BY clause (it produces only 1 MapRed job):

{code:borderStyle=solid}
SELECT alias.b0 as d0, alias.b1 as d1
FROM (SELECT test.a0 as b0, test.a1 as b1
      FROM test
      ORDER BY b1 ASC, b0 DESC) alias
ORDER BY d0 ASC, d1 DESC;
{code}

On the other hand, the query below honors the outer ORDER BY clause, which produces the correct result (it produces 2 MapRed jobs):

{code:borderStyle=solid}
SELECT alias.b0 as d0, alias.b1 as d1
FROM (SELECT test.a0 as b0, test.a1 as b1
      FROM test
      ORDER BY b1 ASC, b0 DESC) alias
ORDER BY d0 DESC, d1 DESC;
{code}

Any other combination of nested ORDER BY clauses does produce the correct result. Please see attachments for query, schema, and Hive commands for the repro case.
[jira] [Commented] (HIVE-4365) wrong result in left semi join
[ https://issues.apache.org/jira/browse/HIVE-4365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13633664#comment-13633664 ] Lianhui Wang commented on HIVE-4365:

Hi ransom, the problem also exists in my environment. I used an explain statement and found that the second SQL's PPD is wrong:

TableScan
  alias: t2
  Filter Operator
    predicate:
        expr: (c1 = 1)
        type: boolean

The PPD optimizer pushes the filter c1='1' down to both table t1 and t2, but the correct behavior is to push it to table t1 only, not t2.

wrong result in left semi join
------------------------------
Key: HIVE-4365
URL: https://issues.apache.org/jira/browse/HIVE-4365
Project: Hive
Issue Type: Bug
Components: Query Processor
Affects Versions: 0.9.0, 0.10.0
Reporter: ransom.hezhiqiang

Wrong result in left semi join while hive.optimize.ppd=true. For example:

1. Create tables:

create table t1(c1 int, c2 int, c3 int, c4 int, c5 double, c6 int, c7 string) row format DELIMITED FIELDS TERMINATED BY '|';
create table t2(c1 int);

2. Load data:

load data local inpath '/home/test/t1.txt' OVERWRITE into table t1;
load data local inpath '/home/test/t2.txt' OVERWRITE into table t2;

t1 data:
1|3|10003|52|781.96|555|201203
1|3|10003|39|782.96|555|201203
1|3|10003|87|783.96|555|201203
2|5|10004|24|789.96|555|201203
2|5|10004|58|788.96|555|201203

t2 data:
555

3. Execute queries:

select t1.c1,t1.c2,t1.c3,t1.c4,t1.c5,t1.c6,t1.c7 from t1 left semi join t2 on t1.c6 = t2.c1 and t1.c1 = '1' and t1.c7 = '201203';

returns results.

select t1.c1,t1.c2,t1.c3,t1.c4,t1.c5,t1.c6,t1.c7 from t1 left semi join t2 on t1.c6 = t2.c1 where t1.c1 = '1' and t1.c7 = '201203';

returns no results.
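The misbehavior described above can be illustrated with a toy model of predicate pushdown (plain Python with invented names, not Hive code): t2's own c1 column holds the join keys, so pushing t1's filter c1 = 1 onto t2 empties the right side of the semi join.

```python
def left_semi_join(left, right_keys):
    """Keep left rows whose join key (column c6) appears in right_keys."""
    return [row for row in left if row["c6"] in right_keys]

t1 = [
    {"c1": 1, "c6": 555},
    {"c1": 1, "c6": 555},
    {"c1": 2, "c6": 555},
]
t2 = [{"c1": 555}]

# Correct: push the predicate t1.c1 = 1 down to t1 only.
good = left_semi_join([r for r in t1 if r["c1"] == 1],
                      {r["c1"] for r in t2})

# Buggy PPD: the same predicate is also pushed to t2, whose own c1
# column holds the join keys (555), so the right side becomes empty
# and the semi join returns nothing.
bad = left_semi_join([r for r in t1 if r["c1"] == 1],
                     {r["c1"] for r in t2 if r["c1"] == 1})
```

The filter is legal on t1 because c1 there is a filtered column; on t2 the same column name refers to entirely different data, which is why the push is incorrect.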
[jira] [Commented] (HIVE-3963) Allow Hive to connect to RDBMS
[ https://issues.apache.org/jira/browse/HIVE-3963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13596639#comment-13596639 ] Lianhui Wang commented on HIVE-3963:

I think it must support an AS clause, like the transform syntax. For example:

SELECT jdbcload('driver','url','user','password','sql') as c1, c2 FROM dual;

Allow Hive to connect to RDBMS
------------------------------
Key: HIVE-3963
URL: https://issues.apache.org/jira/browse/HIVE-3963
Project: Hive
Issue Type: New Feature
Components: Import/Export, JDBC, SQL, StorageHandler
Affects Versions: 0.9.0, 0.10.0, 0.9.1, 0.11.0
Reporter: Maxime LANCIAUX

I am thinking about something like:

SELECT jdbcload('driver','url','user','password','sql') FROM dual;

There is already a JIRA https://issues.apache.org/jira/browse/HIVE-1555 for JDBCStorageHandler.
[jira] [Commented] (HIVE-4137) optimize group by followed by joins for bucketed/sorted tables
[ https://issues.apache.org/jira/browse/HIVE-4137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13596690#comment-13596690 ] Lianhui Wang commented on HIVE-4137:

In addition, for bucketed/sorted tables, a single group-by operator only needs the map-side group-by operator and does not need a reduce-side group-by operator. Example: select key, aggr() from T1 group by key. The current plan is

TS-SEL-GBY-RS-GBY-SEL-FS

but it can change to the following plan:

TS-SEL-GBY-SEL-FS

optimize group by followed by joins for bucketed/sorted tables
--------------------------------------------------------------
Key: HIVE-4137
URL: https://issues.apache.org/jira/browse/HIVE-4137
Project: Hive
Issue Type: Improvement
Components: Query Processor
Reporter: Namit Jain

Consider the following scenario:

create table T1 (...) clustered by (key) sorted by (key) into 2 buckets;
create table T2 (...) clustered by (key) sorted by (key) into 2 buckets;
create table T3 (...) clustered by (key) sorted by (key) into 2 buckets;

SET hive.enforce.sorting=true;
SET hive.enforce.bucketing=true;

insert overwrite table T3
select .. from
(select key, aggr() from T1 group by key) s1
full outer join
(select key, aggr() from T2 group by key) s2
on s1.key=s2.key;

Ideally, this query can be performed in a single map-only job: Group By - SortMerge Join.
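The map-only plan works because each bucket file is already sorted by the group key, so a mapper can emit a finished group as soon as the key changes. A minimal streaming sketch (plain Python with illustrative names, SUM standing in for aggr(); not Hive code):

```python
def map_side_group_by(sorted_rows):
    """Streaming SUM over (key, value) rows already sorted by key.

    Because a bucket file is clustered and sorted by the group key, a
    group is complete the moment the key changes, so results can be
    emitted directly from the mapper with no reduce stage (no RS-GBY).
    """
    out = []
    current, acc = None, 0
    for key, value in sorted_rows:
        if current is not None and key != current:
            out.append((current, acc))  # key changed: group is finished
            acc = 0
        current = key
        acc += value
    if current is not None:
        out.append((current, acc))      # flush the final group
    return out

rows = [("a", 1), ("a", 2), ("b", 5), ("c", 3), ("c", 4)]
result = map_side_group_by(rows)
```

Only constant state (one key and one accumulator) is held at a time, which is what makes dropping the reduce-side group-by safe for sorted input.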
[jira] [Commented] (HIVE-3430) group by followed by join with the same key should be optimized
[ https://issues.apache.org/jira/browse/HIVE-3430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13590321#comment-13590321 ] Lianhui Wang commented on HIVE-3430:

The following query should also be considered:

SELECT a.key, a.cnt, b.key, a.cnt
FROM (SELECT x.key as key, count(x.value) AS cnt FROM src x group by x.key) a
JOIN src b ON (a.key = b.key);
[jira] [Commented] (HIVE-4014) Hive+RCFile is not doing column pruning and reading much more data than necessary
[ https://issues.apache.org/jira/browse/HIVE-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13589226#comment-13589226 ] Lianhui Wang commented on HIVE-4014:

Hi Tamas, thank you very much, you are right. Also, I think the RCFile reader is not very efficient; the column IDs to read are transferred to the RCFile reader.

Hive+RCFile is not doing column pruning and reading much more data than necessary
---------------------------------------------------------------------------------
Key: HIVE-4014
URL: https://issues.apache.org/jira/browse/HIVE-4014
Project: Hive
Issue Type: Bug
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli

With even simple projection queries, I see that the HDFS bytes read counter doesn't show any reduction in the amount of data read.
[jira] [Commented] (HIVE-4014) Hive+RCFile is not doing column pruning and reading much more data than necessary
[ https://issues.apache.org/jira/browse/HIVE-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13586701#comment-13586701 ] Lianhui Wang commented on HIVE-4014:

I don't think so. I have looked at the code: in HiveInputFormat's and CombineHiveInputFormat's getRecordReader(), pushProjectionsAndFilters() is called. In pushProjectionsAndFilters(), the needed columns are obtained from the TableScanOperator and their IDs are set in hive.io.file.readcolumn.ids. Then RCFile.Reader reads hive.io.file.readcolumn.ids to skip columns. Maybe the counter has some mistakes. If I am mistaken, please tell me. Thanks.
[jira] [Commented] (HIVE-3420) Inefficiency in hbase handler when process query including rowkey range scan
[ https://issues.apache.org/jira/browse/HIVE-3420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13482075#comment-13482075 ] Lianhui Wang commented on HIVE-3420:

@Gong Deng yes, I agree with you. In the InputFormat's getRecordReader():

tableSplit = convertFilter(jobConf, scan, tableSplit, iKey,
    getStorageFormatOfKey(columnsMapping.get(iKey).mappingSpec,
        jobConf.get(HBaseSerDe.HBASE_TABLE_DEFAULT_STORAGE_TYPE, string)));

it has done:

tableSplit = new TableSplit(
    tableSplit.getTableName(),
    startRow, stopRow,
    tableSplit.getRegionLocation(),
    tableSplit.getConf());

Also, in getSplits(), one tableSplit leads to one task per region location; now those splits have no effect, so the startRow/stopRow in a tableSplit should lie inside the region's row range.

IMO, the convertFilter() logic is used in many places, for example:
HBaseStorageHandler.decomposePredicate()
HiveHBaseTableInputFormat.getSplits()
HiveHBaseTableInputFormat.getRecordReader()

I think it should be used in one place: HBaseStorageHandler.decomposePredicate(), which can store the row-key ranges. Then HiveHBaseTableInputFormat.getSplits() and HiveHBaseTableInputFormat.getRecordReader() can split the key ranges into tasks according to the table's region info. Do others have ideas? Thanks.

Inefficiency in hbase handler when process query including rowkey range scan
----------------------------------------------------------------------------
Key: HIVE-3420
URL: https://issues.apache.org/jira/browse/HIVE-3420
Project: Hive
Issue Type: Improvement
Components: HBase Handler
Affects Versions: 0.9.0
Environment: Hive-0.9.0 + HBase-0.94.1
Reporter: Gang Deng
Priority: Critical
Original Estimate: 2h
Remaining Estimate: 2h

When querying Hive with an HBase rowkey range, Hive map tasks do not leverage the startrow/endrow information in the TableSplit. For example, if the rowkeys fit into 5 HBase files, then there will be 5 map tasks. Ideally, each task would process 1 file; but in the current implementation, each task processes all 5 files repeatedly. This behavior not only wastes network bandwidth, but also worsens lock contention in the HBase block cache, as every task has to access the same blocks. The problem code is in HiveHBaseTableInputFormat.convertFilter, as below:

……
if (tableSplit != null) {
  tableSplit = new TableSplit(
      tableSplit.getTableName(),
      startRow, stopRow,
      tableSplit.getRegionLocation());
}
scan.setStartRow(startRow);
scan.setStopRow(stopRow);
……

As the TableSplit already includes the startRow/endRow information of the file, a better implementation would be:

……
byte[] splitStart = startRow;
byte[] splitStop = stopRow;
if (tableSplit != null) {
  if (tableSplit.getStartRow() != null) {
    splitStart = startRow.length == 0 ||
        Bytes.compareTo(tableSplit.getStartRow(), startRow) >= 0 ?
        tableSplit.getStartRow() : startRow;
  }
  if (tableSplit.getEndRow() != null) {
    splitStop = (stopRow.length == 0 ||
        Bytes.compareTo(tableSplit.getEndRow(), stopRow) <= 0) &&
        tableSplit.getEndRow().length > 0 ?
        tableSplit.getEndRow() : stopRow;
  }
  tableSplit = new TableSplit(
      tableSplit.getTableName(),
      splitStart, splitStop,
      tableSplit.getRegionLocation());
}
scan.setStartRow(splitStart);
scan.setStopRow(splitStop);
……

In my test, the changed code improves performance by more than 30%.
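The fix above amounts to intersecting the predicate's row-key range with each split's own region range. A byte-string sketch of that intersection (plain Python with invented names, not the HBase API; b"" stands in for HBase's empty start/stop row, meaning "unbounded"):

```python
def intersect_ranges(scan_start, scan_stop, split_start, split_stop):
    """Intersect a scan's row-key range with one region split's range.

    Returns (start, stop) for this split, or None if the ranges do not
    overlap. b"" mimics HBase's empty row key, i.e. an open bound.
    """
    # Take the larger of the two start keys (empty means unbounded below).
    start = split_start if scan_start == b"" or split_start > scan_start else scan_start
    # Take the smaller of the two stop keys (empty means unbounded above).
    if split_stop == b"":
        stop = scan_stop
    elif scan_stop == b"":
        stop = split_stop
    else:
        stop = min(split_stop, scan_stop)
    if stop != b"" and start >= stop:
        return None  # this split lies entirely outside the scan range
    return (start, stop)

# A scan over [b"k20", b"k80") against three region splits:
splits = [(b"", b"k30"), (b"k30", b"k60"), (b"k60", b"")]
clipped = [intersect_ranges(b"k20", b"k80", s, e) for s, e in splits]
```

Each task then scans only the clipped range of its own region instead of re-reading the full predicate range, which is the ~30% win reported in the issue.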
[jira] [Commented] (HIVE-1643) support range scans and non-key columns in HBase filter pushdown
[ https://issues.apache.org/jira/browse/HIVE-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13482080#comment-13482080 ] Lianhui Wang commented on HIVE-1643:

Ashutosh Chauhan: Is this correct? What about filters with OR conditions and nested filters — do you plan to add support for those?

select * from tt where col1 > 23 or (col2 > 2 and col3 = 5) or (col4 = 6 and (col5 = 3 or col6 = 7));

I think range analysis is needed there. In MySQL, the SQL optimizer includes range analysis on partitions and indexes; a binary tree represents the condition ranges. But there are some difficulties in task splitting, because there may be many small ranges in one table region. So perhaps we can merge multiple small ranges within one region and use a row-key filter; that can reduce the number of visits to one region.

support range scans and non-key columns in HBase filter pushdown
----------------------------------------------------------------
Key: HIVE-1643
URL: https://issues.apache.org/jira/browse/HIVE-1643
Project: Hive
Issue Type: Improvement
Components: HBase Handler
Affects Versions: 0.9.0
Reporter: John Sichi
Assignee: bharath v
Labels: patch
Attachments: hbase_handler.patch, Hive-1643.2.patch, HIVE-1643.patch

HIVE-1226 added support for WHERE rowkey=3. We would like to support WHERE rowkey BETWEEN 10 and 20, as well as predicates on non-rowkeys (plus conjunctions etc). Non-rowkey conditions can't be used to filter out entire ranges, but they can be used to push the per-row filter processing as far down as possible.
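The merge idea in the comment — collapse the many small key ranges that fall into one region into a single covering scan per region, leaving fine-grained filtering to a row-key filter — can be sketched as follows (plain Python; region_of is a hypothetical region-lookup stand-in, not an HBase call):

```python
def merge_ranges_per_region(ranges, region_of):
    """Group small (start, stop) key ranges by region and collapse each
    group into one covering scan range, so each region is visited once.

    ranges: (start, stop) key ranges produced by range analysis of the
    WHERE clause. Rows inside the covering range but outside the original
    small ranges would be dropped by a per-row row-key filter.
    """
    by_region = {}
    for start, stop in ranges:
        region = region_of(start)
        if region in by_region:
            lo, hi = by_region[region]
            by_region[region] = (min(lo, start), max(hi, stop))
        else:
            by_region[region] = (start, stop)
    return by_region

# Toy region layout: keys below "m" belong to region 0, the rest to region 1.
region_of = lambda key: 0 if key < "m" else 1
merged = merge_ranges_per_region(
    [("a", "b"), ("c", "d"), ("x", "y"), ("e", "f")], region_of)
```

The trade-off is exactly the one the comment raises: one scan per region instead of one per range, at the cost of reading (and filtering out) the gaps between the merged ranges.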
[jira] [Commented] (HIVE-3561) Build a full SQL-compliant parser for Hive
[ https://issues.apache.org/jira/browse/HIVE-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13473167#comment-13473167 ] Lianhui Wang commented on HIVE-3561:

For the first approach, there is a problem: standard SQL cannot support the HiveQL that has been written historically, because there is a big difference in some operators, for example join. So it might take a lot of time to migrate existing HiveQL to standard SQL. In my opinion, in the short term, both should co-exist.

Build a full SQL-compliant parser for Hive
------------------------------------------
Key: HIVE-3561
URL: https://issues.apache.org/jira/browse/HIVE-3561
Project: Hive
Issue Type: Sub-task
Components: Query Processor
Affects Versions: 0.10.0
Reporter: Shengsheng Huang

To build a full SQL-compliant engine on Hive, we'll need a full SQL-compliant parser. The current Hive parser misses a lot of grammar units from standard SQL. To support full SQL there are possibly four approaches:

1. Extend the existing Hive parser to support full SQL constructs. We need to modify the current Hive.g, add any missing grammar units, and resolve conflicts.
2. Reuse an existing open-source SQL-compliant parser and extend it to support Hive extensions. We may need to adapt the Semantic Analyzers to the new AST structure.
3. Reuse an existing SQL-compliant parser and make it co-exist with the existing Hive parser. Both parsers share the same CliDriver interface. Use a query-mode configuration to switch between SQL and HQL (this is the approach we're now using in the 0.9.0 demo project).
4. Reuse an existing SQL-compliant parser and make it co-exist with the existing Hive parser. Use a separate xxxCliDriver interface for standard SQL.

Let's discuss which is the best approach.
[jira] [Commented] (HIVE-3472) Build An Analytical SQL Engine for MapReduce
[ https://issues.apache.org/jira/browse/HIVE-3472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13462391#comment-13462391 ] Lianhui Wang commented on HIVE-3472:

NexR has done some work on translating Oracle SQL to Hive SQL. The session's address (from page 18): http://www.slideshare.net/cloudera/hadoop-world-2011-replacing-rdbdw-with-hadoop-and-hive-for-telco-big-data-jason-han-nexr

I think we should transform Oracle's syntax tree into Hive's syntax tree; that may be easy. Another option is to directly translate Oracle SQL into a Hive query plan, but I think that needs more time and work.

Build An Analytical SQL Engine for MapReduce
--------------------------------------------
Key: HIVE-3472
URL: https://issues.apache.org/jira/browse/HIVE-3472
Project: Hive
Issue Type: New Feature
Affects Versions: 0.10.0
Reporter: Shengsheng Huang
Attachments: SQL-design.pdf

While there are continuous efforts in extending Hive's SQL support (e.g., see some recent examples such as HIVE-2005 and HIVE-2810), many widely used SQL constructs are still not supported in HiveQL, such as selecting from multiple tables, subqueries in WHERE clauses, etc. We propose to build a SQL-92 fully compatible engine (for MapReduce-based analytical query processing) as an extension to Hive. The SQL frontend will co-exist with the HiveQL frontend; consequently, one can mix SQL and HiveQL statements in their queries (switching between HiveQL mode and SQL-92 mode using a "hive.ql.mode" parameter before each query statement). This way useful Hive extensions are still accessible to users.
[jira] [Created] (HIVE-3329) Support bucket filtering when where expression or join key expression has the bucket key
Lianhui Wang created HIVE-3329:
-------------------------------
Summary: Support bucket filtering when where expression or join key expression has the bucket key
Key: HIVE-3329
URL: https://issues.apache.org/jira/browse/HIVE-3329
Project: Hive
Issue Type: New Feature
Components: Query Processor
Reporter: Lianhui Wang

HIVE-3306 introduces a context. Example:

select /* + MAPJOIN(a) */ count(*) FROM bucket_small a JOIN bucket_big b ON a.key + a.key = b.key;

There are also some other contexts. I know the following examples:

1. the join expression is ON (a.key = b.key and a.key=10);
2. select * from bucket_small where a.key=10;
3. the table is a partitioned table, which may be complex. Example:

CREATE TABLE srcbucket_part (key string, value string) partitioned by (ds string) CLUSTERED BY (key) INTO 4 BUCKETS STORED AS RCFile;
select * from srcbucket_part where key='455' and ds='2008-04-08';

A more complex SQL is:

select * from srcbucket_part where (key='455' and ds='2008-04-08') or ds='2008-04-09';

These contexts should not scan all of the table's files; they should scan only the matching bucket files in the table path.
[jira] [Commented] (HIVE-3306) SMBJoin/BucketMapJoin should be allowed only when join key expression is exactly matches with sort/cluster key
[ https://issues.apache.org/jira/browse/HIVE-3306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13427127#comment-13427127 ] Lianhui Wang commented on HIVE-3306:

@Namit, I created a new JIRA, HIVE-3329; there may be some tasks there. I have now finished the work for the non-partitioned-table case; next I will work on partitioned tables.

SMBJoin/BucketMapJoin should be allowed only when join key expression is exactly matches with sort/cluster key
--------------------------------------------------------------------------------------------------------------
Key: HIVE-3306
URL: https://issues.apache.org/jira/browse/HIVE-3306
Project: Hive
Issue Type: Bug
Components: Query Processor
Affects Versions: 0.10.0
Reporter: Navis
Assignee: Navis
Priority: Minor

CREATE TABLE bucket_small (key int, value string) CLUSTERED BY (key) SORTED BY (key) INTO 2 BUCKETS STORED AS TEXTFILE;
load data local inpath '/home/navis/apache/oss-hive/data/files/srcsortbucket1outof4.txt' INTO TABLE bucket_small;
load data local inpath '/home/navis/apache/oss-hive/data/files/srcsortbucket2outof4.txt' INTO TABLE bucket_small;

CREATE TABLE bucket_big (key int, value string) CLUSTERED BY (key) SORTED BY (key) INTO 4 BUCKETS STORED AS TEXTFILE;
load data local inpath '/home/navis/apache/oss-hive/data/files/srcsortbucket1outof4.txt' INTO TABLE bucket_big;
load data local inpath '/home/navis/apache/oss-hive/data/files/srcsortbucket2outof4.txt' INTO TABLE bucket_big;
load data local inpath '/home/navis/apache/oss-hive/data/files/srcsortbucket3outof4.txt' INTO TABLE bucket_big;
load data local inpath '/home/navis/apache/oss-hive/data/files/srcsortbucket4outof4.txt' INTO TABLE bucket_big;

select count(*) FROM bucket_small a JOIN bucket_big b ON a.key + a.key = b.key;
select /* + MAPJOIN(a) */ count(*) FROM bucket_small a JOIN bucket_big b ON a.key + a.key = b.key;

Both return 116 (the same). But with BucketMapJoin or SMBJoin, it returns 61. This should not be allowed, because hash(a.key) != hash(a.key + a.key). The bucket context should be utilized only when the join expression exactly matches the sort/cluster key.
[jira] [Commented] (HIVE-3254) Reuse RunningJob
[ https://issues.apache.org/jira/browse/HIVE-3254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13424569#comment-13424569 ] Lianhui Wang commented on HIVE-3254:

Yes, I think that can be done, but newRj may be null, so you must check for null: the JobTracker only caches a fixed-size set of completed jobs' info, so if the job you fetch has completed, the JT may already have removed the job's information.

Reuse RunningJob
----------------
Key: HIVE-3254
URL: https://issues.apache.org/jira/browse/HIVE-3254
Project: Hive
Issue Type: Bug
Reporter: binlijin

private MapRedStats progress(ExecDriverTaskHandle th) throws IOException {
  while (!rj.isComplete()) {
    try {
      Thread.sleep(pullInterval);
    } catch (InterruptedException e) {
    }
    RunningJob newRj = jc.getJob(rj.getJobID());
  }
}

Should we reuse the RunningJob? If not, why?
[jira] [Commented] (HIVE-942) use bucketing for group by
[ https://issues.apache.org/jira/browse/HIVE-942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13404730#comment-13404730 ] Lianhui Wang commented on HIVE-942:

I think in HIVE-931 the group-by keys must be the same as the sort keys. But in the case where the group-by keys contain the sort keys, the group by can still be completed using the hash table on the mapper. For example: t is a bucketed table, sorted by c1, c2. SQL: select t.c1, t.c2, t.c3, sum(t.c4) from t group by t.c1, t.c2, t.c3. I think generally that only uses the hash table on the mapper, so nothing needs to be done on the reducer.

use bucketing for group by
--------------------------
Key: HIVE-942
URL: https://issues.apache.org/jira/browse/HIVE-942
Project: Hive
Issue Type: New Feature
Components: Query Processor
Reporter: Namit Jain

A group by on a bucketed column can be completely performed on the mapper if the split can be adjusted to span the key boundary.
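The observation above can be sketched: when the group-by keys contain the sort keys, a group can never span two runs of the sort key, so the mapper's hash table can be flushed (and stays small) whenever the sort key advances, and no reducer is needed. A toy version in plain Python (invented names, SUM standing in for the aggregate; not Hive code):

```python
def grouped_sum(rows):
    """rows sorted by (c1, c2); group by (c1, c2, c3) with SUM(c4).

    Because the group keys (c1, c2, c3) contain the sort keys (c1, c2),
    no group crosses a (c1, c2) boundary: the hash table is flushed each
    time the sort key advances, entirely on the mapper.
    """
    out, table, current = [], {}, None
    for c1, c2, c3, c4 in rows:
        if (c1, c2) != current:
            # sort key advanced: every buffered group is complete
            out.extend((k1, k2, k3, s) for (k1, k2, k3), s in table.items())
            table, current = {}, (c1, c2)
        table[(c1, c2, c3)] = table.get((c1, c2, c3), 0) + c4
    out.extend((k1, k2, k3, s) for (k1, k2, k3), s in table.items())
    return out

rows = [(1, 1, "x", 10), (1, 1, "x", 5), (1, 1, "y", 1), (1, 2, "x", 2)]
result = grouped_sum(rows)
```

The table only ever holds the distinct c3 values within one (c1, c2) run, which is why this stays memory-safe without a reduce stage.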