[jira] [Commented] (HIVE-7384) Research into reduce-side join [Spark Branch]

2014-08-21 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14105301#comment-14105301
 ] 

Lianhui Wang commented on HIVE-7384:


I think current Spark already supports hashing by join_col and sorting by {join_col, tag}: 
in Spark, the map side's shuffle writer hashes by Key.hashCode and sorts by Key, and 
in Hive the HiveKey class already defines hashCode. So we can hash by 
HiveKey.hashCode and sort by HiveKey's bytes.
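As a rough illustration of the behavior described above (a Python sketch, not Spark or Hive code; the function and data shapes here are made up for the example), the shuffle partitions rows by a hash of the join key alone, then sorts each partition by the full key, so rows with the same join key are grouped together and the tag orders them within the group:

```python
def shuffle(rows, num_partitions):
    """rows: list of ((join_key, tag), value) pairs.
    Returns a list of partitions; within each partition, rows are
    sorted by (join_key, tag)."""
    partitions = [[] for _ in range(num_partitions)]
    for (join_key, tag), value in rows:
        # Partition by the join key only, as hashing by HiveKey.hashCode would.
        p = hash(join_key) % num_partitions
        partitions[p].append(((join_key, tag), value))
    for part in partitions:
        # Sort by the whole (join_key, tag) key, as sorting by
        # HiveKey's serialized bytes would.
        part.sort(key=lambda kv: kv[0])
    return partitions
```

Because the partitioner ignores the tag, both sides of the join land in the same partition, and the sort brings the lower-tagged (small) side first.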

 Research into reduce-side join [Spark Branch]
 -

 Key: HIVE-7384
 URL: https://issues.apache.org/jira/browse/HIVE-7384
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Xuefu Zhang
Assignee: Szehon Ho
 Attachments: Hive on Spark Reduce Side Join.docx, sales_items.txt, 
 sales_products.txt, sales_stores.txt


 Hive's join operator is very sophisticated, especially for reduce-side join. 
 While we expect that other types of join, such as map-side join and SMB 
 map-side join, will work out of the box with our design, there may be some 
 complication in reduce-side join, which extensively utilizes key tags and 
 shuffle behavior. Our design principle prefers making the Hive implementation 
 work out of the box as well, which might require new functionality from Spark. 
 The task is to research this area, identifying requirements for the Spark 
 community and the work to be done on Hive to make reduce-side join work.
 A design doc might be needed for this. For more information, please refer to 
 the overall design doc on the wiki.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-7384) Research into reduce-side join [Spark Branch]

2014-08-21 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106343#comment-14106343
 ] 

Lianhui Wang commented on HIVE-7384:


@Szehon Ho yes, I read the OrderedRDDFunctions code and discovered that sortByKey 
actually does a range partition. We need to replace the range partition with a hash 
partition, so Spark should perhaps add a new interface, for example: 
partitionSortByKey.
@Brock Noland the code in 1) means that when Hive samples data and there is more 
than one reducer, Hive does a total-order sort. A join does not sample data, so it 
does not need a total-order sort.
2) I think we really need auto-parallelism. As I discussed with Reynold Xin before, 
Spark needs to support re-partitioning map output data the same way Tez does.



[jira] [Commented] (HIVE-7384) Research into reduce-side join [Spark Branch]

2014-08-21 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106407#comment-14106407
 ] 

Lianhui Wang commented on HIVE-7384:


I think these thoughts are the same as the ideas you mentioned before, like HIVE-7158, 
which will auto-calculate the number of reducers based on some input from Hive 
(upper/lower bound).



[jira] [Commented] (HIVE-3430) group by followed by join with the same key should be optimized

2013-07-23 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13717883#comment-13717883
 ] 

Lianhui Wang commented on HIVE-3430:


Yin Huai, very nice work!

 group by followed by join with the same key should be optimized
 ---

 Key: HIVE-3430
 URL: https://issues.apache.org/jira/browse/HIVE-3430
 Project: Hive
  Issue Type: Improvement
  Components: Query Processor
Affects Versions: 0.10.0
Reporter: Namit Jain



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-4506) use one map reduce to join multiple small tables

2013-05-06 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13650371#comment-13650371
 ] 

Lianhui Wang commented on HIVE-4506:


Fern, can you provide your SQL?
If these tables use the same column in the join clause, one MR job is used.
Example:
explain
SELECT /*+mapjoin(src2,src3)*/ src1.key, src3.value FROM src src1 JOIN src src2 
ON (src1.key = src2.key) JOIN src src3 ON (src1.key = src3.key);



 use one map reduce to join multiple small tables 
 -

 Key: HIVE-4506
 URL: https://issues.apache.org/jira/browse/HIVE-4506
 Project: Hive
  Issue Type: Wish
Affects Versions: 0.10.0
Reporter: Fern
Priority: Minor

 I know we can use map-side join for small tables.
 In my test, if I use HQL like this
 --
 select /*+mapjoin(b,c)*/...
 from a
 left join b
 on ...
 left join c
 on ...
 ---
 where b and c are both small tables, I expect the join to be done in one map-reduce 
 job using map-side join. Actually, it generates two map-reduce jobs in sequence.
 Sorry, I am currently just a user of Hive and have not dug into the code, so this 
 is what I expect, but I have no idea how to improve it for now. 



[jira] [Commented] (HIVE-4506) use one map reduce to join multiple small tables

2013-05-06 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13650380#comment-13650380
 ] 

Lianhui Wang commented on HIVE-4506:


If the tables are joined on different columns, HIVE-3784 resolved the case of one big 
table joined with multiple small tables.



[jira] [Commented] (HIVE-4429) Nested ORDER BY produces incorrect result

2013-04-26 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13643453#comment-13643453
 ] 

Lianhui Wang commented on HIVE-4429:


hi, Mihir Kulkarni,
I ran the first SQL of your cases, but on my Hive 0.9 it produces the correct 
result, shown below:
30.0	1.0
20.0	1.0
10.0	1.0
30.0	2.0
20.0	2.0
10.0	2.0
30.0	3.0
20.0	3.0
10.0	3.0
60.0	4.0
50.0	4.0
40.0	4.0
60.0	5.0
50.0	5.0
40.0	5.0
60.0	6.0
50.0	6.0
40.0	6.0

So can you tell me which version you used?



 Nested ORDER BY produces incorrect result
 -

 Key: HIVE-4429
 URL: https://issues.apache.org/jira/browse/HIVE-4429
 Project: Hive
  Issue Type: Bug
  Components: Query Processor, SQL, UDF
Affects Versions: 0.9.0
 Environment: Red Hat Linux VM with Hive 0.9 and Hadoop 2.0
Reporter: Mihir Kulkarni
Priority: Critical
 Attachments: Hive_Command_Script.txt, HiveQuery.txt, Test_Data.txt


 A nested ORDER BY clause doesn't honor the outer one in a specific case.
 The query below produces a result that honors only the inner ORDER BY clause 
 (it produces only 1 MapRed job):
 {code:borderStyle=solid}
 SELECT alias.b0 as d0, alias.b1 as d1
 FROM
 (SELECT test.a0 as b0, test.a1 as b1 
 FROM test
 ORDER BY b1 ASC, b0 DESC) alias
 ORDER BY d0 ASC, d1 DESC;
 {code}
 
 On the other hand, the query below honors the outer ORDER BY clause and 
 produces the correct result (it produces 2 MapRed jobs):
 {code:borderStyle=solid}
 SELECT alias.b0 as d0, alias.b1 as d1
 FROM
 (SELECT test.a0 as b0, test.a1 as b1 
 FROM test
 ORDER BY b1 ASC, b0 DESC) alias
 ORDER BY d0 DESC, d1 DESC;
 {code}
 
 Any other combination of nested ORDER BY clauses does produce the correct 
 result.
 Please see the attachments for the query, schema, and Hive commands for the repro case.



[jira] [Commented] (HIVE-4365) wrong result in left semi join

2013-04-16 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13633664#comment-13633664
 ] 

Lianhui Wang commented on HIVE-4365:


hi ransom,
The problem also exists in my environment. I used an explain statement and found that 
the second SQL's PPD is wrong:
TableScan
  alias: t2
  Filter Operator
    predicate:
        expr: (c1 = 1)
        type: boolean

The PPD optimizer pushes the filter c1='1' down to both table t1 and table t2,
but the correct behavior is to push it to table t1 only, not t2.


 wrong result in left semi join
 --

 Key: HIVE-4365
 URL: https://issues.apache.org/jira/browse/HIVE-4365
 Project: Hive
  Issue Type: Bug
  Components: Query Processor
Affects Versions: 0.9.0, 0.10.0
Reporter: ransom.hezhiqiang

 wrong result in left semi join while hive.optimize.ppd=true
 for example:
 1. create tables:
 create table t1(c1 int,c2 int, c3 int, c4 int, c5 double,c6 int,c7 string) 
   row format DELIMITED FIELDS TERMINATED BY '|';
 create table t2(c1 int);
 2. load data:
 load data local inpath '/home/test/t1.txt' OVERWRITE into table t1;
 load data local inpath '/home/test/t2.txt' OVERWRITE into table t2;
 t1 data:
 1|3|10003|52|781.96|555|201203
 1|3|10003|39|782.96|555|201203
 1|3|10003|87|783.96|555|201203
 2|5|10004|24|789.96|555|201203
 2|5|10004|58|788.96|555|201203
 t2 data:
 555
 3. execute queries:
 select t1.c1,t1.c2,t1.c3,t1.c4,t1.c5,t1.c6,t1.c7  from t1 left semi join t2 
 on t1.c6 = t2.c1 and  t1.c1 =  '1' and t1.c7 = '201203' ;
 This returns results.
 select t1.c1,t1.c2,t1.c3,t1.c4,t1.c5,t1.c6,t1.c7  from t1 left semi join t2 
 on t1.c6 = t2.c1 where t1.c1 =  '1' and t1.c7 = '201203' ;
 This returns no results.



[jira] [Commented] (HIVE-3963) Allow Hive to connect to RDBMS

2013-03-07 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13596639#comment-13596639
 ] 

Lianhui Wang commented on HIVE-3963:


I think it must support an AS clause, like the transform syntax.
For example:
SELECT jdbcload('driver','url','user','password','sql') as c1,c2 FROM dual;

 Allow Hive to connect to RDBMS
 --

 Key: HIVE-3963
 URL: https://issues.apache.org/jira/browse/HIVE-3963
 Project: Hive
  Issue Type: New Feature
  Components: Import/Export, JDBC, SQL, StorageHandler
Affects Versions: 0.9.0, 0.10.0, 0.9.1, 0.11.0
Reporter: Maxime LANCIAUX

 I am thinking about something like :
 SELECT jdbcload('driver','url','user','password','sql') FROM dual;
 There is already a JIRA https://issues.apache.org/jira/browse/HIVE-1555 for 
 JDBCStorageHandler



[jira] [Commented] (HIVE-4137) optimize group by followed by joins for bucketed/sorted tables

2013-03-07 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13596690#comment-13596690
 ] 

Lianhui Wang commented on HIVE-4137:


In addition, for bucketed/sorted tables a single group-by only needs a map-side 
group-by operator; no reduce-side group-by operator is required.
Example:
select key,aggr() from T1 group by key.
The current plan is
TS-SEL-GBY-RS-GBY-SEL-FS
but it can be changed to the following plan:
TS-SEL-GBY-SEL-FS
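The reason the RS-GBY stages can be dropped is that rows read from a bucketed/sorted table arrive already grouped by key. A hypothetical Python sketch (not Hive's actual operator code) of such a map-side streaming group-by:

```python
def streaming_group_by(sorted_rows):
    """sorted_rows: iterable of (key, value) pairs already sorted by key,
    as rows from a bucketed/sorted table are. Yields (key, sum_of_values)
    whenever the key changes, so no shuffle/reduce stage is needed."""
    current_key, total = None, 0
    for key, value in sorted_rows:
        if key != current_key:
            if current_key is not None:
                yield current_key, total
            current_key, total = key, 0
        total += value
    if current_key is not None:
        # Flush the final group.
        yield current_key, total
```

This only works because same-key rows are contiguous; on unsorted input the same key would be emitted more than once.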


 optimize group by followed by joins for bucketed/sorted tables
 --

 Key: HIVE-4137
 URL: https://issues.apache.org/jira/browse/HIVE-4137
 Project: Hive
  Issue Type: Improvement
  Components: Query Processor
Reporter: Namit Jain

 Consider the following scenario:
 create table T1 (...) clustered by (key) sorted by (key) into 2 buckets;
 create table T2 (...) clustered by (key) sorted by (key) into 2 buckets;
 create table T3 (...) clustered by (key) sorted by (key) into 2 buckets;
 SET hive.enforce.sorting=true;
 SET hive.enforce.bucketing=true;
 insert overwrite table T3
 select ..
 from 
 (select key, aggr() from T1 group by key) s1
 full outer join
 (select key, aggr() from T2 group by key) s2
 on s1.key=s2.key;
 Ideally, this query can be performed in a single map-only job.
 Group By - SortMerge Join.



[jira] [Commented] (HIVE-3430) group by followed by join with the same key should be optimized

2013-03-01 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13590321#comment-13590321
 ] 

Lianhui Wang commented on HIVE-3430:


We should also consider the following query:
SELECT a.key, a.cnt, b.key, a.cnt
FROM
(SELECT x.key as key, count(x.value) AS cnt FROM src x group by x.key) a
JOIN src b
ON (a.key = b.key);




[jira] [Commented] (HIVE-4014) Hive+RCFile is not doing column pruning and reading much more data than necessary

2013-02-27 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13589226#comment-13589226
 ] 

Lianhui Wang commented on HIVE-4014:


hi Tamas,
Thank you very much, you are right.
I also think the RCFile reader is not very efficient; the IDs of the columns to 
read are passed down to the RCFile reader.


 Hive+RCFile is not doing column pruning and reading much more data than 
 necessary
 -

 Key: HIVE-4014
 URL: https://issues.apache.org/jira/browse/HIVE-4014
 Project: Hive
  Issue Type: Bug
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli

 With even simple projection queries, I see that HDFS bytes read counter 
 doesn't show any reduction in the amount of data read.



[jira] [Commented] (HIVE-4014) Hive+RCFile is not doing column pruning and reading much more data than necessary

2013-02-25 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13586701#comment-13586701
 ] 

Lianhui Wang commented on HIVE-4014:


I don't think so; I looked at the code.
In HiveInputFormat's and CombineHiveInputFormat's getRecordReader(), 
pushProjectionsAndFilters() is called.
pushProjectionsAndFilters() gets the needed columns from the TableScanOperator 
and sets their IDs in hive.io.file.readcolumn.ids,
and RCFile.Reader then reads hive.io.file.readcolumn.ids to skip columns.
Maybe the counter has some mistakes.
If I am mistaken, please tell me. Thanks.



[jira] [Commented] (HIVE-3420) Inefficiency in hbase handler when process query including rowkey range scan

2012-10-22 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13482075#comment-13482075
 ] 

Lianhui Wang commented on HIVE-3420:


@Gong Deng
Yes, I agree with you. In the InputFormat's getRecordReader(),
tableSplit = convertFilter(jobConf, scan, tableSplit, iKey,
    getStorageFormatOfKey(columnsMapping.get(iKey).mappingSpec,
    jobConf.get(HBaseSerDe.HBASE_TABLE_DEFAULT_STORAGE_TYPE, "string")));
it has already done
tableSplit = new TableSplit(
    tableSplit.getTableName(),
    startRow,
    stopRow,
    tableSplit.getRegionLocation(),
    tableSplit.getConf());
Also, in getSplits() each tableSplit leads to one task at a region location, so right 
now the splits have no effect;
the startRow/stopRow in a tableSplit should lie inside that tableSplit's region row range.

IMO, the convertFilter() logic is used in many places, for example:
HBaseStorageHandler.decomposePredicate()
HiveHBaseTableInputFormat.getSplits()
HiveHBaseTableInputFormat.getRecordReader()

I think it should be used in one place, 
HBaseStorageHandler.decomposePredicate(), which can store the row-key ranges.
Then HiveHBaseTableInputFormat.getSplits() and HiveHBaseTableInputFormat.getRecordReader()
can split the key ranges into tasks according to the table's region info.

Does anyone else have ideas? Thanks.



 Inefficiency in hbase handler when process query including rowkey range scan
 

 Key: HIVE-3420
 URL: https://issues.apache.org/jira/browse/HIVE-3420
 Project: Hive
  Issue Type: Improvement
  Components: HBase Handler
Affects Versions: 0.9.0
 Environment: Hive-0.9.0 + HBase-0.94.1
Reporter: Gang Deng
Priority: Critical
   Original Estimate: 2h
  Remaining Estimate: 2h

 When querying Hive with an HBase rowkey range, the Hive map tasks do not leverage 
 the startrow/endrow information in the tablesplit. For example, if the rowkeys fit 
 into 5 HBase files, then there will be 5 map tasks. Ideally, each task would 
 process 1 file. But in the current implementation, each task processes all 5 files 
 repeatedly. This behavior not only wastes network bandwidth, but also worsens the 
 lock contention in the HBase block cache, as each task has to access the same 
 block. The problem code is in HiveHBaseTableInputFormat.convertFilter, as below:
 ……
 if (tableSplit != null) {
   tableSplit = new TableSplit(
 tableSplit.getTableName(),
 startRow,
 stopRow,
 tableSplit.getRegionLocation());
 }
 scan.setStartRow(startRow);
 scan.setStopRow(stopRow);
 ……
 As tableSplit already includes the startRow/endRow information of the file, a 
 better implementation would be:
 ……
 byte[] splitStart = startRow;
 byte[] splitStop = stopRow;
 if (tableSplit != null) {
   if (tableSplit.getStartRow() != null) {
     splitStart = startRow.length == 0 ||
       Bytes.compareTo(tableSplit.getStartRow(), startRow) >= 0 ?
         tableSplit.getStartRow() : startRow;
   }
   if (tableSplit.getEndRow() != null) {
     splitStop = (stopRow.length == 0 ||
       Bytes.compareTo(tableSplit.getEndRow(), stopRow) <= 0) &&
       tableSplit.getEndRow().length > 0 ?
         tableSplit.getEndRow() : stopRow;
   }
   tableSplit = new TableSplit(
     tableSplit.getTableName(),
     splitStart,
     splitStop,
     tableSplit.getRegionLocation());
 }
 scan.setStartRow(splitStart);
 scan.setStopRow(splitStop);
 ……
 In my test, the changed code improved performance by more than 30%.



[jira] [Commented] (HIVE-1643) support range scans and non-key columns in HBase filter pushdown

2012-10-22 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13482080#comment-13482080
 ] 

Lianhui Wang commented on HIVE-1643:


Ashutosh Chauhan
Is this correct? What about filters on OR conditions and nested filters — do you 
plan to add support for those?
select * from tt where col1 > 23 or (col2 > 2 and col3 = 5) or (col4 = 6 and 
(col5 = 3 or col6 = 7));

I think we need range analysis. In MySQL, the SQL optimizer includes range analysis 
for partitions and indexes, with a binary tree representing the condition ranges.
But there are some difficulties in task splitting,
because there may be many small ranges in one table region. So maybe we should merge 
multiple small ranges within one region and use a row-key filter.
That can reduce the number of visits to a region.
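The range-merging idea above can be sketched as follows (illustrative Python, not HBase handler code; row keys are simplified to comparable values):

```python
def merge_ranges(ranges):
    """ranges: list of (start, stop) row-key ranges within one region.
    Merges overlapping or nested ranges so the region is scanned once per
    merged range instead of once per small range."""
    merged = []
    for start, stop in sorted(ranges):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous merged range: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], stop))
        else:
            merged.append((start, stop))
    return merged
```

Rows inside a merged range that fall in the gaps between the original small ranges would then be dropped by the per-row filter, as the comment suggests.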



 support range scans and non-key columns in HBase filter pushdown
 

 Key: HIVE-1643
 URL: https://issues.apache.org/jira/browse/HIVE-1643
 Project: Hive
  Issue Type: Improvement
  Components: HBase Handler
Affects Versions: 0.9.0
Reporter: John Sichi
Assignee: bharath v
  Labels: patch
 Attachments: hbase_handler.patch, Hive-1643.2.patch, HIVE-1643.patch


 HIVE-1226 added support for WHERE rowkey=3.  We would like to support WHERE 
 rowkey BETWEEN 10 and 20, as well as predicates on non-rowkeys (plus 
 conjunctions etc).  Non-rowkey conditions can't be used to filter out entire 
 ranges, but they can be used to push the per-row filter processing as far 
 down as possible.



[jira] [Commented] (HIVE-3561) Build a full SQL-compliant parser for Hive

2012-10-10 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13473167#comment-13473167
 ] 

Lianhui Wang commented on HIVE-3561:


For the first approach there is a problem: standard SQL cannot support the 
HiveQL that has been written historically,
because some operators differ significantly (join, for example).
So migrating existing HiveQL to standard SQL could take a lot of time.
In my opinion, both may have to co-exist in the short term.

 

 Build a full SQL-compliant parser for Hive
 --

 Key: HIVE-3561
 URL: https://issues.apache.org/jira/browse/HIVE-3561
 Project: Hive
  Issue Type: Sub-task
  Components: Query Processor
Affects Versions: 0.10.0
Reporter: Shengsheng Huang

 To build a full SQL-compliant engine on Hive, we'll need a full SQL-compliant 
 parser. The current Hive parser misses a lot of grammar units from standard 
 SQL. To support full SQL there are possibly four approaches:
 1.Extend the existing Hive parser to support full SQL constructs. We need to 
 modify the current Hive.g and add any missing grammar units and resolve 
 conflicts. 
 2.Reuse an existing open source SQL compliant parser and extend it to support 
 Hive extensions. We may need to adapt Semantic Analyzers to the new AST 
 structure.  
 3.Reuse an existing SQL compliant parser and make it co-exist with the 
 existing Hive parser. Both parsers share the same CliDriver interface. Use a 
 query mode configuration to switch the query mode between SQL and HQL (this 
 is the approach we're now using in the 0.9.0 demo project)
 4.Reuse an existing SQL compliant parser and make it co-exist with the 
 existing Hive parser. Use a separate xxxCliDriver interface for standard SQL. 
  
 Let's discuss which is the best approach. 



[jira] [Commented] (HIVE-3472) Build An Analytical SQL Engine for MapReduce

2012-09-24 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13462391#comment-13462391
 ] 

Lianhui Wang commented on HIVE-3472:


NexR has done some work on translating Oracle SQL to Hive SQL.
The session's slides (from page 18):
http://www.slideshare.net/cloudera/hadoop-world-2011-replacing-rdbdw-with-hadoop-and-hive-for-telco-big-data-jason-han-nexr
I think we should translate Oracle's syntax tree to Hive's syntax tree; that 
may be easy.
Another option is to translate Oracle SQL directly to a Hive query plan, but I 
think that needs more time and work.


 Build An Analytical SQL Engine for MapReduce
 

 Key: HIVE-3472
 URL: https://issues.apache.org/jira/browse/HIVE-3472
 Project: Hive
  Issue Type: New Feature
Affects Versions: 0.10.0
Reporter: Shengsheng Huang
 Attachments: SQL-design.pdf


 While there are continuous efforts in extending Hive’s SQL support (e.g., see 
 some recent examples such as HIVE-2005 and HIVE-2810), many widely used SQL 
 constructs are still not supported in HiveQL, such as selecting from multiple 
 tables, subquery in WHERE clauses, etc.  
 We propose to build a SQL-92 full compatible engine (for MapReduce based 
 analytical query processing) as an extension to Hive. 
 The SQL frontend will co-exist with the HiveQL frontend; consequently, one 
 can  mix SQL and HiveQL statements in their queries (switching between HiveQL 
 mode and SQL-92 mode using a “hive.ql.mode” parameter before each query 
 statement). This way useful Hive extensions are still accessible to users. 



[jira] [Created] (HIVE-3329) Support bucket filtering when where expression or join key expression has the bucket key

2012-08-01 Thread Lianhui Wang (JIRA)
Lianhui Wang created HIVE-3329:
--

 Summary: Support bucket filtering when where expression or join 
key expression has the bucket key 
 Key: HIVE-3329
 URL: https://issues.apache.org/jira/browse/HIVE-3329
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Reporter: Lianhui Wang


HIVE-3306 introduces one such context.
Example:
select /* + MAPJOIN(a) */ count(*) FROM bucket_small a JOIN bucket_big b ON a.key 
+ a.key = b.key
There are also some other contexts; I know of the following examples:
1. the join expression is ON (a.key = b.key and a.key=10);
2. select * from bucket_small where a.key=10;
3. the table is a partitioned table, which may be complex.
Example:
CREATE TABLE srcbucket_part (key string, value string) partitioned by (ds 
string) CLUSTERED BY (key) INTO 4 BUCKETS STORED AS RCFile;
select * from srcbucket_part where key='455' and ds='2008-04-08';
A more complex SQL is:
select * from srcbucket_part where (key='455' and ds='2008-04-08') or 
ds='2008-04-09';
These contexts should not scan all of the table's files, only the relevant bucket 
files in the table path.
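As a sketch of why bucket pruning is possible with a filter like key='455' (illustrative Python; Hive's real bucket hash comes from its ObjectInspector machinery, and Python's hash() merely stands in for it):

```python
def bucket_file_for(key, num_buckets):
    """A row is assigned to bucket hash(key) mod num_buckets, and bucket i
    is stored in the i-th bucket file under the table (or partition) path.
    With an equality filter on the clustered column, only that one bucket
    file out of num_buckets needs to be scanned."""
    return hash(key) % num_buckets
```

For the srcbucket_part example, a scan for key='455' would read only 1 of the 4 bucket files in each matching partition directory.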




--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HIVE-3306) SMBJoin/BucketMapJoin should be allowed only when join key expression is exactly matches with sort/cluster key

2012-08-01 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13427127#comment-13427127
 ] 

Lianhui Wang commented on HIVE-3306:


@Namit, I created a new JIRA, HIVE-3329; maybe there are some tasks there.
I have now finished the work for the case where the table is not partitioned;
next I will work on partitioned tables.

 SMBJoin/BucketMapJoin should be allowed only when join key expression is 
 exactly matches with sort/cluster key
 --

 Key: HIVE-3306
 URL: https://issues.apache.org/jira/browse/HIVE-3306
 Project: Hive
  Issue Type: Bug
  Components: Query Processor
Affects Versions: 0.10.0
Reporter: Navis
Assignee: Navis
Priority: Minor

 CREATE TABLE bucket_small (key int, value string) CLUSTERED BY (key) SORTED 
 BY (key) INTO 2 BUCKETS STORED AS TEXTFILE;
 load data local inpath 
 '/home/navis/apache/oss-hive/data/files/srcsortbucket1outof4.txt' INTO TABLE 
 bucket_small;
 load data local inpath 
 '/home/navis/apache/oss-hive/data/files/srcsortbucket2outof4.txt' INTO TABLE 
 bucket_small;
 CREATE TABLE bucket_big (key int, value string) CLUSTERED BY (key) SORTED BY 
 (key) INTO 4 BUCKETS STORED AS TEXTFILE;
 load data local inpath 
 '/home/navis/apache/oss-hive/data/files/srcsortbucket1outof4.txt' INTO TABLE 
 bucket_big;
 load data local inpath 
 '/home/navis/apache/oss-hive/data/files/srcsortbucket2outof4.txt' INTO TABLE 
 bucket_big;
 load data local inpath 
 '/home/navis/apache/oss-hive/data/files/srcsortbucket3outof4.txt' INTO TABLE 
 bucket_big;
 load data local inpath 
 '/home/navis/apache/oss-hive/data/files/srcsortbucket4outof4.txt' INTO TABLE 
 bucket_big;
 select count(*) FROM bucket_small a JOIN bucket_big b ON a.key + a.key = 
 b.key;
 select /* + MAPJOIN(a) */ count(*) FROM bucket_small a JOIN bucket_big b ON 
 a.key + a.key = b.key;
 returns 116 (same) 
 But with BucketMapJoin or SMBJoin, it returns 61. This should not be 
 allowed because hash(a.key) != hash(a.key + a.key). 
 Bucket context should be utilized only with exact matching join expression 
 with sort/cluster key.





[jira] [Commented] (HIVE-3254) Reuse RunningJob

2012-07-29 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13424569#comment-13424569
 ] 

Lianhui Wang commented on HIVE-3254:


Yes, I think that can be done, but newRj may be null, so you must check for null:
the JobTracker only caches a fixed-size set of completed jobs' info,
so if the job you are fetching has completed, the JT may already have removed its 
information.
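A sketch of the suggested guard (Python with hypothetical stand-ins for the Java JobClient/RunningJob objects, purely to show the null-check logic):

```python
def poll_job(jc, rj, pull_interval, sleep):
    """Poll a running job, refreshing the RunningJob handle each round.
    jc.getJob() may return None once the JobTracker has evicted a completed
    job from its fixed-size cache, so keep the last known handle in that case."""
    while not rj.isComplete():
        sleep(pull_interval)
        new_rj = jc.getJob(rj.getJobID())
        if new_rj is not None:  # JT may have dropped a completed job's info
            rj = new_rj
    return rj
```

The point of the guard is that reusing the stale handle is safe, while dereferencing the null refresh result would crash the progress loop.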

 Reuse RunningJob 
 -

 Key: HIVE-3254
 URL: https://issues.apache.org/jira/browse/HIVE-3254
 Project: Hive
  Issue Type: Bug
Reporter: binlijin

 private MapRedStats progress(ExecDriverTaskHandle th) throws IOException {
   while (!rj.isComplete()) {
     try {
       Thread.sleep(pullInterval);
     } catch (InterruptedException e) {
     }
     RunningJob newRj = jc.getJob(rj.getJobID());
   }
 }
 Should we reuse the RunningJob? If not, why? 





[jira] [Commented] (HIVE-942) use bucketing for group by

2012-07-01 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13404730#comment-13404730
 ] 

Lianhui Wang commented on HIVE-942:
---

I think in HIVE-931 the group-by keys must be the same as the sort keys.
But in the case where the group-by keys contain the sort keys, the group-by may be 
completable using a hash table on the mapper.
For example:
t is a bucketed table, sorted by c1,c2.
SQL: select t.c1,t.c2,t.c3,sum(t.c4) from t group by t.c1,t.c2,t.c3.
I think generally only the hash table on the mapper is needed, so nothing has to 
be done on the reducer.
 

 use bucketing for group by
 --

 Key: HIVE-942
 URL: https://issues.apache.org/jira/browse/HIVE-942
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Reporter: Namit Jain

 Group by on a bucketed column can be completely performed on the mapper if 
 the split can be adjusted to span the key boundary.
