[jira] [Commented] (HIVE-4014) Hive+RCFile is not doing column pruning and reading much more data than necessary

2013-04-15 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13631853#comment-13631853
 ] 

Ashutosh Chauhan commented on HIVE-4014:


Thanks [~tamastarjanyi] for your investigation across the different version 
combinations.

 Hive+RCFile is not doing column pruning and reading much more data than 
 necessary
 -

 Key: HIVE-4014
 URL: https://issues.apache.org/jira/browse/HIVE-4014
 Project: Hive
  Issue Type: Bug
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli
 Fix For: 0.10.0


 With even simple projection queries, I see that HDFS bytes read counter 
 doesn't show any reduction in the amount of data read.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-4014) Hive+RCFile is not doing column pruning and reading much more data than necessary

2013-04-03 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13621026#comment-13621026
 ] 

Ashutosh Chauhan commented on HIVE-4014:


bq. Okay, I cannot reproduce this on trunk, though I was consistently hitting 
this on hive-0.10. I'll try hive-0.10 again to be sure some other patch fixed 
this.

Can this then be resolved as Cannot Reproduce?



[jira] [Commented] (HIVE-4014) Hive+RCFile is not doing column pruning and reading much more data than necessary

2013-04-03 Thread Tamas Tarjanyi (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13621063#comment-13621063
 ] 

Tamas Tarjanyi commented on HIVE-4014:
--

Meanwhile I have tested the Cloudera 4.2.0 release with the so-called parcel CDH 
4.2.0-1.cdh4.2.0.p0.10
https://ccp.cloudera.com/display/SUPPORT/CDH4+Downloadable+Tarballs
Unfortunately it still seems to have this problem. (I did not do very rigorous 
testing this time.)
As noted above, Apache Hive 0.10.0 seems to be fine.

If you set it to Cannot Reproduce, then I am afraid this bug will not be 
visible to Hadoop users, and some unlucky one will use the wrong version. 
I am also afraid that distributors will not recognize this as a problem. Can you 
set it to Fixed in 0.10.0? Then no more work is needed and the issue can be 
closed, since trunk is fine, but if somebody checks the bug list this will still 
be visible as a possible problem.

Thanks.



[jira] [Commented] (HIVE-4014) Hive+RCFile is not doing column pruning and reading much more data than necessary

2013-03-02 Thread Tamas Tarjanyi (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13591350#comment-13591350
 ] 

Tamas Tarjanyi commented on HIVE-4014:
--

Hi Vinod,

As I stated above:

BAD: CDH4.1.3 - which is using hadoop-2.0.0+556 / hive-0.9.0+158
GOOD: hadoop 1.0.3 / hive 0.10.0 (apache download)
GOOD: hadoop 1.0.4 / hive 0.10.0 (apache download)
Meanwhile I have also tried Hortonworks Data Platform 1.2.1
GOOD: HDP1.2.1 Apache Hadoop 1.1.2-rc3 / Apache Hive 0.10.0

So it seems that the issue is in hive-0.9 now.

My real problem is that both Hortonworks and Cloudera bundle hive-0.9 with 
hadoop-2.x.y, and I wanted to use hadoop-2.x.y with hive-0.10.x rather than hadoop-1.






[jira] [Commented] (HIVE-4014) Hive+RCFile is not doing column pruning and reading much more data than necessary

2013-03-01 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13591058#comment-13591058
 ] 

Vinod Kumar Vavilapalli commented on HIVE-4014:
---

Okay, I cannot reproduce this on trunk, though I was consistently hitting this 
on hive-0.10. I'll try hive-0.10 again to be sure some other patch fixed this.

[~tamastarjanyi], what version are you using?



[jira] [Commented] (HIVE-4014) Hive+RCFile is not doing column pruning and reading much more data than necessary

2013-02-27 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13589226#comment-13589226
 ] 

Lianhui Wang commented on HIVE-4014:


Hi Tamas,
Thank you very much; you are right.
I also think RCFile.Reader is not very efficient.
The read column ids are passed to RCFile.Reader.




[jira] [Commented] (HIVE-4014) Hive+RCFile is not doing column pruning and reading much more data than necessary

2013-02-26 Thread Tamas Tarjanyi (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13587776#comment-13587776
 ] 

Tamas Tarjanyi commented on HIVE-4014:
--

I could not check the code, but my experience is different. See my results below.

set dfs.replication=1;
set mapred.submit.replication=1;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
set hive.exec.max.dynamic.partitions.pernode=1000;

-- # Below command will create an artificial csv input file ~ 1.5Gb with 10m rows
-- # i=0 ; while [ $i -lt 1000 ] ; do echo "$i,$((i%5)),12345678901234567890123456789012345678901234567890,abcdefghjkabcdefghjkabcdefghjkabcdefghjkabcdefghjk,00,,," ; ((i=i+1)) ; done > /mnt/hadoop/RCTEST.csv

-- Load this data into a base table
DROP TABLE RCTEST_CSV;
CREATE TABLE RCTEST_CSV (
  id   BIGINT,
  counter  STRING,
  value2   STRING,
  value3   STRING,
  value4   STRING,
  value5   STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '44' LINES TERMINATED BY '\n';
LOAD DATA LOCAL INPATH '/mnt/hadoop/RCTEST.csv' INTO TABLE RCTEST_CSV;

-- Multiply it into a BAD table. 
-- This will require ~30Gb altogether and will have 200m rows.
DROP TABLE RCTEST_BAD;
CREATE TABLE RCTEST_BAD
STORED AS RCFILE
AS 
  SELECT * FROM RCTEST_CSV
;
INSERT INTO TABLE RCTEST_BAD
SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD
SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD
SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD
SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD
SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD
SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD
SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD
SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD
SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD
SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD
SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD
SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD
SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD
SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD
SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD
SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD
SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD
SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD
SELECT * FROM RCTEST_CSV;

-- Create a table to simulate the expected column pruning.
-- Below will require ~1.5Gb and will also have 200m rows.
DROP TABLE RCTEST_GOOD;
CREATE TABLE RCTEST_GOOD
STORED AS RCFILE
AS 
  SELECT id,counter FROM RCTEST_BAD
;

-- Let's start measuring performance.
-- For this test my cluster had 1 node with 2 disks in a stripe. (The rest of
-- the nodes were decommissioned to eliminate parallel reads across nodes.)
-- Read speed is roughly 110Mb/sec/disk. In total this means 220Mb/sec max.
--
-- Please execute the below OS command to eliminate OS caching every time
-- before you execute any of the below SQL commands!!!
--
-- sync && echo 3 > /proc/sys/vm/drop_caches
--
-- 1.) Should be slow. This is the reference value for full read.
SELECT count(*) FROM RCTEST_BAD;
-- RESULT: 158 sec
--
-- 2.) Should be faster than 1 because of column pruning.
SELECT count(id) FROM RCTEST_BAD;
-- RESULT: 156 sec
-- Actually I believe this is proof that column pruning is not working.
-- But let's go further.
--
-- 3.) Roughly should be like 2 but faster than 1.
SELECT COUNT(counter) FROM RCTEST_BAD;
-- RESULT: 159 sec
--
--
--
-- Let's see how it works with the simulated pruning.
--
-- 4.) Should be faster than 1. In theory it should be roughly the same as 2
-- with proper pruning.
SELECT count(*) from RCTEST_GOOD;
-- RESULT: 29 sec
--
-- 5.) Should be same as 4 since pruning does not work here either.
SELECT count(id) FROM RCTEST_GOOD;
-- RESULT: 31 sec
--
-- 6.) Should be same as 4 since pruning does not work here either.
SELECT count(counter) FROM RCTEST_GOOD;
-- RESULT: 31 sec
--
--
-- SHORT TEST WITH A GOOD VERSION
-- 
-- Now let's see 1. and 2. on a well-working Apache 1.0.4 release.
-- The machine is my desktop notebook, with one hard disk only.
-- Amount of data is ONLY 10m rows now.
-- 2.1.) Should be slow. This is the reference value for full read.
SELECT count(*) FROM RCTEST_BAD;
-- RESULT: 80 sec
--
-- 2.2.) Should be faster than 2.1 because of column pruning.
SELECT count(id) FROM RCTEST_BAD;
-- RESULT: 47 sec
-- YES! That is what I assumed. More data would mean more performance
-- gain, especially on better hardware.
--
-- 2.5.) Should be faster than 2.1. In theory it should be roughly the same
-- as 2.2 with proper pruning.
SELECT count(id) from RCTEST_GOOD;
-- RESULT: 44 sec
-- HERE WE ARE! AS EXPECTED.



[jira] [Commented] (HIVE-4014) Hive+RCFile is not doing column pruning and reading much more data than necessary

2013-02-25 Thread Tamas Tarjanyi (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13586425#comment-13586425
 ] 

Tamas Tarjanyi commented on HIVE-4014:
--

I cannot see affected versions here, but I found the same issue on 

CDH4.1.3 - which is using hadoop-2.0.0+556 / hive-0.9.0+158

Then I downloaded and tested 
hadoop 1.0.3 / hive 0.10.0 and
hadoop 1.0.4 / hive 0.10.0 
Both are working fine and pruning is effective in these cases.




[jira] [Commented] (HIVE-4014) Hive+RCFile is not doing column pruning and reading much more data than necessary

2013-02-25 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13586701#comment-13586701
 ] 

Lianhui Wang commented on HIVE-4014:


I do not think so.
I looked at the code.
In HiveInputFormat's and CombineHiveInputFormat's getRecordReader(), it calls 
pushProjectionsAndFilters().
In pushProjectionsAndFilters(), the needed columns are taken from the 
TableScanOperator and their ids are set into hive.io.file.readcolumn.ids.
RCFile.Reader then reads hive.io.file.readcolumn.ids to skip columns.
Maybe the counter has some mistakes.
If I am mistaken, please tell me. Thanks.
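
The contract Lianhui describes can be sketched as follows. This is an illustrative Python model of the mechanism, not the actual Java code inside Hive: the input format records the projected column ids under the `hive.io.file.readcolumn.ids` configuration key, and a pruning reader consults that key to decide which column chunks of a row group to fetch.

```python
# Illustrative model of the hive.io.file.readcolumn.ids contract
# (Python sketch; the real implementation is Java inside Hive).
READ_COLUMN_IDS_CONF = "hive.io.file.readcolumn.ids"

def push_projections(conf, needed_column_ids):
    """What pushProjectionsAndFilters() does, in spirit: record the
    projected column ids (comma-separated) in the job configuration."""
    conf[READ_COLUMN_IDS_CONF] = ",".join(str(i) for i in needed_column_ids)

def read_row_group(conf, row_group):
    """What a pruning reader should do: fetch only the column chunks
    listed in the configuration; a missing value means read everything."""
    ids_str = conf.get(READ_COLUMN_IDS_CONF, "")
    if not ids_str:
        return row_group                  # no projection pushed: read all columns
    wanted = {int(i) for i in ids_str.split(",")}
    return [col for i, col in enumerate(row_group) if i in wanted]

conf = {}
push_projections(conf, [0, 1])            # e.g. TableScanOperator needs id, counter
group = [[1, 2], ["a", "b"], ["x" * 50] * 2,
         ["y" * 50] * 2, ["00", "00"], ["", ""]]
pruned = read_row_group(conf, group)      # only 2 of 6 column chunks fetched
print(len(pruned))
```

If the reader ignores the configuration key (or the key is never populated for a given code path), it falls back to reading all six column chunks, which would produce exactly the flat HDFS-bytes-read numbers reported in this issue.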



[jira] [Commented] (HIVE-4014) Hive+RCFile is not doing column pruning and reading much more data than necessary

2013-02-12 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13577216#comment-13577216
 ] 

Vinod Kumar Vavilapalli commented on HIVE-4014:
---

I already tracked it down; will upload a patch soon.
