Re: Review Request: HIVE-2144 reduce workload generated by JDBCStatsPublisher

2011-05-20 Thread Ning Zhang

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/765/#review696
---



trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java
https://reviews.apache.org/r/765/#comment1395

Here we need to catch SQLRecoverableException and retry.
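
A minimal sketch of what such a retry could look like, assuming a small hypothetical helper (`RecoverableRetry`, `SqlAction`, `maxAttempts` and `baseWaitMillis` are illustrative names, not taken from the patch). Only `java.sql.SQLRecoverableException` is retried; any other `SQLException` propagates immediately.

```java
import java.sql.SQLException;
import java.sql.SQLRecoverableException;

/** Hypothetical helper for illustration; not part of the HIVE-2144 patch. */
final class RecoverableRetry {

    /** A JDBC action that may fail with a SQLException. */
    interface SqlAction<T> {
        T run() throws SQLException;
    }

    /**
     * Runs the action and retries only when it throws SQLRecoverableException.
     * maxAttempts and baseWaitMillis are assumed tuning knobs.
     */
    static <T> T withRetry(SqlAction<T> action, int maxAttempts, long baseWaitMillis)
            throws SQLException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        for (int attempt = 1; ; attempt++) {
            try {
                return action.run();
            } catch (SQLRecoverableException e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up and surface the last failure
                }
                Thread.sleep(baseWaitMillis * attempt); // linear back-off between attempts
            }
        }
    }
}
```

The publish call would then be wrapped as `RecoverableRetry.withRetry(() -> ..., 3, 1000L)`; whether the connection has to be re-opened before the next attempt depends on the driver and is not covered here.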



trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
https://reviews.apache.org/r/765/#comment1396

If these parameters are present in conf/hive-default.xml, you don't need to set 
them again here, since new JobConf() should read them from hive-default.xml.



trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
https://reviews.apache.org/r/765/#comment1397

The usual use case for aggregateStats() is that the key is the 
prefix (e.g., file_000) of the strings inserted by publishStats, so that all 
keys that match the prefix are aggregated.

Can you add one more test for aggregateStats('file_000')?
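
For illustration, a prefix aggregation over the stats table might be issued as below. This is only a sketch under assumptions: the `LIKE ... ESCAPE '!'` query and the `PrefixAggregation` class are hypothetical and are not claimed to be what aggregateStats() does internally. Note that `_` in a prefix such as `file_000` is itself a LIKE wildcard, so the sketch escapes it.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

/** Hypothetical sketch of prefix aggregation; not the actual aggregator code. */
final class PrefixAggregation {

    // Assumed query shape; table and column names come from the review request.
    static final String AGGREGATE_SQL =
        "SELECT SUM(ROW_COUNT) FROM PARTITION_STAT_TABLE WHERE ID LIKE ? ESCAPE '!'";

    /** Turns a literal key prefix into a LIKE pattern, escaping %, _ and ! with '!'. */
    static String likePrefixPattern(String prefix) {
        StringBuilder sb = new StringBuilder();
        for (char c : prefix.toCharArray()) {
            if (c == '!' || c == '%' || c == '_') {
                sb.append('!');
            }
            sb.append(c);
        }
        return sb.append('%').toString();
    }

    /** Sums ROW_COUNT over all rows whose ID starts with the given prefix. */
    static long aggregate(Connection conn, String prefix) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(AGGREGATE_SQL)) {
            ps.setString(1, likePrefixPattern(prefix));
            try (ResultSet rs = ps.executeQuery()) {
                // getLong returns 0 when SUM is NULL (no matching rows).
                return rs.next() ? rs.getLong(1) : 0L;
            }
        }
    }
}
```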



trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
https://reviews.apache.org/r/765/#comment1398

won't this also change the stats at the 2nd publishStat()?



trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
https://reviews.apache.org/r/765/#comment1399

also add another aggStats for prefix.


- Ning


On 2011-05-19 23:14:26, Tomasz Nykiel wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/765/
 ---
 
 (Updated 2011-05-19 23:14:26)
 
 
 Review request for hive.
 
 
 Summary
 ---
 
 Currently, the JDBCStatsPublisher executes two queries per inserted row of 
 statistics: a first query to check whether the ID was already inserted by 
 another task, and a second query to insert a new row or update the existing one.
 The update case occurs very rarely, since duplicates most likely originate from 
 speculative or failed tasks.
 
 Currently the schema of the stat table is the following:
 
 PARTITION_STAT_TABLE ( ID VARCHAR(255), ROW_COUNT BIGINT ) and does not have 
 any integrity constraints declared.
 
 We amend it to:
 
 PARTITION_STAT_TABLE ( ID VARCHAR(255) PRIMARY KEY , ROW_COUNT BIGINT ).
 
 HIVE-2144 improves performance by greedily performing the insertion 
 statement.
 Instead of executing two queries per inserted row, we execute a single 
 INSERT query.
 In the case of a primary key constraint violation, we perform a single UPDATE 
 query.
 The UPDATE query needs to check whether the currently inserted 
 stats are newer than the ones already in the table.
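
The insert-first flow described above can be sketched roughly as follows. This is a minimal illustration, not the patch itself: the class `GreedyStatsPublish` is hypothetical, "newer" is modelled as a larger ROW_COUNT (an assumption), and a constraint violation is detected either by exception type or by an SQLState in class 23, since drivers differ in which they report.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.SQLIntegrityConstraintViolationException;

/** Hypothetical sketch of insert-first publishing; not the HIVE-2144 patch. */
final class GreedyStatsPublish {

    static final String INSERT_SQL =
        "INSERT INTO PARTITION_STAT_TABLE (ID, ROW_COUNT) VALUES (?, ?)";

    // Assumption: "newer" stats are the ones with the larger ROW_COUNT.
    static final String UPDATE_SQL =
        "UPDATE PARTITION_STAT_TABLE SET ROW_COUNT = ? WHERE ID = ? AND ROW_COUNT < ?";

    /** True if the exception signals an integrity constraint violation. */
    static boolean isConstraintViolation(SQLException e) {
        String state = e.getSQLState();
        return e instanceof SQLIntegrityConstraintViolationException
            || (state != null && state.startsWith("23"));
    }

    /** One INSERT in the common case; one extra UPDATE only on a duplicate ID. */
    static void publish(Connection conn, String id, long rowCount) throws SQLException {
        try (PreparedStatement insert = conn.prepareStatement(INSERT_SQL)) {
            insert.setString(1, id);
            insert.setLong(2, rowCount);
            insert.executeUpdate();
            return; // no duplicate: a single statement was enough
        } catch (SQLException e) {
            if (!isConstraintViolation(e)) {
                throw e; // unrelated failure: do not mask it
            }
        }
        try (PreparedStatement update = conn.prepareStatement(UPDATE_SQL)) {
            update.setLong(1, rowCount);
            update.setString(2, id);
            update.setLong(3, rowCount);
            update.executeUpdate();
        }
    }
}
```

The sketch relies on the PRIMARY KEY constraint on ID being present; without it the INSERT never fails and duplicates accumulate.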
 
 
 This addresses bug HIVE-2144.
 https://issues.apache.org/jira/browse/HIVE-2144
 
 
 Diffs
 -
 
   
   trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java 1125140 
   trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java PRE-CREATION 
 
 Diff: https://reviews.apache.org/r/765/diff
 
 
 Testing
 ---
 
 TestStatsPublisher JUnit test:
 - basic behaviour
 - multiple updates
 - cleanup of the statistics table after aggregation
 
 Standalone testing on the cluster.
 - insert/analyze queries over non-partitioned/partitioned tables
 
 NOTE. For the correct behaviour, the primary_key index needs to be created, 
 or the PARTITION_STAT_TABLE table dropped - which triggers creation of the 
 table with the constraint declared.
 
 
 Thanks,
 
 Tomasz
 




[jira] [Commented] (HIVE-2144) reduce workload generated by JDBCStatsPublisher

2011-05-20 Thread jirapos...@reviews.apache.org (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036707#comment-13036707
 ] 

jirapos...@reviews.apache.org commented on HIVE-2144:
-


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/765/#review696
---



trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java
https://reviews.apache.org/r/765/#comment1395

Here we need to catch SQLRecoverableException and retry.



trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
https://reviews.apache.org/r/765/#comment1396

If these parameters are present in conf/hive-default.xml, you don't need to set 
them again here, since new JobConf() should read them from hive-default.xml.



trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
https://reviews.apache.org/r/765/#comment1397

The usual use case for aggregateStats() is that the key is the 
prefix (e.g., file_000) of the strings inserted by publishStats, so that all 
keys that match the prefix are aggregated.

Can you add one more test for aggregateStats('file_000')?



trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
https://reviews.apache.org/r/765/#comment1398

won't this also change the stats at the 2nd publishStat()?



trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
https://reviews.apache.org/r/765/#comment1399

also add another aggStats for prefix.


- Ning





 reduce workload generated by JDBCStatsPublisher
 ---

 Key: HIVE-2144
 URL: https://issues.apache.org/jira/browse/HIVE-2144
 Project: Hive
  Issue Type: Improvement
Reporter: Ning Zhang
Assignee: Tomasz Nykiel
 Attachments: HIVE-2144.patch


 In JDBCStatsPublisher, we first try a SELECT query to see if the specific ID 
 was inserted by another task (most likely a speculative or previously 
 failed task). Depending on whether the ID is there, an INSERT or UPDATE query is 
 issued. So there are basically two queries per row inserted into the 
 intermediate stats table. This workload could be halved if we insert 
 the row anyway (it is very rare that IDs are duplicated) and use a different SQL 
 query in the aggregation phase to dedup the ID 

[jira] [Commented] (HIVE-2036) Update bitmap indexes for automatic usage

2011-05-20 Thread Marquis Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036709#comment-13036709
 ] 

Marquis Wang commented on HIVE-2036:


Russell is right. hive.index.compact.file is deprecated and replaced with 
hive.index.blockfilter.file (I think). I kept the former around for 
backwards-compatibility reasons, but we should try to avoid using it.

 Update bitmap indexes for automatic usage
 -

 Key: HIVE-2036
 URL: https://issues.apache.org/jira/browse/HIVE-2036
 Project: Hive
  Issue Type: Improvement
  Components: Indexing
Affects Versions: 0.8.0
Reporter: Russell Melick
Assignee: Syed S. Albiz

 HIVE-1644 will provide automatic usage of indexes, and HIVE-1803 adds bitmap 
 index support.  The bitmap code will need to be extended after it is 
 committed to enable automatic use of indexing.  Most work will be focused in 
 the BitmapIndexHandler, which needs to generate the re-entrant QL index 
 query.  There may also be significant work in the IndexPredicateAnalyzer to 
 support predicates with OR's, instead of just AND's as it is currently.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


Review Request: HIVE-2117: insert overwrite ignoring partition location

2011-05-20 Thread Patrick Hunt

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/772/
---

Review request for hive and Carl Steinbach.


Summary
---

This change resolves a regression introduced by HIVE-1707, specifically that 
the partition location (set via alter table partition location) is not being 
respected.

I addressed this by using the user specified location (as done originally), 
except in the case with cross-filesystem moves (which was the concern in 1707).
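
One possible reading of this rule, sketched with plain `java.net.URI` instead of Hadoop's `Path` so that it stands alone. The class `PartitionTarget` and the exact cross-filesystem behaviour (keep the partition's path, but take scheme and authority from the load path) are assumptions made for illustration, not a description of the attached patch.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Objects;

/** Hypothetical sketch of choosing the target directory for a partition load. */
final class PartitionTarget {

    /** Two URIs are on the same filesystem if scheme and authority both match. */
    static boolean sameFileSystem(URI a, URI b) {
        return Objects.equals(a.getScheme(), b.getScheme())
            && Objects.equals(a.getAuthority(), b.getAuthority());
    }

    /**
     * Honors the user-specified partition location when the load path is on
     * the same filesystem; otherwise (assumed behaviour) keeps the partition's
     * path but places it on the load path's filesystem.
     * Note: a location without a scheme is treated as a different filesystem
     * from a fully qualified load path in this simplified model.
     */
    static URI resolve(URI loadPath, URI partitionLocation) throws URISyntaxException {
        if (sameFileSystem(loadPath, partitionLocation)) {
            return partitionLocation;
        }
        return new URI(loadPath.getScheme(), loadPath.getAuthority(),
                       partitionLocation.getPath(), null, null);
    }
}
```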


This addresses bug HIVE-2117.
https://issues.apache.org/jira/browse/HIVE-2117


Diffs
-

  ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java 916b235 
  ql/src/test/org/apache/hadoop/hive/ql/BaseTestQueries.java PRE-CREATION 
  ql/src/test/org/apache/hadoop/hive/ql/QTestUtil.java 4685471 
  ql/src/test/org/apache/hadoop/hive/ql/TestLocationQueries.java PRE-CREATION 
  ql/src/test/org/apache/hadoop/hive/ql/TestMTQueries.java 8c7c0b8 
  ql/src/test/queries/clientpositive/alter5.q PRE-CREATION 
  ql/src/test/results/clientpositive/alter5.q.out PRE-CREATION 

Diff: https://reviews.apache.org/r/772/diff


Testing
---

I added a new test which verifies partition location explicitly - as the 
existing tests ignore this detail. This test failed w/o my fix applied, it 
passes with the fix applied.


Thanks,

Patrick



[jira] [Commented] (HIVE-2117) insert overwrite ignoring partition location

2011-05-20 Thread jirapos...@reviews.apache.org (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036908#comment-13036908
 ] 

jirapos...@reviews.apache.org commented on HIVE-2117:
-


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/772/
---

Review request for hive and Carl Steinbach.


Summary
---

This change resolves a regression introduced by HIVE-1707, specifically that 
the partition location (set via alter table partition location) is not being 
respected.

I addressed this by using the user specified location (as done originally), 
except in the case with cross-filesystem moves (which was the concern in 1707).


This addresses bug HIVE-2117.
https://issues.apache.org/jira/browse/HIVE-2117


Diffs
-

  ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java 916b235 
  ql/src/test/org/apache/hadoop/hive/ql/BaseTestQueries.java PRE-CREATION 
  ql/src/test/org/apache/hadoop/hive/ql/QTestUtil.java 4685471 
  ql/src/test/org/apache/hadoop/hive/ql/TestLocationQueries.java PRE-CREATION 
  ql/src/test/org/apache/hadoop/hive/ql/TestMTQueries.java 8c7c0b8 
  ql/src/test/queries/clientpositive/alter5.q PRE-CREATION 
  ql/src/test/results/clientpositive/alter5.q.out PRE-CREATION 

Diff: https://reviews.apache.org/r/772/diff


Testing
---

I added a new test which verifies partition location explicitly - as the 
existing tests ignore this detail. This test failed w/o my fix applied, it 
passes with the fix applied.


Thanks,

Patrick



 insert overwrite ignoring partition location
 

 Key: HIVE-2117
 URL: https://issues.apache.org/jira/browse/HIVE-2117
 Project: Hive
  Issue Type: Bug
Affects Versions: 0.7.0, 0.8.0
Reporter: Patrick Hunt
Assignee: Patrick Hunt
Priority: Blocker
 Attachments: HIVE-2117_br07.patch, HIVE-2117_br07.patch, 
 HIVE-2117_trunk.patch, data.txt


 The following code works differently in 0.5.0 vs 0.7.0.
 In 0.5.0 the partition location is respected. 
 However, in 0.7.0, while the initial partition is created with the specified 
 location path/parta, the insert overwrite ... results in the partition being 
 written to path/dt=a (note that path is the same in both cases).
 {code}
 create table foo_stg (bar INT, car INT); 
 load data local inpath 'data.txt' into table foo_stg;
  
 create table foo4 (bar INT, car INT) partitioned by (dt STRING) LOCATION 
 '/user/hive/warehouse/foo4'; 
 alter table foo4 add partition (dt='a') location 
 '/user/hive/warehouse/foo4/parta';
  
 from foo_stg fs insert overwrite table foo4 partition (dt='a') select *;
 {code}
 From what I can tell HIVE-1707 introduced this via a change to
 org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Path, String, 
 Map<String, String>, boolean, boolean)
 specifically:
 {code}
 +  Path partPath = new Path(tbl.getDataLocation().getPath(),
 +  Warehouse.makePartPath(partSpec));
 +
 +  Path newPartPath = new Path(loadPath.toUri().getScheme(), loadPath
 +  .toUri().getAuthority(), partPath.toUri().getPath());
 {code}
 Reading the description on HIVE-1707 it seems that this may have been done 
 purposefully; however, given that the partition location is explicitly specified 
 for the partition in question, it seems like that should be honored (esp. given 
 the table location has not changed).
 This difference in behavior is causing a regression in existing production 
 Hive based code. I'd like to take a stab at addressing this, any suggestions?



[jira] [Commented] (HIVE-2117) insert overwrite ignoring partition location

2011-05-20 Thread Patrick Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036912#comment-13036912
 ] 

Patrick Hunt commented on HIVE-2117:


I posted reviews up on reviewboard:
trunk: https://reviews.apache.org/r/773/
branch-0.7: https://reviews.apache.org/r/772/




[jira] [Commented] (HIVE-2117) insert overwrite ignoring partition location

2011-05-20 Thread jirapos...@reviews.apache.org (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036914#comment-13036914
 ] 

jirapos...@reviews.apache.org commented on HIVE-2117:
-


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/773/
---

Review request for hive and Carl Steinbach.


Summary
---

This change resolves a regression introduced by HIVE-1707, specifically that 
the partition location (set via alter table partition location) is not being 
respected.

I addressed this by using the user specified location (as done originally), 
except in the case with cross-filesystem moves (which was the concern in 1707).


This addresses bug HIVE-2117.
https://issues.apache.org/jira/browse/HIVE-2117


Diffs
-

  ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java bcacd35 
  ql/src/test/org/apache/hadoop/hive/ql/BaseTestQueries.java PRE-CREATION 
  ql/src/test/org/apache/hadoop/hive/ql/QTestUtil.java 06a0447 
  ql/src/test/org/apache/hadoop/hive/ql/TestLocationQueries.java PRE-CREATION 
  ql/src/test/org/apache/hadoop/hive/ql/TestMTQueries.java 8c7c0b8 
  ql/src/test/queries/clientpositive/alter5.q PRE-CREATION 
  ql/src/test/results/clientpositive/alter5.q.out PRE-CREATION 

Diff: https://reviews.apache.org/r/773/diff


Testing
---

I added a new test which verifies partition location explicitly - as the 
existing tests ignore this detail. This test failed w/o my fix applied, it 
passes with the fix applied.


Thanks,

Patrick





[jira] [Commented] (HIVE-2144) reduce workload generated by JDBCStatsPublisher

2011-05-20 Thread jirapos...@reviews.apache.org (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036932#comment-13036932
 ] 

jirapos...@reviews.apache.org commented on HIVE-2144:
-


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/765/#review702
---



trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java
https://reviews.apache.org/r/765/#comment1402

Yes. That's correct.



trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
https://reviews.apache.org/r/765/#comment1403

ok.



trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
https://reviews.apache.org/r/765/#comment1404

I will amend the test cases to aggregate over prefixes. I will also add one 
simple test case to aggregate over exact match.



trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
https://reviews.apache.org/r/765/#comment1405

The original value inserted in line 120 is 200. Neither 100, nor 150 should 
change the values. 



trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
https://reviews.apache.org/r/765/#comment1406

As discussed before, I will improve the test cases to aggregate over 
prefixes.


- Tomasz





 reduce workload generated by JDBCStatsPublisher
 ---

 Key: HIVE-2144
 URL: https://issues.apache.org/jira/browse/HIVE-2144
 Project: Hive
  Issue Type: Improvement
Reporter: Ning Zhang
Assignee: Tomasz Nykiel
 Attachments: HIVE-2144.patch


 In JDBCStatsPublisher, we first try a SELECT query to see if the specific ID 
 was inserted by another task (most likely a speculative or previously 
 failed task). Depending on whether the ID is there, an INSERT or UPDATE query is 
 issued. So there are basically two queries per row inserted into the 
 intermediate stats table. This workload could be halved if we insert 
 the row anyway (it is very rare that IDs are duplicated) and use a different SQL 
 query in the aggregation phase to dedup the ID (e.g., using group-by and 
 max()). The benefit is that even though the aggregation query is more 
 expensive, it is only run once per query. 
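
As a rough in-memory model of the group-by/max() idea (not the SQL itself, and not code from the patch): duplicate IDs are collapsed to their largest ROW_COUNT, and the surviving values are summed over all IDs matching a prefix. `DedupAggregation` and `Row` are hypothetical names; the equivalent query is assumed to look like `SELECT SUM(m) FROM (SELECT ID, MAX(ROW_COUNT) AS m FROM PARTITION_STAT_TABLE WHERE ID LIKE ... GROUP BY ID) t`.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory model of dedup-at-aggregation; illustration only. */
final class DedupAggregation {

    /** One row of the intermediate stats table. */
    static final class Row {
        final String id;
        final long rowCount;

        Row(String id, long rowCount) {
            this.id = id;
            this.rowCount = rowCount;
        }
    }

    /** Keeps MAX(rowCount) per ID (the GROUP BY step), then sums the maxima. */
    static long aggregate(List<Row> rows, String prefix) {
        Map<String, Long> maxPerId = new HashMap<>();
        for (Row r : rows) {
            if (r.id.startsWith(prefix)) {
                maxPerId.merge(r.id, r.rowCount, Long::max);
            }
        }
        long total = 0L;
        for (long max : maxPerId.values()) {
            total += max;
        }
        return total;
    }
}
```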


Re: Review Request: HIVE-2144 reduce workload generated by JDBCStatsPublisher

2011-05-20 Thread Tomasz Nykiel

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/765/#review702
---



trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java
https://reviews.apache.org/r/765/#comment1402

Yes. That's correct.



trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
https://reviews.apache.org/r/765/#comment1403

ok.



trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
https://reviews.apache.org/r/765/#comment1404

I will amend the test cases to aggregate over prefixes. I will also add one 
simple test case to aggregate over exact match.



trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
https://reviews.apache.org/r/765/#comment1405

The original value inserted in line 120 is 200. Neither 100, nor 150 should 
change the values. 



trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
https://reviews.apache.org/r/765/#comment1406

As discussed before, I will improve the test cases to aggregate over 
prefixes.


- Tomasz






[jira] [Updated] (HIVE-2096) throw a error if the input is larger than a threshold for index input format

2011-05-20 Thread He Yongqiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-2096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

He Yongqiang updated HIVE-2096:
---

Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed! Thanks Wojciech! 
(still trying to figure out how to assign this task to Wojciech :) )

 throw a error if the input is larger than a threshold for index input format
 

 Key: HIVE-2096
 URL: https://issues.apache.org/jira/browse/HIVE-2096
 Project: Hive
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Namit Jain
 Attachments: HIVE-2096.1.patch.txt, HIVE-2096.2.patch.txt, 
 HIVE-2096.3.patch.txt, HIVE-2096.4.patch.txt


 This can hang for ever.



[jira] [Updated] (HIVE-2100) virtual column references inside subqueries cause execution exceptions

2011-05-20 Thread He Yongqiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

He Yongqiang updated HIVE-2100:
---

Status: Patch Available  (was: Open)

 virtual column references inside subqueries cause execution exceptions
 --

 Key: HIVE-2100
 URL: https://issues.apache.org/jira/browse/HIVE-2100
 Project: Hive
  Issue Type: Bug
Reporter: Joydeep Sen Sarma
 Attachments: HIVE-2100.txt


 example:
 create table jssarma_nilzma_bad as select a.fname, a.offset, a.val from 
 (select 
 hash(eventid,userid,eventtime,browsercookie,userstate,useragent,userip,serverip,clienttime,geoid,countrycode\
 ,actionid,lastimpressionid,lastnavimpressionid,impressiontype,fullurl,fullreferrer,pagesection,modulesection,adsection)
  as val, INPUT__FILE__NAME as fname, BLOCK__OFFSET__INSIDE__FILE as offset 
 from nectar_impression_lzma_unverified where ds='2010-07-28') a join 
 jssarma_hc_diff b on (a.val=b.val);
 causes
 Caused by: java.lang.RuntimeException: Map operator initialization failed
   at org.apache.hadoop.hive.ql.exec.ExecMapper.configure(ExecMapper.java:121)
   ... 18 more
 Caused by: java.lang.RuntimeException: cannot find field input__file__name from
 [org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@664310d0,
  org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@3d04fc23,
  org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@12457d21,
  org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@101a0ae6,
  org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@1dc18a4c,
  org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@d5e92d7,
  org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@3bfa681c,
  org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@34c92507,
  org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@19e09a4,
  org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@2e8aeed0,
  org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@2344b18f,
  org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@72e5355f,
  org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@26132ae7,
  org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@3465b738,
  org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@1dfd868,
  org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@ef894ce,
  org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@61f1680f,
  org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@2fe6e305,
  org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@5f4275d4,
  org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@445e228,
  org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@802b249]
   at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.getStandardStructFieldRef(ObjectInspectorUtils.java:321)
   at org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector.getStructFieldRef(UnionStructObjectInspector.java:96)
   at org.apache.hadoop.hive.ql.exec.ExprNodeColumnEvaluator.initialize(ExprNodeColumnEvaluator.java:57)
   at org.apache.hadoop.hive.ql.exec.Operator.initEvaluators(Operator.java:878)
   at org.apache.hadoop.hive.ql.exec.Operator.initEvaluatorsAndReturnStruct(Operator.java:904)
   at org.apache.hadoop.hive.ql.exec.SelectOperator.initializeOp(SelectOperator.java:60)
   at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
   at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:433)
   at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:389)
   at org.apache.hadoop.hive.ql.exec.FilterOperator.initializeOp(FilterOperator.java:73)
   at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
   at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:433)
   at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:389)
   at org.apache.hadoop.hive.ql.exec.TableScanOperator.initializeOp(TableScanOperator.java:133)
   at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
   at org.apache.hadoop.hive.ql.exec.MapOperator.initializeOp(MapOperator.java:444)
   at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
   

[jira] [Commented] (HIVE-2100) virtual column references inside subqueries cause execution exceptions

2011-05-20 Thread He Yongqiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13036946#comment-13036946
 ] 

He Yongqiang commented on HIVE-2100:


Running tests; also putting this into the Patch Available (PA) queue.


[jira] [Updated] (HIVE-2117) insert overwrite ignoring partition location

2011-05-20 Thread Patrick Hunt (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Hunt updated HIVE-2117:
---

Status: Patch Available  (was: Open)

 insert overwrite ignoring partition location
 

 Key: HIVE-2117
 URL: https://issues.apache.org/jira/browse/HIVE-2117
 Project: Hive
  Issue Type: Bug
Affects Versions: 0.7.0, 0.8.0
Reporter: Patrick Hunt
Assignee: Patrick Hunt
Priority: Blocker
 Attachments: HIVE-2117_br07.patch, HIVE-2117_br07.patch, 
 HIVE-2117_trunk.patch, data.txt


 The following code works differently in 0.5.0 vs 0.7.0.
 In 0.5.0 the partition location is respected. 
 However, in 0.7.0, while the initial partition is created with the specified 
 location path/parta, the insert overwrite ... results in the partition being 
 written to path/dt=a (note that path is the same in both cases).
 {code}
 create table foo_stg (bar INT, car INT); 
 load data local inpath 'data.txt' into table foo_stg;
  
 create table foo4 (bar INT, car INT) partitioned by (dt STRING) LOCATION 
 '/user/hive/warehouse/foo4'; 
 alter table foo4 add partition (dt='a') location 
 '/user/hive/warehouse/foo4/parta';
  
 from foo_stg fs insert overwrite table foo4 partition (dt='a') select *;
 {code}
 From what I can tell HIVE-1707 introduced this via a change to
 org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Path, String, 
 Map<String, String>, boolean, boolean)
 specifically:
 {code}
 +  Path partPath = new Path(tbl.getDataLocation().getPath(),
 +  Warehouse.makePartPath(partSpec));
 +
 +  Path newPartPath = new Path(loadPath.toUri().getScheme(), loadPath
 +  .toUri().getAuthority(), partPath.toUri().getPath());
 {code}
 Reading the description of HIVE-1707, it seems that this may have been done 
 purposefully. However, given that the partition location is explicitly specified 
 for the partition in question, it seems like that should be honored (especially 
 given that the table location has not changed).
 This difference in behavior is causing a regression in existing production 
 Hive-based code. I'd like to take a stab at addressing this; any suggestions?
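
 One way to read the behavior requested above, as a rough sketch rather than the actual patch: if the partition already has an explicit location recorded, keep it, and only fall back to the table location plus the default dt=a style directory when none exists. The names below (PartitionPathSketch, resolvePartitionPath, defaultPartName) are hypothetical, and plain strings stand in for Hive's Path and Warehouse classes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; this is not Hive.loadPartition() from the Hive code base.
class PartitionPathSketch {

    // Builds the default "key=value/key=value" directory name for a partition spec.
    // Simplified stand-in for Warehouse.makePartPath; no escaping is performed.
    static String defaultPartName(Map<String, String> partSpec) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : partSpec.entrySet()) {
            if (sb.length() > 0) {
                sb.append('/');
            }
            sb.append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    // Prefers the location recorded for an existing partition; otherwise
    // derives the default path under the table location.
    static String resolvePartitionPath(String tableLocation,
                                       String existingPartLocation,
                                       Map<String, String> partSpec) {
        if (existingPartLocation != null && !existingPartLocation.isEmpty()) {
            return existingPartLocation;
        }
        return tableLocation + "/" + defaultPartName(partSpec);
    }

    public static void main(String[] args) {
        Map<String, String> spec = new LinkedHashMap<>();
        spec.put("dt", "a");
        // Partition added with an explicit location: that location is kept.
        System.out.println(resolvePartitionPath(
            "/user/hive/warehouse/foo4", "/user/hive/warehouse/foo4/parta", spec));
        // No explicit location: default directory under the table location.
        System.out.println(resolvePartitionPath(
            "/user/hive/warehouse/foo4", null, spec));
    }
}
```

 With the foo4 example from the description, the first call keeps /user/hive/warehouse/foo4/parta, while the second falls back to the table location plus dt=a.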

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2117) insert overwrite ignoring partition location

2011-05-20 Thread Carl Steinbach (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13037096#comment-13037096
 ] 

Carl Steinbach commented on HIVE-2117:
--

+1. Will commit if tests pass.



[jira] [Commented] (HIVE-2175) NPE if zookeeper is down

2011-05-20 Thread Namit Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13037166#comment-13037166
 ] 

Namit Jain commented on HIVE-2175:
--

We should throw a more meaningful exception.
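
As a rough illustration of what a more meaningful exception could look like (a sketch only; it assumes the NPE comes from a missing ZooKeeper client handle, which the log line does not confirm, and every name below is hypothetical rather than taken from ZooKeeperHiveLockManager):

```java
// Hypothetical sketch; names are illustrative and not from the Hive code base.
class LockClientSketch {

    // Descriptive checked exception raised instead of a bare NullPointerException.
    static class LockUnavailableException extends Exception {
        LockUnavailableException(String message) {
            super(message);
        }
    }

    // zkClient stands in for whatever handle the lock manager holds;
    // null models "ZooKeeper is down or was never connected" (an assumption).
    static String acquire(Object zkClient, String lockPath)
            throws LockUnavailableException {
        if (zkClient == null) {
            throw new LockUnavailableException(
                "Cannot acquire lock " + lockPath
                + ": no ZooKeeper connection (is the quorum reachable?)");
        }
        // A real implementation would create the lock node here.
        return "locked:" + lockPath;
    }
}
```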


 NPE if zookeeper is down
 

 Key: HIVE-2175
 URL: https://issues.apache.org/jira/browse/HIVE-2175
 Project: Hive
  Issue Type: Bug
Reporter: Namit Jain
Assignee: He Yongqiang

 ERROR ZooKeeperHiveLockManager (ZooKeeperHiveLockManager.java:lock(337)) - 
 Failed to get ZooKeeper lock: java.lang.NullPointerException



[jira] [Created] (HIVE-2175) NPE if zookeeper is down

2011-05-20 Thread Namit Jain (JIRA)
NPE if zookeeper is down


 Key: HIVE-2175
 URL: https://issues.apache.org/jira/browse/HIVE-2175
 Project: Hive
  Issue Type: Bug
Reporter: Namit Jain
Assignee: He Yongqiang


ERROR ZooKeeperHiveLockManager (ZooKeeperHiveLockManager.java:lock(337)) - 
Failed to get ZooKeeper lock: java.lang.NullPointerException



[jira] [Created] (HIVE-2176) Schema creation scripts are incomplete since they leave out tables that are specific to DataNucleus

2011-05-20 Thread Esteban Gutierrez (JIRA)
Schema creation scripts are incomplete since they leave out tables that are 
specific to DataNucleus
---

 Key: HIVE-2176
 URL: https://issues.apache.org/jira/browse/HIVE-2176
 Project: Hive
  Issue Type: Bug
  Components: Configuration, Metastore
Affects Versions: 0.7.0
Reporter: Esteban Gutierrez


When using the DDL SQL scripts to create the Metastore, tables like 
SEQUENCE_TABLE are missing, which forces the user to change the configuration so 
that DataNucleus does all the provisioning of the Metastore tables. Adding the 
missing table definitions to the DDL scripts would allow a functional 
Hive Metastore without granting additional privileges to the Metastore user 
and/or enabling the datanucleus.autoCreateSchema property in hive-site.xml.





[jira] [Updated] (HIVE-2176) Schema creation scripts are incomplete since they leave out tables that are specific to DataNucleus

2011-05-20 Thread Esteban Gutierrez (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-2176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Esteban Gutierrez updated HIVE-2176:


Description: 
When using the DDL SQL scripts to create the Metastore, tables like 
SEQUENCE_TABLE are missing, which forces the user to change the configuration so 
that DataNucleus does all the provisioning of the Metastore tables. Adding the 
missing table definitions to the DDL scripts would allow a functional 
Hive Metastore without granting additional privileges to the Metastore user 
and/or enabling the datanucleus.autoCreateSchema property in hive-site.xml.


[After running hive-schema-0.7.0.mysql.sql and revoking the ALTER and CREATE 
privileges from 'metastoreuser']

hive> show tables; 
FAILED: Error in metadata: javax.jdo.JDOException: Exception thrown calling 
table.exists() for `SEQUENCE_TABLE` 
NestedThrowables: 
com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: CREATE command 
denied to user 'metastoreuser'@'localhost' for table 'SEQUENCE_TABLE' 
FAILED: Execution Error, return code 1 from 
org.apache.hadoop.hive.ql.exec.DDLTask



  was:
When using the DDL SQL scripts to create the Metastore, tables like 
SEQUENCE_TABLE are missing, which forces the user to change the configuration so 
that DataNucleus does all the provisioning of the Metastore tables. Adding the 
missing table definitions to the DDL scripts would allow a functional 
Hive Metastore without granting additional privileges to the Metastore user 
and/or enabling the datanucleus.autoCreateSchema property in hive-site.xml.




 Schema creation scripts are incomplete since they leave out tables that are 
 specific to DataNucleus
 ---

 Key: HIVE-2176
 URL: https://issues.apache.org/jira/browse/HIVE-2176
 Project: Hive
  Issue Type: Bug
  Components: Configuration, Metastore
Affects Versions: 0.7.0
Reporter: Esteban Gutierrez
  Labels: derby, mysql, postgres

 When using the DDL SQL scripts to create the Metastore, tables like 
 SEQUENCE_TABLE are missing, which forces the user to change the configuration so 
 that DataNucleus does all the provisioning of the Metastore tables. Adding 
 the missing table definitions to the DDL scripts would allow a 
 functional Hive Metastore without granting additional privileges to the 
 Metastore user and/or enabling the datanucleus.autoCreateSchema property in 
 hive-site.xml.
 [After running hive-schema-0.7.0.mysql.sql and revoking the ALTER and CREATE 
 privileges from 'metastoreuser']
 hive> show tables; 
 FAILED: Error in metadata: javax.jdo.JDOException: Exception thrown calling 
 table.exists() for `SEQUENCE_TABLE` 
 NestedThrowables: 
 com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: CREATE command 
 denied to user 'metastoreuser'@'localhost' for table 'SEQUENCE_TABLE' 
 FAILED: Execution Error, return code 1 from 
 org.apache.hadoop.hive.ql.exec.DDLTask



[jira] [Commented] (HIVE-2144) reduce workload generated by JDBCStatsPublisher

2011-05-20 Thread jirapos...@reviews.apache.org (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13037221#comment-13037221
 ] 

jirapos...@reviews.apache.org commented on HIVE-2144:
-


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/765/
---

(Updated 2011-05-21 01:49:07.819494)


Review request for hive.


Changes
---

-Amended the test cases to accommodate prefix aggregation
-Fixed unnecessary conf settings
-Fixed exception handling in JDBCStatsPublisher.publishStats - 
SQLRecoverableException is handled when executing the update statement.


Summary
---

Currently, the JDBCStatsPublisher executes two queries per inserted row of 
statistics: the first checks whether the ID was already inserted by another task, 
and the second inserts a new row or updates the existing one.
The update case occurs very rarely, since duplicates most likely originate from 
speculative or failed tasks.

Currently the schema of the stat table is the following:

PARTITION_STAT_TABLE ( ID VARCHAR(255), ROW_COUNT BIGINT ) and does not have 
any integrity constraints declared.

We amend it to:

PARTITION_STAT_TABLE ( ID VARCHAR(255) PRIMARY KEY , ROW_COUNT BIGINT ).

HIVE-2144 improves performance by greedily performing the insertion 
statement.
Instead of executing two queries per inserted row, we then execute a single 
INSERT query.
In the case of a primary key constraint violation, we perform a single UPDATE query.
The UPDATE query needs to check whether the currently inserted stats 
are newer than the ones already in the table.
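
The insert-first flow can be sketched as follows. This is only an illustration under stated assumptions, not the code in the patch: an in-memory map stands in for PARTITION_STAT_TABLE and its primary key, all class and method names are hypothetical, and treating a larger ROW_COUNT as "newer" is an assumption of the sketch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for PARTITION_STAT_TABLE with a primary key on ID.
class StatTableSketch {

    static class DuplicateKeyException extends Exception {
        DuplicateKeyException(String id) {
            super("duplicate primary key: " + id);
        }
    }

    private final Map<String, Long> rows = new HashMap<>();
    int statementsExecuted = 0; // counts simulated SQL statements

    // Models INSERT: fails when the ID already exists (primary key violation).
    void insert(String id, long rowCount) throws DuplicateKeyException {
        statementsExecuted++;
        if (rows.containsKey(id)) {
            throw new DuplicateKeyException(id);
        }
        rows.put(id, rowCount);
    }

    // Models a conditional UPDATE that only overwrites older stats.
    // Assumption: a larger row count means newer stats.
    boolean updateIfNewer(String id, long rowCount) {
        statementsExecuted++;
        Long current = rows.get(id);
        if (current != null && current < rowCount) {
            rows.put(id, rowCount);
            return true;
        }
        return false;
    }

    // Insert greedily; fall back to a single UPDATE only on a key violation.
    void publish(String id, long rowCount) {
        try {
            insert(id, rowCount);
        } catch (DuplicateKeyException e) {
            updateIfNewer(id, rowCount);
        }
    }

    Long get(String id) {
        return rows.get(id);
    }
}
```

In the common case (no duplicate ID) publish() costs one statement; only a duplicate costs two.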


This addresses bug HIVE-2144.
https://issues.apache.org/jira/browse/HIVE-2144


Diffs (updated)
-

  
trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java 
1125468 
  trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java 
PRE-CREATION 

Diff: https://reviews.apache.org/r/765/diff


Testing
---

TestStatsPublisher JUnit test:
- basic behaviour
- multiple updates
- cleanup of the statistics table after aggregation

Standalone testing on the cluster.
- insert/analyze queries over non-partitioned/partitioned tables

NOTE. For the correct behaviour, the primary_key index needs to be created, or 
the PARTITION_STAT_TABLE table dropped - which triggers creation of the table 
with the constraint declared.


Thanks,

Tomasz



 reduce workload generated by JDBCStatsPublisher
 ---

 Key: HIVE-2144
 URL: https://issues.apache.org/jira/browse/HIVE-2144
 Project: Hive
  Issue Type: Improvement
Reporter: Ning Zhang
Assignee: Tomasz Nykiel
 Attachments: HIVE-2144.patch


 In JDBCStatsPublisher, we first try a SELECT query to see if the specific ID 
 was inserted by another task (most likely a speculative or previously 
 failed task). Depending on whether the ID is there, an INSERT or UPDATE query is 
 issued. So there are basically two queries per row inserted into the 
 intermediate stats table. This workload could be halved if we insert 
 it anyway (it is very rare that IDs are duplicated) and use a different SQL 
 query in the aggregation phase to dedup the ID (e.g., using group-by and 
 max()). The benefit is that even though the aggregation query is more 
 expensive, it is only run once per query. 
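
 The dedup-at-aggregation idea can be sketched in plain Java (an illustration only; the Row type, the aggregate method, and the prefix filter are hypothetical and simply mirror a GROUP BY ID with MAX(ROW_COUNT) followed by a sum):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of "insert anyway, dedup in the aggregation phase".
class AggregateSketch {

    static class Row {
        final String id;
        final long rowCount;

        Row(String id, long rowCount) {
            this.id = id;
            this.rowCount = rowCount;
        }
    }

    // Keeps the maximum ROW_COUNT per ID among rows whose ID starts with
    // keyPrefix (the group-by/max() step), then sums those maxima.
    static long aggregate(List<Row> rows, String keyPrefix) {
        Map<String, Long> maxPerId = new HashMap<>();
        for (Row r : rows) {
            if (r.id.startsWith(keyPrefix)) {
                maxPerId.merge(r.id, r.rowCount, Math::max);
            }
        }
        long total = 0;
        for (long v : maxPerId.values()) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        List<Row> rows = new ArrayList<>();
        rows.add(new Row("file_000_1", 10));
        rows.add(new Row("file_000_1", 12)); // duplicate ID from a retried task
        rows.add(new Row("file_000_2", 5));
        rows.add(new Row("other_1", 7));
        System.out.println(aggregate(rows, "file_000"));
    }
}
```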



[jira] [Updated] (HIVE-2144) reduce workload generated by JDBCStatsPublisher

2011-05-20 Thread Tomasz Nykiel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomasz Nykiel updated HIVE-2144:


Attachment: HIVE-2144.1.patch

Fixed after revision 1.

 reduce workload generated by JDBCStatsPublisher
 ---

 Key: HIVE-2144
 URL: https://issues.apache.org/jira/browse/HIVE-2144
 Project: Hive
  Issue Type: Improvement
Reporter: Ning Zhang
Assignee: Tomasz Nykiel
 Attachments: HIVE-2144.1.patch, HIVE-2144.patch


 In JDBCStatsPublisher, we first try a SELECT query to see if the specific ID 
 was inserted by another task (most likely a speculative or previously 
 failed task). Depending on whether the ID is there, an INSERT or UPDATE query is 
 issued. So there are basically two queries per row inserted into the 
 intermediate stats table. This workload could be halved if we insert 
 it anyway (it is very rare that IDs are duplicated) and use a different SQL 
 query in the aggregation phase to dedup the ID (e.g., using group-by and 
 max()). The benefit is that even though the aggregation query is more 
 expensive, it is only run once per query. 
