[jira] [Commented] (HIVE-9674) *DropPartitionEvent should handle partition-sets.

2015-03-25 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14380320#comment-14380320
 ] 

Mithun Radhakrishnan commented on HIVE-9674:


Sush, could you please review this one? I'd like to avoid another rebase.

 *DropPartitionEvent should handle partition-sets.
 -

 Key: HIVE-9674
 URL: https://issues.apache.org/jira/browse/HIVE-9674
 Project: Hive
  Issue Type: Bug
  Components: Metastore
Affects Versions: 0.14.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-9674.2.patch, HIVE-9736.3.patch, HIVE-9736.4.patch


 Dropping a set of N partitions from a table currently results in N 
 DropPartitionEvents (and N PreDropPartitionEvents) being fired serially. This 
 is wasteful, especially so for large N. It also makes it impossible to even 
 try to run authorization-checks on all partitions in a batch.
 Taking the cue from HIVE-9609, we should compose an {{Iterable<Partition>}} 
 in the event, and expose them via an {{Iterator}}.
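(For illustration only: a minimal sketch, with a made-up class name, of the event shape the description above proposes. The partition-set is held once and handed to listeners through an Iterator, so the whole batch can be authorized in one pass.)

{code:java}
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch, not the actual Hive patch: the event carries the whole
// partition-set and exposes it through an Iterator.
public class DropPartitionsEventSketch<P> {
  private final List<P> partitions;

  public DropPartitionsEventSketch(List<P> partitions) {
    this.partitions = partitions;
  }

  public Iterator<P> getPartitionIterator() {
    return Collections.unmodifiableList(partitions).iterator();
  }
}
{code}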



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9582) HCatalog should use IMetaStoreClient interface

2015-03-25 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381017#comment-14381017
 ] 

Mithun Radhakrishnan commented on HIVE-9582:


I wonder, should {{HCatUtils}} be package-protected?

 HCatalog should use IMetaStoreClient interface
 --

 Key: HIVE-9582
 URL: https://issues.apache.org/jira/browse/HIVE-9582
 Project: Hive
  Issue Type: Sub-task
  Components: HCatalog, Metastore
Affects Versions: 0.14.0, 0.13.1
Reporter: Thiruvel Thirumoolan
Assignee: Thiruvel Thirumoolan
  Labels: hcatalog, metastore, rolling_upgrade
 Attachments: HIVE-9582.1.patch, HIVE-9582.2.patch, HIVE-9582.3.patch, 
 HIVE-9582.4.patch, HIVE-9582.5.patch, HIVE-9583.1.patch


 Hive uses IMetaStoreClient and it makes using RetryingMetaStoreClient easy. 
 Hence during a failure, the client retries and possibly succeeds. But 
 HCatalog has long been using HiveMetaStoreClient directly and hence failures 
 are costly, especially if they are during the commit stage of a job. It's also 
 not possible to do a rolling upgrade of the MetaStore Server.
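(A hedged sketch of the difference described above; the exact {{RetryingMetaStoreClient.getProxy()}} overload varies across Hive versions, so treat the factory call below as an assumption rather than the patch's code.)

{code:java}
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.IMetaStoreClient;
import org.apache.hadoop.hive.metastore.RetryingMetaStoreClient;

public class MetaStoreClientSketch {
  // Programming against IMetaStoreClient lets the concrete client be a retrying
  // proxy; constructing HiveMetaStoreClient directly forgoes the retry wrapper.
  public static IMetaStoreClient newRetryingClient(HiveConf conf) throws Exception {
    // Assumed overload: getProxy(HiveConf, HiveMetaHookLoader, String); check the
    // Hive version in use before relying on this exact signature.
    return RetryingMetaStoreClient.getProxy(
        conf, tbl -> null /* no metastore hooks in this sketch */,
        HiveMetaStoreClient.class.getName());
  }
}
{code}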



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9845) HCatSplit repeats information making input split data size huge

2015-03-30 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-9845:
---
Attachment: HIVE-9845.2.patch

 HCatSplit repeats information making input split data size huge
 ---

 Key: HIVE-9845
 URL: https://issues.apache.org/jira/browse/HIVE-9845
 Project: Hive
  Issue Type: Bug
  Components: HCatalog
Reporter: Rohini Palaniswamy
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-9845.1.patch, HIVE-9845.2.patch


 Pig on Tez jobs with larger tables hit PIG-4443. Running on HDFS data which 
 has even triple the number of splits(100K+ splits and tasks) does not hit 
 that issue.
 {code}
 HCatBaseInputFormat.java:
   //Call getSplit on the InputFormat, create an
   //HCatSplit for each underlying split
   //NumSplits is 0 for our purposes
   org.apache.hadoop.mapred.InputSplit[] baseSplits = inputFormat.getSplits(jobConf, 0);
   for (org.apache.hadoop.mapred.InputSplit split : baseSplits) {
     splits.add(new HCatSplit(partitionInfo, split, allCols));
   }
 {code}
 Each hcatSplit duplicates partition schema and table schema.
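(A back-of-envelope sketch of why the duplication hurts at scale; the per-split schema size below is an assumed figure, and only the 100K+ split count comes from the description above.)

{code:java}
public class SplitOverheadEstimate {
  public static void main(String[] args) {
    long schemaBytesPerSplit = 40L * 1024;   // assumed serialized table+partition schema size
    long numSplits = 100_000L;               // "100K+ splits" from the description
    long duplicatedBytes = schemaBytesPerSplit * numSplits;
    // Roughly 4 GB of repeated schema shipped with the splits: the kind of
    // payload that pushes Pig-on-Tez planning into PIG-4443 territory.
    System.out.printf("Duplicated schema payload: %.1f GB%n", duplicatedBytes / 1e9);
  }
}
{code}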



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9845) HCatSplit repeats information making input split data size huge

2015-03-31 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-9845:
---
Attachment: (was: HIVE-9845.2.patch)

 HCatSplit repeats information making input split data size huge
 ---

 Key: HIVE-9845
 URL: https://issues.apache.org/jira/browse/HIVE-9845
 Project: Hive
  Issue Type: Bug
  Components: HCatalog
Reporter: Rohini Palaniswamy
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-9845.1.patch


 Pig on Tez jobs with larger tables hit PIG-4443. Running on HDFS data which 
 has even triple the number of splits(100K+ splits and tasks) does not hit 
 that issue.
 {code}
 HCatBaseInputFormat.java:
   //Call getSplit on the InputFormat, create an
   //HCatSplit for each underlying split
   //NumSplits is 0 for our purposes
   org.apache.hadoop.mapred.InputSplit[] baseSplits = inputFormat.getSplits(jobConf, 0);
   for (org.apache.hadoop.mapred.InputSplit split : baseSplits) {
     splits.add(new HCatSplit(partitionInfo, split, allCols));
   }
 {code}
 Each hcatSplit duplicates partition schema and table schema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9845) HCatSplit repeats information making input split data size huge

2015-03-31 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-9845:
---
Attachment: HIVE-9845.3.patch

Another take on the first patch, this time with more logging and a correction to 
{{TestHCatOutputFormat}}.

 HCatSplit repeats information making input split data size huge
 ---

 Key: HIVE-9845
 URL: https://issues.apache.org/jira/browse/HIVE-9845
 Project: Hive
  Issue Type: Bug
  Components: HCatalog
Reporter: Rohini Palaniswamy
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-9845.1.patch, HIVE-9845.3.patch


 Pig on Tez jobs with larger tables hit PIG-4443. Running on HDFS data which 
 has even triple the number of splits(100K+ splits and tasks) does not hit 
 that issue.
 {code}
 HCatBaseInputFormat.java:
   //Call getSplit on the InputFormat, create an
   //HCatSplit for each underlying split
   //NumSplits is 0 for our purposes
   org.apache.hadoop.mapred.InputSplit[] baseSplits = inputFormat.getSplits(jobConf, 0);
   for (org.apache.hadoop.mapred.InputSplit split : baseSplits) {
     splits.add(new HCatSplit(partitionInfo, split, allCols));
   }
 {code}
 Each hcatSplit duplicates partition schema and table schema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9845) HCatSplit repeats information making input split data size huge

2015-03-31 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388925#comment-14388925
 ] 

Mithun Radhakrishnan commented on HIVE-9845:


Bah, finally. Unrelated test-failures.

 HCatSplit repeats information making input split data size huge
 ---

 Key: HIVE-9845
 URL: https://issues.apache.org/jira/browse/HIVE-9845
 Project: Hive
  Issue Type: Bug
  Components: HCatalog
Reporter: Rohini Palaniswamy
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-9845.1.patch, HIVE-9845.3.patch


 Pig on Tez jobs with larger tables hit PIG-4443. Running on HDFS data which 
 has even triple the number of splits(100K+ splits and tasks) does not hit 
 that issue.
 {code}
 HCatBaseInputFormat.java:
   //Call getSplit on the InputFormat, create an
   //HCatSplit for each underlying split
   //NumSplits is 0 for our purposes
   org.apache.hadoop.mapred.InputSplit[] baseSplits = inputFormat.getSplits(jobConf, 0);
   for (org.apache.hadoop.mapred.InputSplit split : baseSplits) {
     splits.add(new HCatSplit(partitionInfo, split, allCols));
   }
 {code}
 Each hcatSplit duplicates partition schema and table schema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9845) HCatSplit repeats information making input split data size huge

2015-03-31 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-9845:
---
Attachment: (was: HIVE-9845.3.patch)

 HCatSplit repeats information making input split data size huge
 ---

 Key: HIVE-9845
 URL: https://issues.apache.org/jira/browse/HIVE-9845
 Project: Hive
  Issue Type: Bug
  Components: HCatalog
Reporter: Rohini Palaniswamy
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-9845.1.patch, HIVE-9845.3.patch


 Pig on Tez jobs with larger tables hit PIG-4443. Running on HDFS data which 
 has even triple the number of splits(100K+ splits and tasks) does not hit 
 that issue.
 {code}
 HCatBaseInputFormat.java:
   //Call getSplit on the InputFormat, create an
   //HCatSplit for each underlying split
   //NumSplits is 0 for our purposes
   org.apache.hadoop.mapred.InputSplit[] baseSplits = inputFormat.getSplits(jobConf, 0);
   for (org.apache.hadoop.mapred.InputSplit split : baseSplits) {
     splits.add(new HCatSplit(partitionInfo, split, allCols));
   }
 {code}
 Each hcatSplit duplicates partition schema and table schema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9845) HCatSplit repeats information making input split data size huge

2015-03-31 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-9845:
---
Attachment: HIVE-9845.3.patch

 HCatSplit repeats information making input split data size huge
 ---

 Key: HIVE-9845
 URL: https://issues.apache.org/jira/browse/HIVE-9845
 Project: Hive
  Issue Type: Bug
  Components: HCatalog
Reporter: Rohini Palaniswamy
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-9845.1.patch, HIVE-9845.3.patch


 Pig on Tez jobs with larger tables hit PIG-4443. Running on HDFS data which 
 has even triple the number of splits(100K+ splits and tasks) does not hit 
 that issue.
 {code}
 HCatBaseInputFormat.java:
   //Call getSplit on the InputFormat, create an
   //HCatSplit for each underlying split
   //NumSplits is 0 for our purposes
   org.apache.hadoop.mapred.InputSplit[] baseSplits = inputFormat.getSplits(jobConf, 0);
   for (org.apache.hadoop.mapred.InputSplit split : baseSplits) {
     splits.add(new HCatSplit(partitionInfo, split, allCols));
   }
 {code}
 Each hcatSplit duplicates partition schema and table schema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9674) *DropPartitionEvent should handle partition-sets.

2015-03-02 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-9674:
---
Attachment: HIVE-9736.3.patch

[~cdrome] let me know (thank you!) that I'd neglected to change 
{{TestMetaStoreEventListener}} for this change. Here's the emended patch.

 *DropPartitionEvent should handle partition-sets.
 -

 Key: HIVE-9674
 URL: https://issues.apache.org/jira/browse/HIVE-9674
 Project: Hive
  Issue Type: Bug
  Components: Metastore
Affects Versions: 0.14.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-9674.2.patch, HIVE-9736.3.patch


 Dropping a set of N partitions from a table currently results in N 
 DropPartitionEvents (and N PreDropPartitionEvents) being fired serially. This 
 is wasteful, especially so for large N. It also makes it impossible to even 
 try to run authorization-checks on all partitions in a batch.
 Taking the cue from HIVE-9609, we should compose an {{Iterable<Partition>}} 
 in the event, and expose them via an {{Iterator}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9118) Support auto-purge for tables, when dropping tables/partitions.

2015-03-03 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14345436#comment-14345436
 ] 

Mithun Radhakrishnan commented on HIVE-9118:


Thanks for the review and commit, sir. Much appreciated.

Could I please bother you for advice on HIVE-9086? We're having trouble 
reaching consensus on what the grammar should look like, for {{DROP PARTITIONS 
... PURGE}}.

 Support auto-purge for tables, when dropping tables/partitions.
 ---

 Key: HIVE-9118
 URL: https://issues.apache.org/jira/browse/HIVE-9118
 Project: Hive
  Issue Type: Bug
  Components: Metastore
Affects Versions: 1.0.0, 1.1
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Fix For: 1.2.0

 Attachments: HIVE-9118.1.patch, HIVE-9118.2.patch, HIVE-9118.3.patch


 HIVE-7100 introduced a way to skip the trash directory, when deleting 
 table-data, while dropping tables.
 In HIVE-9083/HIVE-9086, I extended this to work when partitions are dropped.
 Here, I propose a table-parameter ({{auto.purge}}) to set up tables to 
 skip-trash when table/partition data is deleted, without needing to say 
 PURGE on the Hive CLI. Apropos, on {{dropTable()}} and {{dropPartition()}}, 
 table data is deleted directly (and not moved to trash) if the following hold 
 true:
 # The table is MANAGED.
 # The {{deleteData}} parameter to the {{HMSC.drop*()}} methods is true.
 # Either PURGE is explicitly specified on the command-line (or rather, 
 {{ifPurge}} is set in the environment context), OR
 # TBLPROPERTIES contains {{auto.purge=true}}
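(For illustration, a sketch of how the proposed property is used; the table name is made up, and the statements follow the behaviour described above.)

{code:sql}
-- Opt the managed table into skip-trash deletes, without PURGE on the CLI.
CREATE TABLE my_auto_purge_table (foo STRING)
  PARTITIONED BY (dt STRING)
  TBLPROPERTIES ('auto.purge'='true');

-- With auto.purge=true, the partition data is deleted directly instead of
-- being moved to the trash directory.
ALTER TABLE my_auto_purge_table DROP IF EXISTS PARTITION (dt='20150101');
{code}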



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9674) *DropPartitionEvent should handle partition-sets.

2015-03-04 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-9674:
---
Attachment: HIVE-9736.4.patch

Here's an updated patch to decouple from HIVE-9609. One function is duplicated 
in {{JSONMessageFactory}}. (Sorry, Sush.) 

 *DropPartitionEvent should handle partition-sets.
 -

 Key: HIVE-9674
 URL: https://issues.apache.org/jira/browse/HIVE-9674
 Project: Hive
  Issue Type: Bug
  Components: Metastore
Affects Versions: 0.14.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-9674.2.patch, HIVE-9736.3.patch, HIVE-9736.4.patch


 Dropping a set of N partitions from a table currently results in N 
 DropPartitionEvents (and N PreDropPartitionEvents) being fired serially. This 
 is wasteful, especially so for large N. It also makes it impossible to even 
 try to run authorization-checks on all partitions in a batch.
 Taking the cue from HIVE-9609, we should compose an {{Iterable<Partition>}} 
 in the event, and expose them via an {{Iterator}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9629) HCatClient.dropPartitions() needs speeding up.

2015-02-23 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14333651#comment-14333651
 ] 

Mithun Radhakrishnan commented on HIVE-9629:


Just an update on performance numbers (a follow-on to those quoted in 
HIVE-9588):

1. Dropping 2K partitions from a managed Hive table took 204 seconds on my 
Hive/HCat test setup (with remote metastore, backed with Oracle).
2. HIVE-9588 reduced this to 83 seconds.
3. The combination of HIVE-9631, HIVE-9681 and HIVE-9736 has reduced this now 
to 16 seconds.
(The patch for HIVE-9631 isn't currently up. Selina has an internal patch that 
works with Oracle.)

I'll be testing this some more. In the meantime, I'd be grateful if the patches 
(other than HIVE-9631) could be reviewed.
 

 HCatClient.dropPartitions() needs speeding up.
 --

 Key: HIVE-9629
 URL: https://issues.apache.org/jira/browse/HIVE-9629
 Project: Hive
  Issue Type: Bug
  Components: HCatalog, Metastore
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan

 This is an über JIRA for the work required to speed up 
 HCatClient.dropPartitions().
 As it stands right now, {{dropPartitions()}} is slow because it takes N 
 thrift-calls to drop N partitions, and attempts to store all N partitions in 
 memory while it executes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9086) Add language support to PURGE data while dropping partitions.

2015-02-25 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14337520#comment-14337520
 ] 

Mithun Radhakrishnan commented on HIVE-9086:


Judging from [the 
patch|https://issues.apache.org/jira/secure/attachment/12670435/HIVE-7100.11.patch#file-12],
 HIVE-7100 added the drop-table-purge functionality to read thus:

{code:sql}
DROP TABLE IF EXISTS my_doomed_table PURGE;
{code}

The current alter table drop partitions reads as follows:

{code:sql}
ALTER TABLE my_doomed_table DROP IF EXISTS PARTITION (part_key = 'sayonara') 
IGNORE PROTECTION;
{code}

HIVE-9086 extends HIVE-7100's purge-functionality to partitions, and suggests 
that the {{PURGE}} keyword go at the end, thus:

{code:sql}
ALTER TABLE my_doomed_table DROP IF EXISTS PARTITION (part_key = 'sayonara') 
IGNORE PROTECTION PURGE;
{code}

Should {{PURGE}} sit before/after {{IF EXISTS}} or after {{IGNORE PROTECTION}}?

We can't break backward compatibility, so we shouldn't be changing what we 
released in 0.14.

 Add language support to PURGE data while dropping partitions.
 -

 Key: HIVE-9086
 URL: https://issues.apache.org/jira/browse/HIVE-9086
 Project: Hive
  Issue Type: Bug
  Components: Metastore
Affects Versions: 0.15.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-9086.1.patch


 HIVE-9083 adds metastore-support to skip-trash while dropping partitions. 
 This patch includes language support to do the same.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-10227) Concrete implementation of Export/Import based ReplicationTaskFactory

2015-04-19 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-10227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14502057#comment-14502057
 ] 

Mithun Radhakrishnan commented on HIVE-10227:
-

+1 (for the first time :]).

I've updated the review-board. I'll close the review.

 Concrete implementation of Export/Import based ReplicationTaskFactory
 -

 Key: HIVE-10227
 URL: https://issues.apache.org/jira/browse/HIVE-10227
 Project: Hive
  Issue Type: Sub-task
  Components: Import/Export
Affects Versions: 1.2.0
Reporter: Sushanth Sowmyan
Assignee: Sushanth Sowmyan
 Attachments: HIVE-10227.2.patch, HIVE-10227.3.patch, HIVE-10227.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9736) StorageBasedAuthProvider should batch namenode-calls where possible.

2015-04-29 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518834#comment-14518834
 ] 

Mithun Radhakrishnan commented on HIVE-9736:


Hello, Chris. 

bq. ... we can combine the multiple actions by using FsAction#or, and then call 
accessMethod.invoke just once...

Yikes! I might've missed incorporating that suggestion by accident. Thank you 
for following up. I'll update the patch shortly.
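(For reference, a small sketch of the combination being discussed; variable names are illustrative. {{FsAction}} values can be OR-ed into one requested action so the access check runs once instead of once per action.)

{code:java}
import org.apache.hadoop.fs.permission.FsAction;

public class FsActionCombineSketch {
  public static void main(String[] args) {
    // Combine the individually required actions into a single FsAction, so the
    // reflective access check ("accessMethod.invoke" above) is called only once.
    FsAction required = FsAction.READ.or(FsAction.EXECUTE);
    System.out.println("Combined required action: " + required);  // READ_EXECUTE
  }
}
{code}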

 StorageBasedAuthProvider should batch namenode-calls where possible.
 

 Key: HIVE-9736
 URL: https://issues.apache.org/jira/browse/HIVE-9736
 Project: Hive
  Issue Type: Bug
  Components: Metastore, Security
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-9736.1.patch, HIVE-9736.2.patch, HIVE-9736.3.patch


 Consider a table partitioned by 2 keys (dt, region). Say a dt partition could 
 have 1 associated regions. Consider that the user does:
 {code:sql}
 ALTER TABLE my_table DROP PARTITION (dt='20150101');
 {code}
 As things stand now, {{StorageBasedAuthProvider}} will make individual 
 {{DistributedFileSystem.listStatus()}} calls for each partition-directory, 
 and authorize each one separately. It'd be faster to batch the calls, and 
 examine multiple FileStatus objects at once.
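(A minimal sketch of the batching idea, not the actual patch: the Hadoop {{FileSystem}} API accepts an array of paths, so the authorizer can gather all partition directories and examine the returned FileStatus objects together.)

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BatchedListStatusSketch {
  // Hypothetical helper: one listStatus() call over all partition directories,
  // instead of a separate call per directory.
  static FileStatus[] listPartitionDirs(Configuration conf, Path[] partitionDirs)
      throws java.io.IOException {
    FileSystem fs = FileSystem.get(conf);
    return fs.listStatus(partitionDirs);
  }
}
{code}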



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9736) StorageBasedAuthProvider should batch namenode-calls where possible.

2015-04-29 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-9736:
---
Attachment: HIVE-9736.5.patch

As per [~sushanth]'s suggestion, I've squashed the patches for HIVE-9681 and 
HIVE-9736 into a single one. This should allow the patch to apply to trunk, for 
tests.

(Good idea, Sush.)

 StorageBasedAuthProvider should batch namenode-calls where possible.
 

 Key: HIVE-9736
 URL: https://issues.apache.org/jira/browse/HIVE-9736
 Project: Hive
  Issue Type: Bug
  Components: Metastore, Security
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-9736.1.patch, HIVE-9736.2.patch, HIVE-9736.3.patch, 
 HIVE-9736.4.patch, HIVE-9736.5.patch


 Consider a table partitioned by 2 keys (dt, region). Say a dt partition could 
 have 1 associated regions. Consider that the user does:
 {code:sql}
 ALTER TABLE my_table DROP PARTITION (dt='20150101');
 {code}
 As things stand now, {{StorageBasedAuthProvider}} will make individual 
 {{DistributedFileSystem.listStatus()}} calls for each partition-directory, 
 and authorize each one separately. It'd be faster to batch the calls, and 
 examine multiple FileStatus objects at once.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9736) StorageBasedAuthProvider should batch namenode-calls where possible.

2015-04-29 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-9736:
---
Attachment: HIVE-9736.4.patch

The reason this patch didn't apply to trunk/ is that it depends on HIVE-9681. :/

Here's the patch that incorporates [~cnauroth]'s suggestion to combine 
{{FsActions}} into a single instance, to reduce RPCs. I'm afraid that still 
doesn't obviate the overload we added to {{Hadoop*Shims*}} since we needed a 
new overload anyway to pluralize the {{FileStatus}} argument.

The compromise in the patch is to reduce the RPC calls, but keep the overload.

 StorageBasedAuthProvider should batch namenode-calls where possible.
 

 Key: HIVE-9736
 URL: https://issues.apache.org/jira/browse/HIVE-9736
 Project: Hive
  Issue Type: Bug
  Components: Metastore, Security
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-9736.1.patch, HIVE-9736.2.patch, HIVE-9736.3.patch, 
 HIVE-9736.4.patch


 Consider a table partitioned by 2 keys (dt, region). Say a dt partition could 
 have 1 associated regions. Consider that the user does:
 {code:sql}
 ALTER TABLE my_table DROP PARTITION (dt='20150101');
 {code}
 As things stand now, {{StorageBasedAuthProvider}} will make individual 
 {{DistributedFileSystem.listStatus()}} calls for each partition-directory, 
 and authorize each one separately. It'd be faster to batch the calls, and 
 examine multiple FileStatus objects at once.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-10492) HCatClient.dropPartitions() should check its partition-spec arguments.

2015-04-26 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-10492:

Attachment: HIVE-10492.1.patch

 HCatClient.dropPartitions() should check its partition-spec arguments.
 --

 Key: HIVE-10492
 URL: https://issues.apache.org/jira/browse/HIVE-10492
 Project: Hive
  Issue Type: Bug
  Components: API, HCatalog
Affects Versions: 1.1.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-10492.1.patch


 {{HCatClient.dropPartitions()}} doesn't check the arguments in the 
 partition-spec. This can lead to a {{RuntimeException}} when partition-keys 
 are specified incorrectly.
 We should check the arguments _a priori_ and throw a descriptive 
 {{IllegalArgumentException}}.
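(A hedged sketch of the a-priori check being proposed; names are illustrative, not the HCatClient API.)

{code:java}
import java.util.Map;
import java.util.Set;

public class PartitionSpecValidatorSketch {
  // Verify that every key in the requested partition-spec is a real partition
  // column, and fail fast with a descriptive message instead of letting a
  // RuntimeException surface deep inside the drop call.
  static void validate(Map<String, String> partitionSpec, Set<String> partitionColumns) {
    for (String key : partitionSpec.keySet()) {
      if (!partitionColumns.contains(key)) {
        throw new IllegalArgumentException("Invalid partition-key: " + key
            + "; valid partition columns are " + partitionColumns);
      }
    }
  }
}
{code}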



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9736) StorageBasedAuthProvider should batch namenode-calls where possible.

2015-05-06 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14531885#comment-14531885
 ] 

Mithun Radhakrishnan commented on HIVE-9736:


@[~sushanth]: Quick question about the null-check: If the {{statuses}} are a 
result of {{FileSystem.listStatus(Path[])}}, then I don't see them being null, 
or returning null from {{FileStatus.getPath()}}. I think I might have missed 
the point you made.

 StorageBasedAuthProvider should batch namenode-calls where possible.
 

 Key: HIVE-9736
 URL: https://issues.apache.org/jira/browse/HIVE-9736
 Project: Hive
  Issue Type: Bug
  Components: Metastore, Security
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Fix For: 1.2.0

 Attachments: HIVE-9736.1.patch, HIVE-9736.2.patch, HIVE-9736.3.patch, 
 HIVE-9736.4.patch, HIVE-9736.5.patch, HIVE-9736.6.patch


 Consider a table partitioned by 2 keys (dt, region). Say a dt partition could 
 have 1 associated regions. Consider that the user does:
 {code:sql}
 ALTER TABLE my_table DROP PARTITION (dt='20150101');
 {code}
 As things stand now, {{StorageBasedAuthProvider}} will make individual 
 {{DistributedFileSystem.listStatus()}} calls for each partition-directory, 
 and authorize each one separately. It'd be faster to batch the calls, and 
 examine multiple FileStatus objects at once.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9736) StorageBasedAuthProvider should batch namenode-calls where possible.

2015-05-06 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-9736:
---
Attachment: HIVE-9736.7.patch

Changed to use the {{Path}} instead of {{FileStatus}}. Re-submitting for tests.

 StorageBasedAuthProvider should batch namenode-calls where possible.
 

 Key: HIVE-9736
 URL: https://issues.apache.org/jira/browse/HIVE-9736
 Project: Hive
  Issue Type: Bug
  Components: Metastore, Security
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Fix For: 1.2.0

 Attachments: HIVE-9736.1.patch, HIVE-9736.2.patch, HIVE-9736.3.patch, 
 HIVE-9736.4.patch, HIVE-9736.5.patch, HIVE-9736.6.patch, HIVE-9736.7.patch


 Consider a table partitioned by 2 keys (dt, region). Say a dt partition could 
 have 1 associated regions. Consider that the user does:
 {code:sql}
 ALTER TABLE my_table DROP PARTITION (dt='20150101');
 {code}
 As things stand now, {{StorageBasedAuthProvider}} will make individual 
 {{DistributedFileSystem.listStatus()}} calls for each partition-directory, 
 and authorize each one separately. It'd be faster to batch the calls, and 
 examine multiple FileStatus objects at once.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9736) StorageBasedAuthProvider should batch namenode-calls where possible.

2015-05-06 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14531851#comment-14531851
 ] 

Mithun Radhakrishnan commented on HIVE-9736:


[~spena], [~sushanth], thanks for reporting the bug. Sorry for the 
inconvenience. I'll update the patch and see if that sorts things out.

 StorageBasedAuthProvider should batch namenode-calls where possible.
 

 Key: HIVE-9736
 URL: https://issues.apache.org/jira/browse/HIVE-9736
 Project: Hive
  Issue Type: Bug
  Components: Metastore, Security
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Fix For: 1.2.0

 Attachments: HIVE-9736.1.patch, HIVE-9736.2.patch, HIVE-9736.3.patch, 
 HIVE-9736.4.patch, HIVE-9736.5.patch, HIVE-9736.6.patch


 Consider a table partitioned by 2 keys (dt, region). Say a dt partition could 
 have 1 associated regions. Consider that the user does:
 {code:sql}
 ALTER TABLE my_table DROP PARTITION (dt='20150101');
 {code}
 As things stand now, {{StorageBasedAuthProvider}} will make individual 
 {{DistributedFileSystem.listStatus()}} calls for each partition-directory, 
 and authorize each one separately. It'd be faster to batch the calls, and 
 examine multiple FileStatus objects at once.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9845) HCatSplit repeats information making input split data size huge

2015-05-05 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14529007#comment-14529007
 ] 

Mithun Radhakrishnan commented on HIVE-9845:


Here's the updated patch. Sorry for the delay.

 HCatSplit repeats information making input split data size huge
 ---

 Key: HIVE-9845
 URL: https://issues.apache.org/jira/browse/HIVE-9845
 Project: Hive
  Issue Type: Bug
  Components: HCatalog
Reporter: Rohini Palaniswamy
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-9845.1.patch, HIVE-9845.3.patch, HIVE-9845.4.patch, 
 HIVE-9845.5.patch


 Pig on Tez jobs with larger tables hit PIG-4443. Running on HDFS data which 
 has even triple the number of splits(100K+ splits and tasks) does not hit 
 that issue.
 {code}
 HCatBaseInputFormat.java:
   //Call getSplit on the InputFormat, create an
   //HCatSplit for each underlying split
   //NumSplits is 0 for our purposes
   org.apache.hadoop.mapred.InputSplit[] baseSplits = inputFormat.getSplits(jobConf, 0);
   for (org.apache.hadoop.mapred.InputSplit split : baseSplits) {
     splits.add(new HCatSplit(partitionInfo, split, allCols));
   }
 {code}
 Each hcatSplit duplicates partition schema and table schema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-10598) Vectorization borks when column is added to table.

2015-05-04 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-10598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14527304#comment-14527304
 ] 

Mithun Radhakrishnan commented on HIVE-10598:
-

Tagging Matt, who's the expert on this.

 Vectorization borks when column is added to table.
 --

 Key: HIVE-10598
 URL: https://issues.apache.org/jira/browse/HIVE-10598
 Project: Hive
  Issue Type: Bug
  Components: Vectorization
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan

 Consider the following table definition:
 {code:sql}
 create table foobar ( foo string, bar string ) partitioned by (dt string) 
 stored as orc;
 alter table foobar add partition( dt='20150101' ) ;
 {code}
 Say the partition has the following data:
 {noformat}
 1 one 20150101
 2 two 20150101
 3 three   20150101
 {noformat}
 If a new column is added to the table-schema (and the partition continues to 
 have the old schema), vectorized reads from the old partitions fail thus:
 {code:sql}
 alter table foobar add columns( goo string );
 select count(1) from foobar;
 {code}
 {code:title=stacktrace}
 java.lang.Exception: java.lang.RuntimeException: Error creating a batch
   at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
 Caused by: java.lang.RuntimeException: Error creating a batch
   at org.apache.hadoop.hive.ql.io.orc.VectorizedOrcInputFormat$VectorizedOrcRecordReader.createValue(VectorizedOrcInputFormat.java:114)
   at org.apache.hadoop.hive.ql.io.orc.VectorizedOrcInputFormat$VectorizedOrcRecordReader.createValue(VectorizedOrcInputFormat.java:52)
   at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.createValue(CombineHiveRecordReader.java:84)
   at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.createValue(CombineHiveRecordReader.java:42)
   at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.createValue(HadoopShimsSecure.java:156)
   at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.createValue(MapTask.java:180)
   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
   at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:744)
 Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: No type entry found for column 3 in map {4=Long}
   at org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatchCtx.addScratchColumnsToBatch(VectorizedRowBatchCtx.java:632)
   at org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatchCtx.createVectorizedRowBatch(VectorizedRowBatchCtx.java:343)
   at org.apache.hadoop.hive.ql.io.orc.VectorizedOrcInputFormat$VectorizedOrcRecordReader.createValue(VectorizedOrcInputFormat.java:112)
   ... 14 more
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-10598) Vectorization borks when column is added to table.

2015-05-04 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-10598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14527529#comment-14527529
 ] 

Mithun Radhakrishnan commented on HIVE-10598:
-

Hey, Gopal. Thanks for looking at this. I'm afraid I'm seeing this on trunk and 
on YHive 0.13 (which is very close to 0.14). The stack trace is for trunk. 

 Vectorization borks when column is added to table.
 --

 Key: HIVE-10598
 URL: https://issues.apache.org/jira/browse/HIVE-10598
 Project: Hive
  Issue Type: Bug
  Components: Vectorization
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan

 Consider the following table definition:
 {code:sql}
 create table foobar ( foo string, bar string ) partitioned by (dt string) 
 stored as orc;
 alter table foobar add partition( dt='20150101' ) ;
 {code}
 Say the partition has the following data:
 {noformat}
 1 one 20150101
 2 two 20150101
 3 three   20150101
 {noformat}
 If a new column is added to the table-schema (and the partition continues to 
 have the old schema), vectorized reads from the old partitions fail thus:
 {code:sql}
 alter table foobar add columns( goo string );
 select count(1) from foobar;
 {code}
 {code:title=stacktrace}
 java.lang.Exception: java.lang.RuntimeException: Error creating a batch
   at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
 Caused by: java.lang.RuntimeException: Error creating a batch
   at org.apache.hadoop.hive.ql.io.orc.VectorizedOrcInputFormat$VectorizedOrcRecordReader.createValue(VectorizedOrcInputFormat.java:114)
   at org.apache.hadoop.hive.ql.io.orc.VectorizedOrcInputFormat$VectorizedOrcRecordReader.createValue(VectorizedOrcInputFormat.java:52)
   at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.createValue(CombineHiveRecordReader.java:84)
   at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.createValue(CombineHiveRecordReader.java:42)
   at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.createValue(HadoopShimsSecure.java:156)
   at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.createValue(MapTask.java:180)
   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
   at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:744)
 Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: No type entry found for column 3 in map {4=Long}
   at org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatchCtx.addScratchColumnsToBatch(VectorizedRowBatchCtx.java:632)
   at org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatchCtx.createVectorizedRowBatch(VectorizedRowBatchCtx.java:343)
   at org.apache.hadoop.hive.ql.io.orc.VectorizedOrcInputFormat$VectorizedOrcRecordReader.createValue(VectorizedOrcInputFormat.java:112)
   ... 14 more
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9736) StorageBasedAuthProvider should batch namenode-calls where possible.

2015-05-04 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-9736:
---
Attachment: HIVE-9736.6.patch

Rebased for the master branch.

 StorageBasedAuthProvider should batch namenode-calls where possible.
 

 Key: HIVE-9736
 URL: https://issues.apache.org/jira/browse/HIVE-9736
 Project: Hive
  Issue Type: Bug
  Components: Metastore, Security
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-9736.1.patch, HIVE-9736.2.patch, HIVE-9736.3.patch, 
 HIVE-9736.4.patch, HIVE-9736.5.patch, HIVE-9736.6.patch


 Consider a table partitioned by 2 keys (dt, region). Say a dt partition could 
 have 1 associated regions. Consider that the user does:
 {code:sql}
 ALTER TABLE my_table DROP PARTITION (dt='20150101');
 {code}
 As things stand now, {{StorageBasedAuthProvider}} will make individual 
 {{DistributedFileSystem.listStatus()}} calls for each partition-directory, 
 and authorize each one separately. It'd be faster to batch the calls, and 
 examine multiple FileStatus objects at once.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9845) HCatSplit repeats information making input split data size huge

2015-05-05 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14528830#comment-14528830
 ] 

Mithun Radhakrishnan commented on HIVE-9845:


I'll upload a new patch shortly.

 HCatSplit repeats information making input split data size huge
 ---

 Key: HIVE-9845
 URL: https://issues.apache.org/jira/browse/HIVE-9845
 Project: Hive
  Issue Type: Bug
  Components: HCatalog
Reporter: Rohini Palaniswamy
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-9845.1.patch, HIVE-9845.3.patch, HIVE-9845.4.patch


 Pig on Tez jobs with larger tables hit PIG-4443. Running on HDFS data which 
 has even triple the number of splits(100K+ splits and tasks) does not hit 
 that issue.
 {code}
 HCatBaseInputFormat.java:
   //Call getSplit on the InputFormat, create an
   //HCatSplit for each underlying split
   //NumSplits is 0 for our purposes
   org.apache.hadoop.mapred.InputSplit[] baseSplits = inputFormat.getSplits(jobConf, 0);
   for (org.apache.hadoop.mapred.InputSplit split : baseSplits) {
     splits.add(new HCatSplit(partitionInfo, split, allCols));
   }
 {code}
 Each hcatSplit duplicates partition schema and table schema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9736) StorageBasedAuthProvider should batch namenode-calls where possible.

2015-05-04 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14527314#comment-14527314
 ] 

Mithun Radhakrishnan commented on HIVE-9736:


Thanks for the review, Chris. 

Would it be alright if we moved the {{combine()}} code to a common place as 
part of a separate JIRA? I didn't do this here because both call-sites are in 
different packages, and adding a dependency would be involved.

 StorageBasedAuthProvider should batch namenode-calls where possible.
 

 Key: HIVE-9736
 URL: https://issues.apache.org/jira/browse/HIVE-9736
 Project: Hive
  Issue Type: Bug
  Components: Metastore, Security
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-9736.1.patch, HIVE-9736.2.patch, HIVE-9736.3.patch, 
 HIVE-9736.4.patch, HIVE-9736.5.patch


 Consider a table partitioned by 2 keys (dt, region). Say a dt partition could 
 have 1 associated regions. Consider that the user does:
 {code:sql}
 ALTER TABLE my_table DROP PARTITION (dt='20150101');
 {code}
 As things stand now, {{StorageBasedAuthProvider}} will make individual 
 {{DistributedFileSystem.listStatus()}} calls for each partition-directory, 
 and authorize each one separately. It'd be faster to batch the calls, and 
 examine multiple FileStatus objects at once.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9845) HCatSplit repeats information making input split data size huge

2015-05-05 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-9845:
---
Attachment: HIVE-9845.5.patch

 HCatSplit repeats information making input split data size huge
 ---

 Key: HIVE-9845
 URL: https://issues.apache.org/jira/browse/HIVE-9845
 Project: Hive
  Issue Type: Bug
  Components: HCatalog
Reporter: Rohini Palaniswamy
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-9845.1.patch, HIVE-9845.3.patch, HIVE-9845.4.patch, 
 HIVE-9845.5.patch


 Pig on Tez jobs with larger tables hit PIG-4443. Running on HDFS data which 
 has even triple the number of splits(100K+ splits and tasks) does not hit 
 that issue.
 {code}
 HCatBaseInputFormat.java:
   //Call getSplit on the InputFormat, create an
   //HCatSplit for each underlying split
   //NumSplits is 0 for our purposes
   org.apache.hadoop.mapred.InputSplit[] baseSplits = inputFormat.getSplits(jobConf, 0);
   for (org.apache.hadoop.mapred.InputSplit split : baseSplits) {
     splits.add(new HCatSplit(partitionInfo, split, allCols));
   }
 {code}
 Each hcatSplit duplicates partition schema and table schema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-10304) Add deprecation message to HiveCLI

2015-04-16 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14498900#comment-14498900
 ] 

Mithun Radhakrishnan commented on HIVE-10304:
-

Hello, [~szehon]. Pardon the delay; I wish I'd responded sooner.

There was a discussion on the dev user-list that concluded that we shouldn't be 
deprecating the Hive command-line until we have interface/error-code parity 
between beeline and the CLI. 
[Here|http://mail-archives.apache.org/mod_mbox/hive-dev/201412.mbox/%3ccabgngzfnjhnfv0p15+glmznf-gogw6dm9xotgoqh+dnyg3z...@mail.gmail.com%3E]
 is one thread. To quote:

bq. +1 to the idea of embedding beeline within hive cli, and retaining core 
behavior such as exit codes in hive-cli while doing that... users don't have to 
specify parameters like jdbc url, username etc.

The issue I see here is that there are still Hive installations that depend on 
the CLI, and don't depend entirely on HS2 deploys. (Where I work, for 
instance.) I'd be very keen to see the embedded-beeline option in working order.

Could we please discuss this check-in? I don't know if it's a good idea to push 
this into the impending release. I fear that the deprecation will be too 
disruptive, without proper recourse.

 Add deprecation message to HiveCLI
 --

 Key: HIVE-10304
 URL: https://issues.apache.org/jira/browse/HIVE-10304
 Project: Hive
  Issue Type: Improvement
  Components: CLI
Affects Versions: 1.1.0
Reporter: Szehon Ho
Assignee: Szehon Ho
  Labels: TODOC1.2
 Fix For: 1.2.0

 Attachments: HIVE-10304.2.patch, HIVE-10304.3.patch, HIVE-10304.patch


 As Beeline is now the recommended command line tool to Hive, we should add a 
 message to HiveCLI to indicate that it is deprecated and redirect them to 
 Beeline.  
 This is not suggesting to remove HiveCLI for now, but just a helpful 
 direction for user to know the direction to focus attention in Beeline.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9845) HCatSplit repeats information making input split data size huge

2015-04-11 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14491235#comment-14491235
 ] 

Mithun Radhakrishnan commented on HIVE-9845:


The failures are unrelated to the code-change.

 HCatSplit repeats information making input split data size huge
 ---

 Key: HIVE-9845
 URL: https://issues.apache.org/jira/browse/HIVE-9845
 Project: Hive
  Issue Type: Bug
  Components: HCatalog
Reporter: Rohini Palaniswamy
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-9845.1.patch, HIVE-9845.3.patch, HIVE-9845.4.patch


 Pig on Tez jobs with larger tables hit PIG-4443. Running on HDFS data which 
 has even triple the number of splits(100K+ splits and tasks) does not hit 
 that issue.
 {code}
 HCatBaseInputFormat.java:
   //Call getSplit on the InputFormat, create an
   //HCatSplit for each underlying split
   //NumSplits is 0 for our purposes
   org.apache.hadoop.mapred.InputSplit[] baseSplits = inputFormat.getSplits(jobConf, 0);
   for (org.apache.hadoop.mapred.InputSplit split : baseSplits) {
     splits.add(new HCatSplit(partitionInfo, split, allCols));
   }
 {code}
 Each hcatSplit duplicates partition schema and table schema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-10227) Concrete implementation of Export/Import based ReplicationTaskFactory

2015-04-18 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-10227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14501662#comment-14501662
 ] 

Mithun Radhakrishnan commented on HIVE-10227:
-

+0.9. I have some nitpicks. I've made mention on the [Review 
Board|https://reviews.apache.org/r/7/]. 

 Concrete implementation of Export/Import based ReplicationTaskFactory
 -

 Key: HIVE-10227
 URL: https://issues.apache.org/jira/browse/HIVE-10227
 Project: Hive
  Issue Type: Sub-task
  Components: Import/Export
Affects Versions: 1.2.0
Reporter: Sushanth Sowmyan
Assignee: Sushanth Sowmyan
 Attachments: HIVE-10227.2.patch, HIVE-10227.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-10213) MapReduce jobs using dynamic-partitioning fail on commit.

2015-04-03 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-10213:

Attachment: HIVE-10213.1.patch

 MapReduce jobs using dynamic-partitioning fail on commit.
 -

 Key: HIVE-10213
 URL: https://issues.apache.org/jira/browse/HIVE-10213
 Project: Hive
  Issue Type: Bug
  Components: HCatalog
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-10213.1.patch


 I recently ran into a problem in {{TaskCommitContextRegistry}}, when using 
 dynamic-partitions.
 Consider a MapReduce program that reads HCatRecords from a table (using 
 HCatInputFormat), and then writes to another table (with identical schema), 
 using HCatOutputFormat. The Map-task fails with the following exception:
 {code}
 Error: java.io.IOException: No callback registered for TaskAttemptID:attempt_1426589008676_509707_m_00_0@hdfs://crystalmyth.myth.net:8020/user/mithunr/mythdb/target/_DYN0.6784154320609959/grid=__HIVE_DEFAULT_PARTITION__/dt=__HIVE_DEFAULT_PARTITION__
   at org.apache.hive.hcatalog.mapreduce.TaskCommitContextRegistry.commitTask(TaskCommitContextRegistry.java:56)
   at org.apache.hive.hcatalog.mapreduce.FileOutputCommitterContainer.commitTask(FileOutputCommitterContainer.java:139)
   at org.apache.hadoop.mapred.Task.commit(Task.java:1163)
   at org.apache.hadoop.mapred.Task.done(Task.java:1025)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:345)
   at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1694)
   at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
 {code}
 {{TaskCommitContextRegistry::commitTask()}} uses call-backs registered from 
 {{DynamicPartitionFileRecordWriter}}. But in case {{HCatInputFormat}} and 
 {{HCatOutputFormat}} are both used in the same job, the 
 {{DynamicPartitionFileRecordWriter}} might only be exercised in the Reducer.
 I'm relaxing the IOException and logging a warning message instead of just 
 failing.
 (I'll post the fix shortly.)
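(A hedged sketch of the direction described above, not the committed patch: when no commit callback was registered for the task's output path, log a warning and continue rather than throwing.)

{code:java}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class CommitCallbackSketch {
  private static final Log LOG = LogFactory.getLog(CommitCallbackSketch.class);

  // "key" stands in for the TaskAttemptID@outputPath string in the error above.
  static void commitTask(String key, Runnable commitCallback) {
    if (commitCallback == null) {
      // Previously: throw new IOException("No callback registered for " + key);
      LOG.warn("No callback registered for " + key + "; skipping the commit callback.");
      return;
    }
    commitCallback.run();
  }
}
{code}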



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9609) AddPartitionMessage.getPartitions() can return null

2015-04-07 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14484171#comment-14484171
 ] 

Mithun Radhakrishnan commented on HIVE-9609:


@[~sushanth]: 

1-2. I agree with you, and hence, me again. (?!) {{List<List<PartVal>>}} might 
be doable, but we can hit that with a separate JIRA. The rest of the iterator 
stuff is pretty neat. I'll read through the updated patch more closely before 
+1-ing.
3. That was likely my (IDE's) doing. Much obliged, and many simultaneous 
apologies.

I had recommended a change to 
{{AuthorizationPreEventListener.authorizeAddPartition}} to use the alternative 
{{PartitionWrapper}} constructor. (It's way faster.) But again, it's possible 
that that change distracts from our objective here. Separate JIRA?

 AddPartitionMessage.getPartitions() can return null
 ---

 Key: HIVE-9609
 URL: https://issues.apache.org/jira/browse/HIVE-9609
 Project: Hive
  Issue Type: Bug
Affects Versions: 1.2.0
Reporter: Sushanth Sowmyan
Assignee: Sushanth Sowmyan
 Attachments: HIVE-9609.2.patch, HIVE-9609.3.patch, HIVE-9609.patch


 DbNotificationListener and NotificationListener both depend on 
 AddPartitionEvent.getPartitions() to get their partitions to trigger a 
 message, but this can be null if an AddPartitionEvent was initialized on a 
 PartitionSpec rather than a List<Partition>.
 Also, AddPartitionEvent seems to have a duality, where getPartitions() works 
 only if instantiated on a List<Partition>, and getPartitionIterator() works 
 only if instantiated on a PartitionSpec.
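(A small sketch, with placeholder types rather than the Hive classes, of the duality described above and the fallback a listener currently needs.)

{code:java}
import java.util.Iterator;
import java.util.List;

public class AddPartitionEventSketch<P> {
  private final List<P> partitionList;      // populated when built from a List<Partition>
  private final Iterator<P> partitionIter;  // populated when built from a PartitionSpec

  public AddPartitionEventSketch(List<P> list, Iterator<P> iter) {
    this.partitionList = list;
    this.partitionIter = iter;
  }

  // What a listener effectively has to do today: use getPartitions() when it is
  // non-null, and fall back to the iterator for PartitionSpec-backed events.
  public Iterator<P> partitions() {
    return (partitionList != null) ? partitionList.iterator() : partitionIter;
  }
}
{code}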



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (HIVE-10250) Optimize AuthorizationPreEventListener to reuse TableWrapper objects

2015-04-08 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan reassigned HIVE-10250:
---

Assignee: Mithun Radhakrishnan

 Optimize AuthorizationPreEventListener to reuse TableWrapper objects
 

 Key: HIVE-10250
 URL: https://issues.apache.org/jira/browse/HIVE-10250
 Project: Hive
  Issue Type: Bug
  Components: Authorization
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-10250.1.patch


 Here's the {{PartitionWrapper}} class in {{AuthorizationPreEventListener}}:
 {code:java|title=AuthorizationPreEventListener.java}
 public static class PartitionWrapper extends org.apache.hadoop.hive.ql.metadata.Partition {
   ...
   public PartitionWrapper(org.apache.hadoop.hive.metastore.api.Partition mapiPart,
                           PreEventContext context) throws ... {
     Partition wrapperApiPart = mapiPart.deepCopy();
     Table t = context.getHandler().get_table_core(
                   mapiPart.getDbName(),
                   mapiPart.getTableName());
     ...
   }
 {code}
 {{PreAddPartitionEvent}} (and soon, {{PreDropPartitionEvent}}) correspond not 
 just to a single partition, but an entire set of partitions added atomically. 
 When the event is authorized, {{HMSHandler.get_table_core()}} will be called 
 once for every partition in the Event instance.
 Since we already make the assumption that the partition-sets correspond to a 
 single table, we might as well make a single call.
 I'll have a patch for this, shortly.
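(A hypothetical sketch of the optimization proposed above, with placeholder types rather than the Hive classes: the Table is fetched once and reused for every partition wrapper.)

{code:java}
import java.util.ArrayList;
import java.util.List;

public class PartitionWrapperBatchSketch {
  static class Table { }
  static class Partition { String dbName; String tableName; }
  static class PartitionWrapper {
    final Partition partition;
    final Table table;
    PartitionWrapper(Partition p, Table t) { this.partition = p; this.table = t; }
  }
  interface TableLookup { Table getTableCore(String db, String table); }

  // All partitions in the event belong to one table, so call get_table_core()
  // (here, lookup.getTableCore) once instead of once per partition.
  static List<PartitionWrapper> wrapAll(List<Partition> parts, TableLookup lookup) {
    List<PartitionWrapper> wrappers = new ArrayList<PartitionWrapper>();
    if (parts.isEmpty()) {
      return wrappers;
    }
    Table t = lookup.getTableCore(parts.get(0).dbName, parts.get(0).tableName);
    for (Partition p : parts) {
      wrappers.add(new PartitionWrapper(p, t));  // reuse the same Table
    }
    return wrappers;
  }
}
{code}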



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9674) *DropPartitionEvent should handle partition-sets.

2015-04-08 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486145#comment-14486145
 ] 

Mithun Radhakrishnan commented on HIVE-9674:


Actually, [~sushanth], let's hold off for right now, on this one. I'll rebase 
this under the assumption that HIVE-9609 is good to go.

 *DropPartitionEvent should handle partition-sets.
 -

 Key: HIVE-9674
 URL: https://issues.apache.org/jira/browse/HIVE-9674
 Project: Hive
  Issue Type: Bug
  Components: Metastore
Affects Versions: 0.14.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-9674.2.patch


 Dropping a set of N partitions from a table currently results in N 
 DropPartitionEvents (and N PreDropPartitionEvents) being fired serially. This 
 is wasteful, especially so for large N. It also makes it impossible to even 
 try to run authorization-checks on all partitions in a batch.
 Taking the cue from HIVE-9609, we should compose an {{Iterable<Partition>}} 
 in the event, and expose them via an {{Iterator}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9674) *DropPartitionEvent should handle partition-sets.

2015-04-08 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-9674:
---
Attachment: HIVE-9674.3.patch

Rebased to accommodate HIVE-9609.

 *DropPartitionEvent should handle partition-sets.
 -

 Key: HIVE-9674
 URL: https://issues.apache.org/jira/browse/HIVE-9674
 Project: Hive
  Issue Type: Bug
  Components: Metastore
Affects Versions: 0.14.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-9674.2.patch, HIVE-9674.3.patch


 Dropping a set of N partitions from a table currently results in N 
 DropPartitionEvents (and N PreDropPartitionEvents) being fired serially. This 
 is wasteful, especially so for large N. It also makes it impossible to even 
 try to run authorization-checks on all partitions in a batch.
 Taking the cue from HIVE-9609, we should compose an {{Iterable<Partition>}} 
 in the event, and expose them via an {{Iterator}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9674) *DropPartitionEvent should handle partition-sets.

2015-04-08 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-9674:
---
Attachment: (was: HIVE-9736.3.patch)

 *DropPartitionEvent should handle partition-sets.
 -

 Key: HIVE-9674
 URL: https://issues.apache.org/jira/browse/HIVE-9674
 Project: Hive
  Issue Type: Bug
  Components: Metastore
Affects Versions: 0.14.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-9674.2.patch


 Dropping a set of N partitions from a table currently results in N 
 DropPartitionEvents (and N PreDropPartitionEvents) being fired serially. This 
 is wasteful, especially so for large N. It also makes it impossible to even 
 try to run authorization-checks on all partitions in a batch.
 Taking the cue from HIVE-9609, we should compose an {{Iterable<Partition>}} 
 in the event, and expose them via an {{Iterator}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9674) *DropPartitionEvent should handle partition-sets.

2015-04-08 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-9674:
---
Attachment: (was: HIVE-9736.4.patch)

 *DropPartitionEvent should handle partition-sets.
 -

 Key: HIVE-9674
 URL: https://issues.apache.org/jira/browse/HIVE-9674
 Project: Hive
  Issue Type: Bug
  Components: Metastore
Affects Versions: 0.14.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-9674.2.patch


 Dropping a set of N partitions from a table currently results in N 
 DropPartitionEvents (and N PreDropPartitionEvents) being fired serially. This 
 is wasteful, especially so for large N. It also makes it impossible to even 
 try to run authorization-checks on all partitions in a batch.
 Taking the cue from HIVE-9609, we should compose an {{Iterable<Partition>}} 
 in the event, and expose them via an {{Iterator}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9736) StorageBasedAuthProvider should batch namenode-calls where possible.

2015-04-08 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14486260#comment-14486260
 ] 

Mithun Radhakrishnan commented on HIVE-9736:


@[~cnauroth]: Good to meet you, sir. I'd value your input on this change, given 
that you've worked on the SBAP already.

bq. Great ideas in this patch!
Aww, shucks... You're only saying that because it's true. ;p 

I should have a rebased version for you shortly. I'd better sort HIVE-9674 out 
first.
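For context, the gist of the batching (a rough sketch; the two helpers are
made-up names, only {{FileSystem.listStatus(Path[])}} is a real API):

{code:java}
// Sketch: gather the partition directories first, then examine the resulting
// FileStatus objects together, instead of issuing one listStatus() call
// per partition directory.
Path[] partitionDirs = collectPartitionPaths(partitions);   // hypothetical helper
FileStatus[] statuses = fs.listStatus(partitionDirs);       // batched listing
for (FileStatus status : statuses) {
  checkWritePermission(status);                             // hypothetical per-entry check
}
{code}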

 StorageBasedAuthProvider should batch namenode-calls where possible.
 

 Key: HIVE-9736
 URL: https://issues.apache.org/jira/browse/HIVE-9736
 Project: Hive
  Issue Type: Bug
  Components: Metastore, Security
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-9736.1.patch


 Consider a table partitioned by 2 keys (dt, region). Say a dt partition could 
 have 1 associated regions. Consider that the user does:
 {code:sql}
 ALTER TABLE my_table DROP PARTITION (dt='20150101');
 {code}
 As things stand now, {{StorageBasedAuthProvider}} will make individual 
 {{DistributedFileSystem.listStatus()}} calls for each partition-directory, 
 and authorize each one separately. It'd be faster to batch the calls, and 
 examine multiple FileStatus objects at once.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-10761) Create codahale-based metrics system for Hive

2015-06-04 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-10761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14573613#comment-14573613
 ] 

Mithun Radhakrishnan commented on HIVE-10761:
-

Hey, Sush, Szehon. I can confirm that Yahoo cares about HS2 metrics. :p 

I'm not familiar with Codahale, but if it works with JMX, that's cool. Lemme do 
some homework. Thanks for the heads-up and the nifty addition, chaps.

 Create codahale-based metrics system for Hive
 -

 Key: HIVE-10761
 URL: https://issues.apache.org/jira/browse/HIVE-10761
 Project: Hive
  Issue Type: New Feature
  Components: Diagnosability
Reporter: Szehon Ho
Assignee: Szehon Ho
 Fix For: 1.3.0

 Attachments: HIVE-10761.2.patch, HIVE-10761.3.patch, 
 HIVE-10761.4.patch, HIVE-10761.5.patch, HIVE-10761.6.patch, HIVE-10761.patch, 
 hms-metrics.json


 There is a current Hive metrics system that hooks up to a JMX reporting, but 
 all its measurements, models are custom.
 This is to make another metrics system that will be based on Codahale (ie 
 yammer, dropwizard), which has the following advantage:
 * Well-defined metric model for frequently-needed metrics (ie JVM metrics)
 * Well-defined measurements for all metrics (ie max, mean, stddev, mean_rate, 
 etc), 
 * Built-in reporting frameworks like JMX, Console, Log, JSON webserver
 It is used for many projects, including several Apache projects like Oozie.  
 Overall, monitoring tools should find it easier to understand these common 
 metric, measurement, reporting models.
 The existing metric subsystem will be kept and can be enabled if backward 
 compatibility is desired.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-10761) Create codahale-based metrics system for Hive

2015-06-04 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-10761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14573617#comment-14573617
 ] 

Mithun Radhakrishnan commented on HIVE-10761:
-

Question: Are we proposing to deprecate the old metrics system on trunk? Which 
release are we targeting for deprecation and removal?

 Create codahale-based metrics system for Hive
 -

 Key: HIVE-10761
 URL: https://issues.apache.org/jira/browse/HIVE-10761
 Project: Hive
  Issue Type: New Feature
  Components: Diagnosability
Reporter: Szehon Ho
Assignee: Szehon Ho
 Fix For: 1.3.0

 Attachments: HIVE-10761.2.patch, HIVE-10761.3.patch, 
 HIVE-10761.4.patch, HIVE-10761.5.patch, HIVE-10761.6.patch, HIVE-10761.patch, 
 hms-metrics.json


 There is a current Hive metrics system that hooks up to a JMX reporting, but 
 all its measurements, models are custom.
 This is to make another metrics system that will be based on Codahale (ie 
 yammer, dropwizard), which has the following advantage:
 * Well-defined metric model for frequently-needed metrics (ie JVM metrics)
 * Well-defined measurements for all metrics (ie max, mean, stddev, mean_rate, 
 etc), 
 * Built-in reporting frameworks like JMX, Console, Log, JSON webserver
 It is used for many projects, including several Apache projects like Oozie.  
 Overall, monitoring tools should find it easier to understand these common 
 metric, measurement, reporting models.
 The existing metric subsystem will be kept and can be enabled if backward 
 compatibility is desired.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-10754) Pig+Hcatalog doesn't work properly since we need to clone the Job instance in HCatLoader

2015-06-04 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-10754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14573596#comment-14573596
 ] 

Mithun Radhakrishnan commented on HIVE-10754:
-

I see what we're trying to achieve, but I still need help understanding how 
this change fixes the problem. (Sorry. :/) 

Here's the relevant code from {{Job.java}} from Hadoop 2.6.

{code:java|title=Job.java|borderStyle=solid|borderColor=#ccc|titleBGColor=#F7D6C1|bgColor=#CE}
  @Deprecated
  public Job(Configuration conf) throws IOException {
this(new JobConf(conf));
  }

  Job(JobConf conf) throws IOException {
super(conf, null);
// propagate existing user credentials to job
this.credentials.mergeAll(this.ugi.getCredentials());
this.cluster = null;
  }

 public static Job getInstance(Configuration conf) throws IOException {
// create with a null Cluster
JobConf jobConf = new JobConf(conf);
return new Job(jobConf);
  }
{code}

# The current implementation of {{HCatLoader.setLocation()}} calls {{new Job( 
Configuration )}}, which clones the {{JobConf}} inline and calls the private 
constructor {{Job(JobConf)}}.
# Your improved implementation of {{HCatLoader.setLocation()}} calls 
{{Job.getInstance()}}. This method clones the {{JobConf}} explicitly, and then 
calls the private constructor {{Job(jobConf)}}.

bq. These two are different (JobConf is not cloned when we call new Job(conf)).
Both of these seem identical in effect to me. :/ There's no way for 
{{HCatLoader.setLocation()}} to call the {{Job(JobConf)}} constructor, because 
it's package-private, right?
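
To illustrate what I mean (a throwaway check, not part of any patch; assumes a
method that can throw {{IOException}}):

{code:java}
// Both constructions wrap the caller's Configuration in a fresh JobConf,
// so mutating the Job's configuration never touches the original 'conf'.
Configuration conf = new Configuration();

Job viaCtor = new Job(conf);               // deprecated: this(new JobConf(conf))
Job viaFactory = Job.getInstance(conf);    // new JobConf(conf), then Job(JobConf)

viaCtor.getConfiguration().set("marker", "a");
viaFactory.getConfiguration().set("marker", "b");
System.out.println(conf.get("marker"));    // still null -- neither call shares 'conf'
{code}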


 Pig+Hcatalog doesn't work properly since we need to clone the Job instance in 
 HCatLoader
 

 Key: HIVE-10754
 URL: https://issues.apache.org/jira/browse/HIVE-10754
 Project: Hive
  Issue Type: Sub-task
  Components: HCatalog
Affects Versions: 1.2.0
Reporter: Aihua Xu
Assignee: Aihua Xu
 Attachments: HIVE-10754.patch


 {noformat}
 Create table tbl1 (key string, value string) stored as rcfile;
 Create table tbl2 (key string, value string);
 insert into tbl1 values( '1', '111');
 insert into tbl2 values('1', '2');
 {noformat}
 Pig script:
 {noformat}
 src_tbl1 = FILTER tbl1 BY (key == '1');
 prj_tbl1 = FOREACH src_tbl1 GENERATE
key as tbl1_key,
value as tbl1_value,
'333' as tbl1_v1;

 src_tbl2 = FILTER tbl2 BY (key == '1');
 prj_tbl2 = FOREACH src_tbl2 GENERATE
key as tbl2_key,
value as tbl2_value;

 dump prj_tbl1;
 dump prj_tbl2;
 result = JOIN prj_tbl1 BY (tbl1_key), prj_tbl2 BY (tbl2_key);
 prj_result = FOREACH result 
   GENERATE  prj_tbl1::tbl1_key AS key1,
 prj_tbl1::tbl1_value AS value1,
 prj_tbl1::tbl1_v1 AS v1,
 prj_tbl2::tbl2_key AS key2,
 prj_tbl2::tbl2_value AS value2;

 dump prj_result;
 {noformat}
 The expected result is (1,111,333,1,2) while the result is (1,2,333,1,2).  We 
 need to clone the job instance in HCatLoader.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-10754) Pig+Hcatalog doesn't work properly since we need to clone the Job instance in HCatLoader

2015-06-04 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-10754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14573232#comment-14573232
 ] 

Mithun Radhakrishnan commented on HIVE-10754:
-

Hello, Aihua. I'm all for switching from the deprecated {{Job}} constructor to 
using {{Job.getInstance()}}.

But I am unable to understand how this changes/fixes anything. Both {{new 
Job(Configuration)}} and {{Job.getInstance(Configuration)}} seem to eventually 
use the package-private {{Job(JobConf)}} constructor. No later references to 
{{clone}} or {{job}} have been modified in {{HCatLoader.setLocation()}}.

Could you please explain your intention?

 Pig+Hcatalog doesn't work properly since we need to clone the Job instance in 
 HCatLoader
 

 Key: HIVE-10754
 URL: https://issues.apache.org/jira/browse/HIVE-10754
 Project: Hive
  Issue Type: Sub-task
  Components: HCatalog
Affects Versions: 1.2.0
Reporter: Aihua Xu
Assignee: Aihua Xu
 Attachments: HIVE-10754.patch


 {noformat}
 Create table tbl1 (key string, value string) stored as rcfile;
 Create table tbl2 (key string, value string);
 insert into tbl1 values( '1', '111');
 insert into tbl2 values('1', '2');
 {noformat}
 Pig script:
 {noformat}
 src_tbl1 = FILTER tbl1 BY (key == '1');
 prj_tbl1 = FOREACH src_tbl1 GENERATE
key as tbl1_key,
value as tbl1_value,
'333' as tbl1_v1;

 src_tbl2 = FILTER tbl2 BY (key == '1');
 prj_tbl2 = FOREACH src_tbl2 GENERATE
key as tbl2_key,
value as tbl2_value;

 dump prj_tbl1;
 dump prj_tbl2;
 result = JOIN prj_tbl1 BY (tbl1_key), prj_tbl2 BY (tbl2_key);
 prj_result = FOREACH result 
   GENERATE  prj_tbl1::tbl1_key AS key1,
 prj_tbl1::tbl1_value AS value1,
 prj_tbl1::tbl1_v1 AS v1,
 prj_tbl2::tbl2_key AS key2,
 prj_tbl2::tbl2_value AS value2;

 dump prj_result;
 {noformat}
 The expected result is (1,111,333,1,2) while the result is (1,2,333,1,2).  We 
 need to clone the job instance in HCatLoader.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-10752) Revert HIVE-5193

2015-06-01 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-10752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567892#comment-14567892
 ] 

Mithun Radhakrishnan commented on HIVE-10752:
-

Yes, of course. +1, as per 
[HIVE-10720|https://issues.apache.org/jira/browse/HIVE-10720?focusedCommentId=14565768page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14565768].

Let's circle back, after Viraj and I have identified why this isn't a problem 
with our internal Hive 0.13-0.14 branch. 

 Revert HIVE-5193
 

 Key: HIVE-10752
 URL: https://issues.apache.org/jira/browse/HIVE-10752
 Project: Hive
  Issue Type: Sub-task
  Components: HCatalog
Affects Versions: 1.2.0
Reporter: Aihua Xu
Assignee: Aihua Xu
 Attachments: HIVE-10752.patch


 Revert HIVE-5193 since it causes pig+hcatalog not working.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-10720) Pig using HCatLoader to access RCFile and perform join but get incorrect result.

2015-05-29 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-10720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14565594#comment-14565594
 ] 

Mithun Radhakrishnan commented on HIVE-10720:
-

:] Frustration aside, I'm completely open to reverting the patch if it's the 
right thing to do. (Incorrect results are a critical bug.) We're trying to make 
sure that we won't have to revert the revert.

Viraj has confirmed that there was a bug in his patch. He's uploading a new one 
shortly. If this doesn't sort out the issue you're facing, let's revert and 
postpone debate to a later time.

 Pig using HCatLoader to access RCFile and perform join but get incorrect 
 result.
 

 Key: HIVE-10720
 URL: https://issues.apache.org/jira/browse/HIVE-10720
 Project: Hive
  Issue Type: Bug
  Components: HCatalog
Affects Versions: 1.3.0
Reporter: Aihua Xu
Assignee: Aihua Xu
 Attachments: HIVE-10720.patch


 {noformat}
 Create table tbl1 (key string, value string) stored as rcfile;
 Create table tbl2 (key string, value string);
 insert into tbl1 values('1', 'value1');
 insert into tbl2 values('1', 'value2');
 {noformat}
 Pig script:
 {noformat}
 tbl1 = LOAD 'tbl1' USING org.apache.hive.hcatalog.pig.HCatLoader();
 tbl2 = LOAD 'tbl2' USING org.apache.hive.hcatalog.pig.HCatLoader();
 src_tbl1 = FILTER tbl1 BY (key == '1');
 prj_tbl1 = FOREACH src_tbl1 GENERATE
key as tbl1_key,
value as tbl1_value,
'333' as tbl1_v1;

 src_tbl2 = FILTER tbl2 BY (key == '1');
 prj_tbl2 = FOREACH src_tbl2 GENERATE
key as tbl2_key,
value as tbl2_value;

 result = JOIN prj_tbl1 BY (tbl1_key), prj_tbl2 BY (tbl2_key);
 prj_result = FOREACH result 
   GENERATE  prj_tbl1::tbl1_key AS key1,
 prj_tbl1::tbl1_value AS value1,
 prj_tbl1::tbl1_v1 AS v1,
 prj_tbl2::tbl2_key AS key2,
 prj_tbl2::tbl2_value AS value2;

 dump prj_result;
 {noformat}
 We could see different invalid results or even no result which should return.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-10720) Pig using HCatLoader to access RCFile and perform join but get incorrect result.

2015-05-29 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-10720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14565768#comment-14565768
 ] 

Mithun Radhakrishnan commented on HIVE-10720:
-

Ok. Looks like I have held you up long enough. If you've verified that this 
code path works without HIVE-5193, let's roll it back, and revisit this fix in 
a separate JIRA. We will try to identify how this works correctly on our internal 
branch. Viraj, does that sound ok?

Sorry for the delay. I applaud your diligence and patience, Aihua. Thank you. :]

 Pig using HCatLoader to access RCFile and perform join but get incorrect 
 result.
 

 Key: HIVE-10720
 URL: https://issues.apache.org/jira/browse/HIVE-10720
 Project: Hive
  Issue Type: Bug
  Components: HCatalog
Affects Versions: 1.3.0
Reporter: Aihua Xu
Assignee: Aihua Xu
 Attachments: HIVE-10720.patch


 {noformat}
 Create table tbl1 (key string, value string) stored as rcfile;
 Create table tbl2 (key string, value string);
 insert into tbl1 values('1', 'value1');
 insert into tbl2 values('1', 'value2');
 {noformat}
 Pig script:
 {noformat}
 tbl1 = LOAD 'tbl1' USING org.apache.hive.hcatalog.pig.HCatLoader();
 tbl2 = LOAD 'tbl2' USING org.apache.hive.hcatalog.pig.HCatLoader();
 src_tbl1 = FILTER tbl1 BY (key == '1');
 prj_tbl1 = FOREACH src_tbl1 GENERATE
key as tbl1_key,
value as tbl1_value,
'333' as tbl1_v1;

 src_tbl2 = FILTER tbl2 BY (key == '1');
 prj_tbl2 = FOREACH src_tbl2 GENERATE
key as tbl2_key,
value as tbl2_value;

 result = JOIN prj_tbl1 BY (tbl1_key), prj_tbl2 BY (tbl2_key);
 prj_result = FOREACH result 
   GENERATE  prj_tbl1::tbl1_key AS key1,
 prj_tbl1::tbl1_value AS value1,
 prj_tbl1::tbl1_v1 AS v1,
 prj_tbl2::tbl2_key AS key2,
 prj_tbl2::tbl2_value AS value2;

 dump prj_result;
 {noformat}
 We could see different invalid results or even no result which should return.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-10752) Revert HIVE-5193

2015-05-28 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-10752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14563191#comment-14563191
 ] 

Mithun Radhakrishnan commented on HIVE-10752:
-

[~aihuaxu], doesn't the patch that [~viraj] posted on HIVE-10720 sort this out?

 Revert HIVE-5193
 

 Key: HIVE-10752
 URL: https://issues.apache.org/jira/browse/HIVE-10752
 Project: Hive
  Issue Type: Sub-task
  Components: HCatalog
Affects Versions: 1.2.0
Reporter: Aihua Xu
Assignee: Aihua Xu
 Attachments: HIVE-10752.patch


 Revert HIVE-5193 since it causes pig+hcatalog not working.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-10752) Revert HIVE-5193

2015-05-29 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-10752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14565401#comment-14565401
 ] 

Mithun Radhakrishnan commented on HIVE-10752:
-


bq. Given that HIVE-5193 broke some functionality and it was just for columnar 
table performance improvement, in addition that patch provided in HIVE-10720 
did still not solve the issue.

While I agree that HIVE-5193 did introduce a bug, I can't yet agree that we 
should revert it. [~viraj] is currently testing whether the one-liner posted in 
HIVE-10720 resolves the issue. (My understanding is that it does.) 
I'll let him confirm shortly.

In the meantime, please consider that the fix 
({{ColumnProjectionUtils.setReadColumnIDs(job.getConfiguration(), null);}}) is 
only applied when {{requiredFieldsInfo == null}}, which is shorthand for Pig 
requiring all columns. So the full-column deserialization does not happen in 
all cases; it happens only when all fields are required, and there is no loss 
of performance in that case.

Am I missing something?
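
For reference, this is the shape of the guard I'm describing (paraphrased from
memory, not an exact excerpt of {{HCatLoader.setLocation()}}):

{code:java}
// Only when Pig asks for *all* columns (requiredFieldsInfo == null) do we
// reset the read-column IDs; projected reads keep their pruned column list.
if (requiredFieldsInfo == null) {
  ColumnProjectionUtils.setReadColumnIDs(job.getConfiguration(), null);
}
{code}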

 Revert HIVE-5193
 

 Key: HIVE-10752
 URL: https://issues.apache.org/jira/browse/HIVE-10752
 Project: Hive
  Issue Type: Sub-task
  Components: HCatalog
Affects Versions: 1.2.0
Reporter: Aihua Xu
Assignee: Aihua Xu
 Attachments: HIVE-10752.patch


 Revert HIVE-5193 since it causes pig+hcatalog not working.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-10720) Pig using HCatLoader to access RCFile and perform join but get incorrect result.

2015-05-29 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-10720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14565426#comment-14565426
 ] 

Mithun Radhakrishnan commented on HIVE-10720:
-

Hey, [~aihuaxu]. Could you please post a stack-trace for the NPE?

 Pig using HCatLoader to access RCFile and perform join but get incorrect 
 result.
 

 Key: HIVE-10720
 URL: https://issues.apache.org/jira/browse/HIVE-10720
 Project: Hive
  Issue Type: Bug
  Components: HCatalog
Affects Versions: 1.3.0
Reporter: Aihua Xu
Assignee: Aihua Xu
 Attachments: HIVE-10720.patch


 {noformat}
 Create table tbl1 (key string, value string) stored as rcfile;
 Create table tbl2 (key string, value string);
 insert into tbl1 values('1', 'value1');
 insert into tbl2 values('1', 'value2');
 {noformat}
 Pig script:
 {noformat}
 tbl1 = LOAD 'tbl1' USING org.apache.hive.hcatalog.pig.HCatLoader();
 tbl2 = LOAD 'tbl2' USING org.apache.hive.hcatalog.pig.HCatLoader();
 src_tbl1 = FILTER tbl1 BY (key == '1');
 prj_tbl1 = FOREACH src_tbl1 GENERATE
key as tbl1_key,
value as tbl1_value,
'333' as tbl1_v1;

 src_tbl2 = FILTER tbl2 BY (key == '1');
 prj_tbl2 = FOREACH src_tbl2 GENERATE
key as tbl2_key,
value as tbl2_value;

 result = JOIN prj_tbl1 BY (tbl1_key), prj_tbl2 BY (tbl2_key);
 prj_result = FOREACH result 
   GENERATE  prj_tbl1::tbl1_key AS key1,
 prj_tbl1::tbl1_value AS value1,
 prj_tbl1::tbl1_v1 AS v1,
 prj_tbl2::tbl2_key AS key2,
 prj_tbl2::tbl2_value AS value2;

 dump prj_result;
 {noformat}
 We could see different invalid results or even no result which should return.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-10752) Revert HIVE-5193

2015-05-24 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-10752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14557830#comment-14557830
 ] 

Mithun Radhakrishnan commented on HIVE-10752:
-

Sorry, chaps. I'm on vacation. Tagging [~cdrome], [~viraj] (who worked on the 
original bug HIVE-5193). I'm afraid I won't be able to look at this till 
Wednesday. 

 Revert HIVE-5193
 

 Key: HIVE-10752
 URL: https://issues.apache.org/jira/browse/HIVE-10752
 Project: Hive
  Issue Type: Sub-task
  Components: HCatalog
Affects Versions: 1.2.0
Reporter: Aihua Xu
Assignee: Aihua Xu
 Attachments: HIVE-10752.patch


 Revert HIVE-5193 since it causes pig+hcatalog not working.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-11470) NPE in DynamicPartFileRecordWriterContainer on null part-keys.

2015-08-05 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-11470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-11470:

Attachment: HIVE-11470.1.patch

Here's the tentative fix. This breaks in 1.2 as well, but on the bright side, 
there's no silent data-loss. The NPE speaks volumes.
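The gist of it (a sketch, not the patch itself; the literal default-partition
string below is an assumption -- the real value should come from the job's
configuration):

{code:java}
// Sketch of the null-guard in getLocalFileWriter(): substitute the default
// partition name instead of dereferencing a null dynamic-partition value.
List<String> dynamicPartValues = new ArrayList<String>();
for (Integer colToAppend : dynamicPartCols) {
  Object partValue = value.get(colToAppend);
  dynamicPartValues.add(partValue == null
      ? "__HIVE_DEFAULT_PARTITION__"    // assumed default; configurable in practice
      : partValue.toString());
}
{code}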

 NPE in DynamicPartFileRecordWriterContainer on null part-keys.
 --

 Key: HIVE-11470
 URL: https://issues.apache.org/jira/browse/HIVE-11470
 Project: Hive
  Issue Type: Bug
  Components: HCatalog
Affects Versions: 1.2.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-11470.1.patch


 When partitioning data using {{HCatStorer}}, one sees the following NPE, if 
 the dyn-part-key is of null-value:
 {noformat}
 2015-07-30 23:59:59,627 WARN [main] org.apache.hadoop.mapred.YarnChild: 
 Exception running child : java.io.IOException: java.lang.NullPointerException
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.runPipeline(PigGenericMapReduce.java:473)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.processOnePackageOutput(PigGenericMapReduce.java:436)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:416)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:256)
 at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
 at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
 at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
 at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1694)
 at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
 Caused by: java.lang.NullPointerException
 at 
 org.apache.hive.hcatalog.mapreduce.DynamicPartitionFileRecordWriterContainer.getLocalFileWriter(DynamicPartitionFileRecordWriterContainer.java:141)
 at 
 org.apache.hive.hcatalog.mapreduce.FileRecordWriterContainer.write(FileRecordWriterContainer.java:110)
 at 
 org.apache.hive.hcatalog.mapreduce.FileRecordWriterContainer.write(FileRecordWriterContainer.java:54)
 at 
 org.apache.hive.hcatalog.pig.HCatBaseStorer.putNext(HCatBaseStorer.java:309)
 at org.apache.hive.hcatalog.pig.HCatStorer.putNext(HCatStorer.java:61)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:139)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:98)
 at 
 org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:558)
 at 
 org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
 at 
 org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.write(WrappedReducer.java:105)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.runPipeline(PigGenericMapReduce.java:471)
 ... 11 more
 {noformat}
 The reason is that the {{DynamicPartitionFileRecordWriterContainer}} makes an 
 unfortunate assumption when fetching a local file-writer instance:
 {code:title=DynamicPartitionFileRecordWriterContainer.java}
   @Override
   protected LocalFileWriter getLocalFileWriter(HCatRecord value) 
 throws IOException, HCatException {
 
 OutputJobInfo localJobInfo = null;
 // Calculate which writer to use from the remaining values - this needs to
 // be done before we delete cols.
 List<String> dynamicPartValues = new ArrayList<String>();
 for (Integer colToAppend : dynamicPartCols) {
   dynamicPartValues.add(value.get(colToAppend).toString()); // <-- YIKES!
 }
 ...
   }
 {code}
 Must check for null, and substitute with 
 {{\_\_HIVE_DEFAULT_PARTITION\_\_}}, or equivalent.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-11456) HCatStorer should honor mapreduce.output.basename

2015-08-05 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14658981#comment-14658981
 ] 

Mithun Radhakrishnan commented on HIVE-11456:
-

Thanks for the heads-up, Sush. On the face of it, 
# In the first place, I'm not sure this'd be a problem, since Pig assumes 
insert-overwrite semantics, as opposed to insert into. But then again, this 
is dynamic-partitioning, so it's not like Pig could check the output directory, 
_a priori_.
# Pig 0.14 uses prefixes (vertex_id + edge_id) to make sure the file is unique. 
I don't foresee the suffix messing with it.

Permit me to ruminate on this.




 HCatStorer should honor mapreduce.output.basename
 -

 Key: HIVE-11456
 URL: https://issues.apache.org/jira/browse/HIVE-11456
 Project: Hive
  Issue Type: Bug
Affects Versions: 1.2.0
Reporter: Rohini Palaniswamy
Assignee: Mithun Radhakrishnan
Priority: Critical
 Fix For: 1.3.0, 1.2.1, 2.0.0

 Attachments: HIVE-11456.1.patch


 Pig on Tez scripts with union directly followed by HCatStorer have a problem 
 due to HCatStorer not honoring mapreduce.output.basename and always using 
 part. Tez sets mapreduce.output.basename to part-v000-o000 (vertex id 
 followed by output id). With union optimizer, Pig uses vertex groups to write 
 directly from both the vertices to the final output directory. Since hcat 
 ignores the mapreduce.output.basename, both the vertices produce 
 part-r-n and when they are moved from the temp location to the final 
 directory, they just overwrite each other. There is no failure and only one 
 of the files with that name makes it into the final directory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-11456) HCatStorer should honor mapreduce.output.basename

2015-08-04 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-11456:

Attachment: HIVE-11456.1.patch

The fix: the hard-coded {{part}} prefix in the file-names has been switched to 
use the conf-setting for {{mapreduce.output.basename}}.

Sorry for the delay.
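
In other words (a sketch of the idea, not an excerpt of the patch; {{jobConf}}
stands in for whatever configuration object the writer already holds):

{code:java}
// Read the output-file prefix from the job configuration instead of
// hard-coding "part"; Tez sets mapreduce.output.basename to something like
// "part-v000-o000", which keeps the files from the two vertices distinct.
String baseName = jobConf.get("mapreduce.output.basename", "part");
// ...the existing suffix logic (task-id, extension) then hangs off baseName.
{code}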

 HCatStorer should honor mapreduce.output.basename
 -

 Key: HIVE-11456
 URL: https://issues.apache.org/jira/browse/HIVE-11456
 Project: Hive
  Issue Type: Bug
Affects Versions: 1.2.0
Reporter: Rohini Palaniswamy
Assignee: Mithun Radhakrishnan
Priority: Critical
 Attachments: HIVE-11456.1.patch


 Pig on Tez scripts with union directly followed by HCatStorer have a problem 
 due to HCatStorer not honoring mapreduce.output.basename and always using 
 part. Tez sets mapreduce.output.basename to part-v000-o000 (vertex id 
 followed by output id). With union optimizer, Pig uses vertex groups to write 
 directly from both the vertices to the final output directory. Since hcat 
 ignores the mapreduce.output.basename, both the vertices produce 
 part-r-n and when they are moved from the temp location to the final 
 directory, they just overwrite each other. There is no failure and only one 
 of the files with that name makes it into the final directory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-11548) HCatLoader should support predicate pushdown.

2015-08-13 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-11548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-11548:

Attachment: HIVE-11548.1.patch

Here's a tentative implementation.

 HCatLoader should support predicate pushdown.
 -

 Key: HIVE-11548
 URL: https://issues.apache.org/jira/browse/HIVE-11548
 Project: Hive
  Issue Type: New Feature
  Components: HCatalog
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-11548.1.patch


 When one uses {{HCatInputFormat}}/{{HCatLoader}} to read from file-formats 
 that support predicate pushdown (such as ORC, with 
 {{hive.optimize.index.filter=true}}), one sees that the predicates aren't 
 actually pushed down into the storage layer.
 The forthcoming patch should allow for filter-pushdown, if any of the 
 partitions being scanned with {{HCatLoader}} support the functionality. The 
 patch should technically allow the same for users of {{HCatInputFormat}}, but 
 I don't currently have a neat interface to build a compound 
 predicate-expression. Will add this separately, if required.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-11344) HIVE-9845 makes HCatSplit.write modify the split so that PartInfo objects are unusable after it

2015-07-23 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14639272#comment-14639272
 ] 

Mithun Radhakrishnan commented on HIVE-11344:
-

Ah, that's a good point. I didn't realize that {{HCatSplit}} or {{PartInfo}} 
might be serialized in situations other than M/R / Tez serialization of splits.

At the time I wrote this, I did intend to check {{partitionSchema}}, 
{{inputFormatClassName}}, etc. for null, in their respective getters, and 
return the values from {{this.tableInfo}}. One optimization too far.

+1 to Solution (a).
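
i.e., roughly this shape in {{PartInfo}} (illustrative only; the table-level
accessors on the right-hand side are assumptions about where the equivalent
values live, not verified against the current code):

{code:java}
// Sketch: have each PartInfo getter fall back to the table-level value when
// the field was nulled out during HCatSplit serialization.
public HCatSchema getPartitionSchema() {
  return (partitionSchema != null) ? partitionSchema
                                   : tableInfo.getDataColumns();   // assumed equivalent
}

public String getInputFormatClassName() {
  return (inputFormatClassName != null) ? inputFormatClassName
                                        : tableInfo.getStorerInfo().getIfClass(); // assumption
}
{code}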

 HIVE-9845 makes HCatSplit.write modify the split so that PartInfo objects are 
 unusable after it
 ---

 Key: HIVE-11344
 URL: https://issues.apache.org/jira/browse/HIVE-11344
 Project: Hive
  Issue Type: Bug
Affects Versions: 1.2.0
Reporter: Sushanth Sowmyan
Assignee: Sushanth Sowmyan
 Attachments: HIVE-11344.patch


 HIVE-9845 introduced a notion of compression for HCatSplits so that when 
 serializing, it finds commonalities between PartInfo and TableInfo objects, 
 and if the two are identical, it nulls out that field in PartInfo, thus 
 making sure that when PartInfo is then serialized, info is not repeated.
 This, however, has the side effect of making the PartInfo object unusable if 
 HCatSplit.write has been called.
 While this does not affect M/R directly, since they do not know about the 
 PartInfo objects and once serialized, the HCatSplit object is recreated by 
 deserializing on the backend, which does restore the split and its PartInfo 
 objects, this does, however, affect framework users of HCat that try to mimic 
 M/R and then use the PartInfo objects to instantiate distinct readers.
 Thus, we need to make it so that PartInfo is still usable after 
 HCatSplit.write is called.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-11548) HCatLoader should support predicate pushdown.

2015-08-24 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-11548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-11548:

Attachment: (was: HIVE-11548.1.patch)

 HCatLoader should support predicate pushdown.
 -

 Key: HIVE-11548
 URL: https://issues.apache.org/jira/browse/HIVE-11548
 Project: Hive
  Issue Type: New Feature
  Components: HCatalog
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan

 When one uses {{HCatInputFormat}}/{{HCatLoader}} to read from file-formats 
 that support predicate pushdown (such as ORC, with 
 {{hive.optimize.index.filter=true}}), one sees that the predicates aren't 
 actually pushed down into the storage layer.
 The forthcoming patch should allow for filter-pushdown, if any of the 
 partitions being scanned with {{HCatLoader}} support the functionality. The 
 patch should technically allow the same for users of {{HCatInputFormat}}, but 
 I don't currently have a neat interface to build a compound 
 predicate-expression. Will add this separately, if required.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-11548) HCatLoader should support predicate pushdown.

2015-08-24 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-11548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-11548:

Attachment: HIVE-11548.1.patch

Corrected the bad code. Submitting for re-test.

 HCatLoader should support predicate pushdown.
 -

 Key: HIVE-11548
 URL: https://issues.apache.org/jira/browse/HIVE-11548
 Project: Hive
  Issue Type: New Feature
  Components: HCatalog
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-11548.1.patch


 When one uses {{HCatInputFormat}}/{{HCatLoader}} to read from file-formats 
 that support predicate pushdown (such as ORC, with 
 {{hive.optimize.index.filter=true}}), one sees that the predicates aren't 
 actually pushed down into the storage layer.
 The forthcoming patch should allow for filter-pushdown, if any of the 
 partitions being scanned with {{HCatLoader}} support the functionality. The 
 patch should technically allow the same for users of {{HCatInputFormat}}, but 
 I don't currently have a neat interface to build a compound 
 predicate-expression. Will add this separately, if required.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-11548) HCatLoader should support predicate pushdown.

2015-08-27 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717557#comment-14717557
 ] 

Mithun Radhakrishnan commented on HIVE-11548:
-

Alright. I was able to reproduce the 
{{TestHCatClient.testTableSchemaPropagation()}} problem. It seems to fail 
without this patch, so I'll work on that in a separate JIRA. I'm still having 
trouble getting {{TestPigHBaseStorageHandler}} to fail.

 HCatLoader should support predicate pushdown.
 -

 Key: HIVE-11548
 URL: https://issues.apache.org/jira/browse/HIVE-11548
 Project: Hive
  Issue Type: New Feature
  Components: HCatalog
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-11548.1.patch


 When one uses {{HCatInputFormat}}/{{HCatLoader}} to read from file-formats 
 that support predicate pushdown (such as ORC, with 
 {{hive.optimize.index.filter=true}}), one sees that the predicates aren't 
 actually pushed down into the storage layer.
 The forthcoming patch should allow for filter-pushdown, if any of the 
 partitions being scanned with {{HCatLoader}} support the functionality. The 
 patch should technically allow the same for users of {{HCatInputFormat}}, but 
 I don't currently have a neat interface to build a compound 
 predicate-expression. Will add this separately, if required.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-11548) HCatLoader should support predicate pushdown.

2015-09-02 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-11548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-11548:

Attachment: HIVE-11548.2.patch

Fixed failing tests. {{TestHCatClient}} needed fixing independently of this 
fix. I'm squeezing it into this JIRA.

> HCatLoader should support predicate pushdown.
> -
>
> Key: HIVE-11548
> URL: https://issues.apache.org/jira/browse/HIVE-11548
> Project: Hive
>  Issue Type: New Feature
>  Components: HCatalog
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-11548.1.patch, HIVE-11548.2.patch
>
>
> When one uses {{HCatInputFormat}}/{{HCatLoader}} to read from file-formats 
> that support predicate pushdown (such as ORC, with 
> {{hive.optimize.index.filter=true}}), one sees that the predicates aren't 
> actually pushed down into the storage layer.
> The forthcoming patch should allow for filter-pushdown, if any of the 
> partitions being scanned with {{HCatLoader}} support the functionality. The 
> patch should technically allow the same for users of {{HCatInputFormat}}, but 
> I don't currently have a neat interface to build a compound 
> predicate-expression. Will add this separately, if required.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-12364) Distcp job fails when run under Tez

2015-12-08 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-12364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047682#comment-15047682
 ] 

Mithun Radhakrishnan commented on HIVE-12364:
-

Brilliant! Thank you, Prasanth. I'll check on these bugs for our internal 
branch. 

IMHO, given that they solve problems with an officially released version, we 
really should consider pulling them into branch-1. 

> Distcp job fails when run under Tez
> ---
>
> Key: HIVE-12364
> URL: https://issues.apache.org/jira/browse/HIVE-12364
> Project: Hive
>  Issue Type: Bug
>  Components: Tez
>Affects Versions: 1.3.0, 2.0.0
>Reporter: Prasanth Jayachandran
>Assignee: Prasanth Jayachandran
>Priority: Critical
> Fix For: 2.0.0
>
> Attachments: HIVE-12364-branch-1.patch, HIVE-12364.patch
>
>
> PROBLEM:
> insert into/overwrite directory '/path' invokes distcp for moveTask and fails
> query when execution engine is Tez 
> set hive.exec.copyfile.maxsize=4;
> insert overwrite into '/tmp/testinser' select * from customer;
> failed at moveTask
> hive client log:
> {code}
> 2015-11-05 16:02:53,254 INFO  [main]: exec.FileSinkOperator 
> (Utilities.java:mvFileToFinalPath(1882)) - Moving tmp dir: 
> hdfs://hdpsecehdfs/tmp/testindir/.hive-staging_hive_2015-11-05_15-59-44_557_1429894387987411483-1/_tmp.-ext-1
>  to: 
> hdfs://hdpsecehdfs/tmp/testindir/.hive-staging_hive_2015-11-05_15-59-44_557_1429894387987411483-1/-ext-1
> 2015-11-05 16:02:53,611 INFO  [main]: log.PerfLogger 
> (PerfLogger.java:PerfLogBegin(121)) -  method=task.DEPENDENCY_COLLECTION.Stage-2 
> from=org.apache.hadoop.hive.ql.Driver>
> 2015-11-05 16:02:53,612 INFO  [main]: ql.Driver 
> (Driver.java:launchTask(1653)) - Starting task 
> [Stage-2:DEPENDENCY_COLLECTION] in serial mode
> 2015-11-05 16:02:53,612 INFO  [main]: log.PerfLogger 
> (PerfLogger.java:PerfLogBegin(121)) -  from=org.apache.hadoop.hive.ql.Driver>
> 2015-11-05 16:02:53,612 INFO  [main]: ql.Driver 
> (Driver.java:launchTask(1653)) - Starting task [Stage-0:MOVE] in serial mode
> 2015-11-05 16:02:53,612 INFO  [main]: exec.Task 
> (SessionState.java:printInfo(951)) - Moving data to: /tmp/testindir from 
> hdfs://hdpsecehdfs/tmp/testindir/.hive-staging_hive_2015-11-05_15-59-44_557_1429894387987411483-1/-ext-1
> 2015-11-05 16:02:53,637 INFO  [main]: common.FileUtils 
> (FileUtils.java:copy(551)) - Source is 491763261 bytes. (MAX: 4)
> 2015-11-05 16:02:53,638 INFO  [main]: common.FileUtils 
> (FileUtils.java:copy(552)) - Launch distributed copy (distcp) job.
> 2015-11-05 16:03:03,924 INFO  [main]: impl.TimelineClientImpl 
> (TimelineClientImpl.java:serviceInit(296)) - Timeline service address: 
> http://hdpsece02.sece.hwxsup.com:8188/ws/v1/timeline/
> 2015-11-05 16:03:04,081 INFO  [main]: impl.TimelineClientImpl 
> (TimelineClientImpl.java:serviceInit(296)) - Timeline service address: 
> http://hdpsece02.sece.hwxsup.com:8188/ws/v1/timeline/
> 2015-11-05 16:03:20,210 INFO  [main]: hdfs.DFSClient 
> (DFSClient.java:getDelegationToken(1047)) - Created HDFS_DELEGATION_TOKEN 
> token 1069 for haha on ha-hdfs:hdpsecehdfs
> 2015-11-05 16:03:20,249 INFO  [main]: security.TokenCache 
> (TokenCache.java:obtainTokensForNamenodesInternal(125)) - Got dt for 
> hdfs://hdpsecehdfs; Kind: HDFS_DELEGATION_TOKEN, Service: 
> ha-hdfs:hdpsecehdfs, Ident: (HDFS_DELEGATION_TOKEN token 1069 for haha)
> 2015-11-05 16:03:20,250 WARN  [main]: token.Token 
> (Token.java:getClassForIdentifier(121)) - Cannot find class for token kind 
> kms-dt
> 2015-11-05 16:03:20,250 INFO  [main]: security.TokenCache 
> (TokenCache.java:obtainTokensForNamenodesInternal(125)) - Got dt for 
> hdfs://hdpsecehdfs; Kind: kms-dt, Service: 172.25.17.102:9292, Ident: 00 04 
> 68 61 68 61 02 72 6d 00 8a 01 50 da 1a ca 29 8a 01 50 fe 27 4e 29 03 02
> 2015-11-05 16:03:22,561 INFO  [main]: Configuration.deprecation 
> (Configuration.java:warnOnceIfDeprecated(1173)) - io.sort.mb is deprecated. 
> Instead, use mapreduce.task.io.sort.mb
> 2015-11-05 16:03:22,562 INFO  [main]: Configuration.deprecation 
> (Configuration.java:warnOnceIfDeprecated(1173)) - io.sort.factor is 
> deprecated. Instead, use mapreduce.task.io.sort.factor
> 2015-11-05 16:03:33,733 ERROR [main]: exec.Task 
> (SessionState.java:printError(960)) - Failed with exception Unable to move 
> source 
> hdfs://hdpsecehdfs/tmp/testindir/.hive-staging_hive_2015-11-05_15-59-44_557_1429894387987411483-1/-ext-1
>  to destination /tmp/testindir
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://hdpsecehdfs/tmp/testindir/.hive-staging_hive_2015-11-05_15-59-44_557_1429894387987411483-1/-ext-1
>  to destination /tmp/testindir
> at org.apache.hadoop.hive.ql.metadata.Hive.moveFile(Hive.java:2665)
> at 

[jira] [Commented] (HIVE-12364) Distcp job fails when run under Tez

2015-12-08 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-12364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047561#comment-15047561
 ] 

Mithun Radhakrishnan commented on HIVE-12364:
-

+1 for backport to branch-1.

> Distcp job fails when run under Tez
> ---
>
> Key: HIVE-12364
> URL: https://issues.apache.org/jira/browse/HIVE-12364
> Project: Hive
>  Issue Type: Bug
>  Components: Tez
>Affects Versions: 1.3.0, 2.0.0
>Reporter: Prasanth Jayachandran
>Assignee: Prasanth Jayachandran
>Priority: Critical
> Fix For: 2.0.0
>
> Attachments: HIVE-12364-branch-1.patch, HIVE-12364.patch
>
>
> PROBLEM:
> insert into/overwrite directory '/path' invokes distcp for moveTask and fails
> query when execution engine is Tez 
> set hive.exec.copyfile.maxsize=4;
> insert overwrite into '/tmp/testinser' select * from customer;
> failed at moveTask
> hive client log:
> {code}
> 2015-11-05 16:02:53,254 INFO  [main]: exec.FileSinkOperator 
> (Utilities.java:mvFileToFinalPath(1882)) - Moving tmp dir: 
> hdfs://hdpsecehdfs/tmp/testindir/.hive-staging_hive_2015-11-05_15-59-44_557_1429894387987411483-1/_tmp.-ext-1
>  to: 
> hdfs://hdpsecehdfs/tmp/testindir/.hive-staging_hive_2015-11-05_15-59-44_557_1429894387987411483-1/-ext-1
> 2015-11-05 16:02:53,611 INFO  [main]: log.PerfLogger 
> (PerfLogger.java:PerfLogBegin(121)) -  method=task.DEPENDENCY_COLLECTION.Stage-2 
> from=org.apache.hadoop.hive.ql.Driver>
> 2015-11-05 16:02:53,612 INFO  [main]: ql.Driver 
> (Driver.java:launchTask(1653)) - Starting task 
> [Stage-2:DEPENDENCY_COLLECTION] in serial mode
> 2015-11-05 16:02:53,612 INFO  [main]: log.PerfLogger 
> (PerfLogger.java:PerfLogBegin(121)) -  from=org.apache.hadoop.hive.ql.Driver>
> 2015-11-05 16:02:53,612 INFO  [main]: ql.Driver 
> (Driver.java:launchTask(1653)) - Starting task [Stage-0:MOVE] in serial mode
> 2015-11-05 16:02:53,612 INFO  [main]: exec.Task 
> (SessionState.java:printInfo(951)) - Moving data to: /tmp/testindir from 
> hdfs://hdpsecehdfs/tmp/testindir/.hive-staging_hive_2015-11-05_15-59-44_557_1429894387987411483-1/-ext-1
> 2015-11-05 16:02:53,637 INFO  [main]: common.FileUtils 
> (FileUtils.java:copy(551)) - Source is 491763261 bytes. (MAX: 4)
> 2015-11-05 16:02:53,638 INFO  [main]: common.FileUtils 
> (FileUtils.java:copy(552)) - Launch distributed copy (distcp) job.
> 2015-11-05 16:03:03,924 INFO  [main]: impl.TimelineClientImpl 
> (TimelineClientImpl.java:serviceInit(296)) - Timeline service address: 
> http://hdpsece02.sece.hwxsup.com:8188/ws/v1/timeline/
> 2015-11-05 16:03:04,081 INFO  [main]: impl.TimelineClientImpl 
> (TimelineClientImpl.java:serviceInit(296)) - Timeline service address: 
> http://hdpsece02.sece.hwxsup.com:8188/ws/v1/timeline/
> 2015-11-05 16:03:20,210 INFO  [main]: hdfs.DFSClient 
> (DFSClient.java:getDelegationToken(1047)) - Created HDFS_DELEGATION_TOKEN 
> token 1069 for haha on ha-hdfs:hdpsecehdfs
> 2015-11-05 16:03:20,249 INFO  [main]: security.TokenCache 
> (TokenCache.java:obtainTokensForNamenodesInternal(125)) - Got dt for 
> hdfs://hdpsecehdfs; Kind: HDFS_DELEGATION_TOKEN, Service: 
> ha-hdfs:hdpsecehdfs, Ident: (HDFS_DELEGATION_TOKEN token 1069 for haha)
> 2015-11-05 16:03:20,250 WARN  [main]: token.Token 
> (Token.java:getClassForIdentifier(121)) - Cannot find class for token kind 
> kms-dt
> 2015-11-05 16:03:20,250 INFO  [main]: security.TokenCache 
> (TokenCache.java:obtainTokensForNamenodesInternal(125)) - Got dt for 
> hdfs://hdpsecehdfs; Kind: kms-dt, Service: 172.25.17.102:9292, Ident: 00 04 
> 68 61 68 61 02 72 6d 00 8a 01 50 da 1a ca 29 8a 01 50 fe 27 4e 29 03 02
> 2015-11-05 16:03:22,561 INFO  [main]: Configuration.deprecation 
> (Configuration.java:warnOnceIfDeprecated(1173)) - io.sort.mb is deprecated. 
> Instead, use mapreduce.task.io.sort.mb
> 2015-11-05 16:03:22,562 INFO  [main]: Configuration.deprecation 
> (Configuration.java:warnOnceIfDeprecated(1173)) - io.sort.factor is 
> deprecated. Instead, use mapreduce.task.io.sort.factor
> 2015-11-05 16:03:33,733 ERROR [main]: exec.Task 
> (SessionState.java:printError(960)) - Failed with exception Unable to move 
> source 
> hdfs://hdpsecehdfs/tmp/testindir/.hive-staging_hive_2015-11-05_15-59-44_557_1429894387987411483-1/-ext-1
>  to destination /tmp/testindir
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://hdpsecehdfs/tmp/testindir/.hive-staging_hive_2015-11-05_15-59-44_557_1429894387987411483-1/-ext-1
>  to destination /tmp/testindir
> at org.apache.hadoop.hive.ql.metadata.Hive.moveFile(Hive.java:2665)
> at org.apache.hadoop.hive.ql.exec.MoveTask.moveFile(MoveTask.java:105)
> at org.apache.hadoop.hive.ql.exec.MoveTask.execute(MoveTask.java:222)
> at 

[jira] [Commented] (HIVE-12364) Distcp job fails when run under Tez

2015-12-08 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-12364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047580#comment-15047580
 ] 

Mithun Radhakrishnan commented on HIVE-12364:
-

Hey, [~prasanth_j]. I'd be keen on knowing what other patches this might need. 
Were these also critical bug-fixes?

This patch seems pretty self-contained. It's the result of Hive's own 
InputFormats implementing the old API, while DistCp uses the new MR APIs. We ran 
into this problem today. I also ran into a similar problem when introducing a 
custom MR job for something else I'm working on. (I'll explain that in another 
JIRA.)



> Distcp job fails when run under Tez
> ---
>
> Key: HIVE-12364
> URL: https://issues.apache.org/jira/browse/HIVE-12364
> Project: Hive
>  Issue Type: Bug
>  Components: Tez
>Affects Versions: 1.3.0, 2.0.0
>Reporter: Prasanth Jayachandran
>Assignee: Prasanth Jayachandran
>Priority: Critical
> Fix For: 2.0.0
>
> Attachments: HIVE-12364-branch-1.patch, HIVE-12364.patch
>
>
> PROBLEM:
> insert into/overwrite directory '/path' invokes distcp for moveTask and fails
> query when execution engine is Tez 
> set hive.exec.copyfile.maxsize=4;
> insert overwrite into '/tmp/testinser' select * from customer;
> failed at moveTask
> hive client log:
> {code}
> 2015-11-05 16:02:53,254 INFO  [main]: exec.FileSinkOperator 
> (Utilities.java:mvFileToFinalPath(1882)) - Moving tmp dir: 
> hdfs://hdpsecehdfs/tmp/testindir/.hive-staging_hive_2015-11-05_15-59-44_557_1429894387987411483-1/_tmp.-ext-1
>  to: 
> hdfs://hdpsecehdfs/tmp/testindir/.hive-staging_hive_2015-11-05_15-59-44_557_1429894387987411483-1/-ext-1
> 2015-11-05 16:02:53,611 INFO  [main]: log.PerfLogger 
> (PerfLogger.java:PerfLogBegin(121)) -  method=task.DEPENDENCY_COLLECTION.Stage-2 
> from=org.apache.hadoop.hive.ql.Driver>
> 2015-11-05 16:02:53,612 INFO  [main]: ql.Driver 
> (Driver.java:launchTask(1653)) - Starting task 
> [Stage-2:DEPENDENCY_COLLECTION] in serial mode
> 2015-11-05 16:02:53,612 INFO  [main]: log.PerfLogger 
> (PerfLogger.java:PerfLogBegin(121)) -  from=org.apache.hadoop.hive.ql.Driver>
> 2015-11-05 16:02:53,612 INFO  [main]: ql.Driver 
> (Driver.java:launchTask(1653)) - Starting task [Stage-0:MOVE] in serial mode
> 2015-11-05 16:02:53,612 INFO  [main]: exec.Task 
> (SessionState.java:printInfo(951)) - Moving data to: /tmp/testindir from 
> hdfs://hdpsecehdfs/tmp/testindir/.hive-staging_hive_2015-11-05_15-59-44_557_1429894387987411483-1/-ext-1
> 2015-11-05 16:02:53,637 INFO  [main]: common.FileUtils 
> (FileUtils.java:copy(551)) - Source is 491763261 bytes. (MAX: 4)
> 2015-11-05 16:02:53,638 INFO  [main]: common.FileUtils 
> (FileUtils.java:copy(552)) - Launch distributed copy (distcp) job.
> 2015-11-05 16:03:03,924 INFO  [main]: impl.TimelineClientImpl 
> (TimelineClientImpl.java:serviceInit(296)) - Timeline service address: 
> http://hdpsece02.sece.hwxsup.com:8188/ws/v1/timeline/
> 2015-11-05 16:03:04,081 INFO  [main]: impl.TimelineClientImpl 
> (TimelineClientImpl.java:serviceInit(296)) - Timeline service address: 
> http://hdpsece02.sece.hwxsup.com:8188/ws/v1/timeline/
> 2015-11-05 16:03:20,210 INFO  [main]: hdfs.DFSClient 
> (DFSClient.java:getDelegationToken(1047)) - Created HDFS_DELEGATION_TOKEN 
> token 1069 for haha on ha-hdfs:hdpsecehdfs
> 2015-11-05 16:03:20,249 INFO  [main]: security.TokenCache 
> (TokenCache.java:obtainTokensForNamenodesInternal(125)) - Got dt for 
> hdfs://hdpsecehdfs; Kind: HDFS_DELEGATION_TOKEN, Service: 
> ha-hdfs:hdpsecehdfs, Ident: (HDFS_DELEGATION_TOKEN token 1069 for haha)
> 2015-11-05 16:03:20,250 WARN  [main]: token.Token 
> (Token.java:getClassForIdentifier(121)) - Cannot find class for token kind 
> kms-dt
> 2015-11-05 16:03:20,250 INFO  [main]: security.TokenCache 
> (TokenCache.java:obtainTokensForNamenodesInternal(125)) - Got dt for 
> hdfs://hdpsecehdfs; Kind: kms-dt, Service: 172.25.17.102:9292, Ident: 00 04 
> 68 61 68 61 02 72 6d 00 8a 01 50 da 1a ca 29 8a 01 50 fe 27 4e 29 03 02
> 2015-11-05 16:03:22,561 INFO  [main]: Configuration.deprecation 
> (Configuration.java:warnOnceIfDeprecated(1173)) - io.sort.mb is deprecated. 
> Instead, use mapreduce.task.io.sort.mb
> 2015-11-05 16:03:22,562 INFO  [main]: Configuration.deprecation 
> (Configuration.java:warnOnceIfDeprecated(1173)) - io.sort.factor is 
> deprecated. Instead, use mapreduce.task.io.sort.factor
> 2015-11-05 16:03:33,733 ERROR [main]: exec.Task 
> (SessionState.java:printError(960)) - Failed with exception Unable to move 
> source 
> hdfs://hdpsecehdfs/tmp/testindir/.hive-staging_hive_2015-11-05_15-59-44_557_1429894387987411483-1/-ext-1
>  to destination /tmp/testindir
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> 

[jira] [Updated] (HIVE-11470) NPE in DynamicPartFileRecordWriterContainer on null part-keys.

2015-12-22 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-11470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-11470:

Attachment: HIVE-11470.2.patch

Here's the proper fix. Submitting for a re-test.

> NPE in DynamicPartFileRecordWriterContainer on null part-keys.
> --
>
> Key: HIVE-11470
> URL: https://issues.apache.org/jira/browse/HIVE-11470
> Project: Hive
>  Issue Type: Bug
>  Components: HCatalog
>Affects Versions: 1.2.0
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-11470.1.patch, HIVE-11470.2.patch
>
>
> When partitioning data using {{HCatStorer}}, one sees the following NPE, if 
> the dyn-part-key is of null-value:
> {noformat}
> 2015-07-30 23:59:59,627 WARN [main] org.apache.hadoop.mapred.YarnChild: 
> Exception running child : java.io.IOException: java.lang.NullPointerException
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.runPipeline(PigGenericMapReduce.java:473)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.processOnePackageOutput(PigGenericMapReduce.java:436)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:416)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:256)
> at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
> at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1694)
> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hive.hcatalog.mapreduce.DynamicPartitionFileRecordWriterContainer.getLocalFileWriter(DynamicPartitionFileRecordWriterContainer.java:141)
> at 
> org.apache.hive.hcatalog.mapreduce.FileRecordWriterContainer.write(FileRecordWriterContainer.java:110)
> at 
> org.apache.hive.hcatalog.mapreduce.FileRecordWriterContainer.write(FileRecordWriterContainer.java:54)
> at 
> org.apache.hive.hcatalog.pig.HCatBaseStorer.putNext(HCatBaseStorer.java:309)
> at org.apache.hive.hcatalog.pig.HCatStorer.putNext(HCatStorer.java:61)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:139)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:98)
> at 
> org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:558)
> at 
> org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
> at 
> org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.write(WrappedReducer.java:105)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.runPipeline(PigGenericMapReduce.java:471)
> ... 11 more
> {noformat}
> The reason is that the {{DynamicPartitionFileRecordWriterContainer}} makes an 
> unfortunate assumption when fetching a local file-writer instance:
> {code:title=DynamicPartitionFileRecordWriterContainer.java}
>   @Override
>   protected LocalFileWriter getLocalFileWriter(HCatRecord value) 
> throws IOException, HCatException {
> 
> OutputJobInfo localJobInfo = null;
> // Calculate which writer to use from the remaining values - this needs to
> // be done before we delete cols.
> List<String> dynamicPartValues = new ArrayList<String>();
> for (Integer colToAppend : dynamicPartCols) {
>   dynamicPartValues.add(value.get(colToAppend).toString()); // <-- YIKES!
> }
> ...
>   }
> {code}
> Must check for null, and substitute with 
> {{"\_\_HIVE_DEFAULT_PARTITION\_\_"}}, or equivalent.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-11548) HCatLoader should support predicate pushdown.

2015-12-22 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-11548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-11548:

Attachment: HIVE-11548.3.patch

> HCatLoader should support predicate pushdown.
> -
>
> Key: HIVE-11548
> URL: https://issues.apache.org/jira/browse/HIVE-11548
> Project: Hive
>  Issue Type: New Feature
>  Components: HCatalog
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-11548.1.patch, HIVE-11548.2.patch, 
> HIVE-11548.3.patch
>
>
> When one uses {{HCatInputFormat}}/{{HCatLoader}} to read from file-formats 
> that support predicate pushdown (such as ORC, with 
> {{hive.optimize.index.filter=true}}), one sees that the predicates aren't 
> actually pushed down into the storage layer.
> The forthcoming patch should allow for filter-pushdown, if any of the 
> partitions being scanned with {{HCatLoader}} support the functionality. The 
> patch should technically allow the same for users of {{HCatInputFormat}}, but 
> I don't currently have a neat interface to build a compound 
> predicate-expression. Will add this separately, if required.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-11470) NPE in DynamicPartFileRecordWriterContainer on null part-keys.

2016-02-12 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15145124#comment-15145124
 ] 

Mithun Radhakrishnan commented on HIVE-11470:
-

Thanks for working on this, [~sushanth]!

> NPE in DynamicPartFileRecordWriterContainer on null part-keys.
> --
>
> Key: HIVE-11470
> URL: https://issues.apache.org/jira/browse/HIVE-11470
> Project: Hive
>  Issue Type: Bug
>  Components: HCatalog
>Affects Versions: 1.2.0
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Fix For: 2.0.0, 1.2.2, 2.1.0
>
> Attachments: HIVE-11470.1.patch, HIVE-11470.2.patch
>
>
> When partitioning data using {{HCatStorer}}, one sees the following NPE, if 
> the dyn-part-key is of null-value:
> {noformat}
> 2015-07-30 23:59:59,627 WARN [main] org.apache.hadoop.mapred.YarnChild: 
> Exception running child : java.io.IOException: java.lang.NullPointerException
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.runPipeline(PigGenericMapReduce.java:473)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.processOnePackageOutput(PigGenericMapReduce.java:436)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:416)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:256)
> at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
> at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1694)
> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hive.hcatalog.mapreduce.DynamicPartitionFileRecordWriterContainer.getLocalFileWriter(DynamicPartitionFileRecordWriterContainer.java:141)
> at 
> org.apache.hive.hcatalog.mapreduce.FileRecordWriterContainer.write(FileRecordWriterContainer.java:110)
> at 
> org.apache.hive.hcatalog.mapreduce.FileRecordWriterContainer.write(FileRecordWriterContainer.java:54)
> at 
> org.apache.hive.hcatalog.pig.HCatBaseStorer.putNext(HCatBaseStorer.java:309)
> at org.apache.hive.hcatalog.pig.HCatStorer.putNext(HCatStorer.java:61)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:139)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:98)
> at 
> org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:558)
> at 
> org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
> at 
> org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.write(WrappedReducer.java:105)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.runPipeline(PigGenericMapReduce.java:471)
> ... 11 more
> {noformat}
> The reason is that the {{DynamicPartitionFileRecordWriterContainer}} makes an 
> unfortunate assumption when fetching a local file-writer instance:
> {code:title=DynamicPartitionFileRecordWriterContainer.java}
>   @Override
>   protected LocalFileWriter getLocalFileWriter(HCatRecord value) 
> throws IOException, HCatException {
> 
> OutputJobInfo localJobInfo = null;
> // Calculate which writer to use from the remaining values - this needs to
> // be done before we delete cols.
> List<String> dynamicPartValues = new ArrayList<String>();
> for (Integer colToAppend : dynamicPartCols) {
>   dynamicPartValues.add(value.get(colToAppend).toString()); // <-- YIKES!
> }
> ...
>   }
> {code}
> Must check for null, and substitute with 
> {{"\_\_HIVE_DEFAULT_PARTITION\_\_"}}, or equivalent.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-13370) Add test for HIVE-11470

2016-03-28 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15214840#comment-15214840
 ] 

Mithun Radhakrishnan commented on HIVE-13370:
-

Thanks for adding the test, [~sushanth]. 
+1. Looks good.

> Add test for HIVE-11470
> ---
>
> Key: HIVE-13370
> URL: https://issues.apache.org/jira/browse/HIVE-13370
> Project: Hive
>  Issue Type: Bug
>Reporter: Sushanth Sowmyan
>Assignee: Sushanth Sowmyan
> Attachments: HIVE-13370.patch
>
>
> HIVE-11470 added capability to handle NULL dynamic partitioning keys 
> properly. However, it did not add a test for the case, we should have one so 
> we don't have future regressions of the same.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-12158) Add methods to HCatClient for partition synchronization

2016-03-19 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-12158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15198449#comment-15198449
 ] 

Mithun Radhakrishnan commented on HIVE-12158:
-

[~sushanth], [~nahguam], sorry for the delay.

I agree with the spirit of this patch. Thank you for working on this. (I just 
came across a user who needed this as well.)

Please change HCatClientHMSImpl.java::Line#537 to compare table-name instead of 
db-name. I'll take a closer look at your tests. (Tests! Thank you!)
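
As a sanity check of the proposed interface, a hypothetical caller might look like the following. {{updatePartitions()}} is the new method proposed in this issue; the database/table names are placeholders; {{HCatClient.create()}}, {{getPartitions()}} and {{close()}} are existing {{HCatClient}} calls.

{code:java}
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hive.hcatalog.api.HCatClient;
import org.apache.hive.hcatalog.api.HCatPartition;

public class UpdatePartitionsExample {
  public static void main(String[] args) throws Exception {
    HCatClient client = HCatClient.create(new Configuration());
    try {
      // Fetch the partitions that the external batch job rewrote.
      List<HCatPartition> partitions = client.getPartitions("my_db", "my_table");
      // ... adjust partition metadata to match the newly written data ...
      // Proposed API: push the updated metadata back to the metastore in one call.
      client.updatePartitions(partitions);
    } finally {
      client.close();
    }
  }
}
{code}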

> Add methods to HCatClient for partition synchronization
> ---
>
> Key: HIVE-12158
> URL: https://issues.apache.org/jira/browse/HIVE-12158
> Project: Hive
>  Issue Type: Improvement
>  Components: HCatalog
>Affects Versions: 2.0.0
>Reporter: David Maughan
>Assignee: David Maughan
>Priority: Minor
>  Labels: hcatalog
> Attachments: HIVE-12158.1.patch
>
>
> We have a use case where we have a list of partitions that are created as a 
> result of a batch job (new or updated) outside of Hive and would like to 
> synchronize them with the Hive MetaStore. We would like to use the HCatalog 
> {{HCatClient}} but it currently does not seem to support this. However it is 
> possible with the {{HiveMetaStoreClient}} directly. I am proposing to add the 
> following method to {{HCatClient}} and {{HCatClientHMSImpl}}:
> A method for altering partitions. The implementation would delegate to 
> {{HiveMetaStoreClient#alter_partitions}}. I've used "update" instead of 
> "alter" in the name so it's consistent with the 
> {{HCatClient#updateTableSchema}} method.
> {code}
> public void updatePartitions(List<HCatPartition> partitions) throws 
> HCatException
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-13509) HCatalog getSplits should ignore the partition with invalid path

2016-04-25 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15256936#comment-15256936
 ] 

Mithun Radhakrishnan commented on HIVE-13509:
-

Sorry for delaying you on this. If I don't have feedback for you tomorrow, 
please go ahead and check in as is. I'll trust [~szehon]'s review. :] Thanks 
for keeping the default behavior. 

> HCatalog getSplits should ignore the partition with invalid path
> 
>
> Key: HIVE-13509
> URL: https://issues.apache.org/jira/browse/HIVE-13509
> Project: Hive
>  Issue Type: Improvement
>  Components: HCatalog
>Reporter: Chaoyu Tang
>Assignee: Chaoyu Tang
> Attachments: HIVE-13509.1.patch, HIVE-13509.patch
>
>
> It is quite common that there is a discrepancy between a partition directory 
> and its HMS metadata, simply because the directory could be added/deleted 
> externally using hdfs shell command. Technically it should be fixed by MSCK 
> and alter table .. add/drop command etc, but sometimes it might not be 
> practical especially in a multi-tenant env. This discrepancy does not cause 
> any problem to Hive, Hive returns no rows for a partition with an invalid 
> (e.g. non-existing) path, but it fails the Pig load with HCatLoader, because 
> the HCatBaseInputFormat getSplits throws an error when getting a split for a 
> non-existing path. The error message might look like:
> {code}
> Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does 
> not exist: 
> hdfs://xyz.com:8020/user/hive/warehouse/xyz/date=2016-01-01/country=BR
>   at 
> org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
>   at 
> org.apache.hive.hcatalog.mapreduce.HCatBaseInputFormat.getSplits(HCatBaseInputFormat.java:162)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:274)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-13509) HCatalog getSplits should ignore the partition with invalid path

2016-04-26 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15258454#comment-15258454
 ] 

Mithun Radhakrishnan commented on HIVE-13509:
-

Reviewing your patch now. On the face of it, it looks good. Looking at it a 
little more closely...

A couple of observations:
# {{hcat.input.ignore.invalid.path}} is well-named, and would make sense to 
anyone who'd want to override the default. (I thought we'd go with 
{{hcat.input.allow.invalid.path=true}}, but your version is better.)
# Consider replacing {{(pathString == null || pathString.trim().isEmpty())}} 
with {{StringUtils.isBlank(pathString)}}. (See the sketch after this list.)
# Nitpick: Consider replacing the loop at {{HCatBaseInputFormat.java:Line#335}} 
with Google Guava's {{Iterators.filter()}}. Then, depending on whether 
{{ignoreInvalidPath}} is set, the erstwhile loop at Line#329 will either loop 
on {{paths}} or on {{filteredPaths}}. This will be more readable.
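
A minimal illustration of point 2; the class and method names here are for illustration only, and the two checks are equivalent:

{code:java}
import org.apache.commons.lang.StringUtils; // commons-lang 2.x; lang3 offers the same method

public class BlankCheckExample {
  // Before: the patch's explicit null/empty check.
  static boolean isInvalidVerbose(String pathString) {
    return pathString == null || pathString.trim().isEmpty();
  }

  // After: the suggested, equivalent one-liner.
  static boolean isInvalidConcise(String pathString) {
    return StringUtils.isBlank(pathString);
  }
}
{code}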

> HCatalog getSplits should ignore the partition with invalid path
> 
>
> Key: HIVE-13509
> URL: https://issues.apache.org/jira/browse/HIVE-13509
> Project: Hive
>  Issue Type: Improvement
>  Components: HCatalog
>Reporter: Chaoyu Tang
>Assignee: Chaoyu Tang
> Attachments: HIVE-13509.1.patch, HIVE-13509.patch
>
>
> It is quite common that there is a discrepancy between a partition directory 
> and its HMS metadata, simply because the directory could be added/deleted 
> externally using hdfs shell command. Technically it should be fixed by MSCK 
> and alter table .. add/drop command etc, but sometimes it might not be 
> practical especially in a multi-tenant env. This discrepancy does not cause 
> any problem to Hive, Hive returns no rows for a partition with an invalid 
> (e.g. non-existing) path, but it fails the Pig load with HCatLoader, because 
> the HCatBaseInputFormat getSplits throws an error when getting a split for a 
> non-existing path. The error message might look like:
> {code}
> Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does 
> not exist: 
> hdfs://xyz.com:8020/user/hive/warehouse/xyz/date=2016-01-01/country=BR
>   at 
> org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
>   at 
> org.apache.hive.hcatalog.mapreduce.HCatBaseInputFormat.getSplits(HCatBaseInputFormat.java:162)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:274)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-13509) HCatalog getSplits should ignore the partition with invalid path

2016-04-26 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15258527#comment-15258527
 ] 

Mithun Radhakrishnan commented on HIVE-13509:
-

bq. ... with Google Guava's {{Iterators.filter()}}.
Actually, please ignore comment#3, above. 

I was trying to avoid checking {{ignoreInvalidPath}} multiple times. I tried 
writing it out myself (to illustrate), and saw that the call to 
{{fs.makeQualified()}} implies that we'll need to use both 
{{Iterators.filter()}} and {{Iterators.transform}}, at which point, it's no 
longer short and sweet. 

Please fix #2 above, and I will +1.

Also, thanks for adding tests.

> HCatalog getSplits should ignore the partition with invalid path
> 
>
> Key: HIVE-13509
> URL: https://issues.apache.org/jira/browse/HIVE-13509
> Project: Hive
>  Issue Type: Improvement
>  Components: HCatalog
>Reporter: Chaoyu Tang
>Assignee: Chaoyu Tang
> Attachments: HIVE-13509.1.patch, HIVE-13509.patch
>
>
> It is quite common that there is a discrepancy between a partition directory 
> and its HMS metadata, simply because the directory could be added/deleted 
> externally using hdfs shell command. Technically it should be fixed by MSCK 
> and alter table .. add/drop command etc, but sometimes it might not be 
> practical especially in a multi-tenant env. This discrepancy does not cause 
> any problem to Hive, Hive returns no rows for a partition with an invalid 
> (e.g. non-existing) path, but it fails the Pig load with HCatLoader, because 
> the HCatBaseInputFormat getSplits throws an error when getting a split for a 
> non-existing path. The error message might look like:
> {code}
> Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does 
> not exist: 
> hdfs://xyz.com:8020/user/hive/warehouse/xyz/date=2016-01-01/country=BR
>   at 
> org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
>   at 
> org.apache.hive.hcatalog.mapreduce.HCatBaseInputFormat.getSplits(HCatBaseInputFormat.java:162)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:274)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-13509) HCatalog getSplits should ignore the partition with invalid path

2016-04-14 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15242058#comment-15242058
 ] 

Mithun Radhakrishnan commented on HIVE-13509:
-

I knew this would be a sticking point with the Pig folks. ([~rohini], et al.) 
I'm afraid I agree with their assessment as well. 

Changing the default behaviour of {{HCatLoader}} to break Pig semantics would 
be incorrect, and would hide problems with missing data. We've run into 
failures/bugs in the {{FileOutputCommitterContainer}} that thankfully didn't 
propagate downstream, thanks to the current behaviour.

Can we keep the default behaviour, with a client-side option to ignore missing 
data?

> HCatalog getSplits should ignore the partition with invalid path
> 
>
> Key: HIVE-13509
> URL: https://issues.apache.org/jira/browse/HIVE-13509
> Project: Hive
>  Issue Type: Improvement
>  Components: HCatalog
>Reporter: Chaoyu Tang
>Assignee: Chaoyu Tang
> Attachments: HIVE-13509.patch
>
>
> It is quite common that there is a discrepancy between a partition directory 
> and its HMS metadata, simply because the directory could be added/deleted 
> externally using hdfs shell command. Technically it should be fixed by MSCK 
> and alter table .. add/drop command etc, but sometimes it might not be 
> practical especially in a multi-tenant env. This discrepancy does not cause 
> any problem to Hive, Hive returns no rows for a partition with an invalid 
> (e.g. non-existing) path, but it fails the Pig load with HCatLoader, because 
> the HCatBaseInputFormat getSplits throws an error when getting a split for a 
> non-existing path. The error message might look like:
> {code}
> Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does 
> not exist: 
> hdfs://xyz.com:8020/user/hive/warehouse/xyz/date=2016-01-01/country=BR
>   at 
> org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
>   at 
> org.apache.hive.hcatalog.mapreduce.HCatBaseInputFormat.getSplits(HCatBaseInputFormat.java:162)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:274)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-13509) HCatalog getSplits should ignore the partition with invalid path

2016-04-15 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15243789#comment-15243789
 ] 

Mithun Radhakrishnan commented on HIVE-13509:
-

I'm stuck on production-support, at the moment. I'd review this on Monday. 
Sorry for the delay.

> HCatalog getSplits should ignore the partition with invalid path
> 
>
> Key: HIVE-13509
> URL: https://issues.apache.org/jira/browse/HIVE-13509
> Project: Hive
>  Issue Type: Improvement
>  Components: HCatalog
>Reporter: Chaoyu Tang
>Assignee: Chaoyu Tang
> Attachments: HIVE-13509.1.patch, HIVE-13509.patch
>
>
> It is quite common that there is a discrepancy between a partition directory 
> and its HMS metadata, simply because the directory could be added/deleted 
> externally using hdfs shell command. Technically it should be fixed by MSCK 
> and alter table .. add/drop command etc, but sometimes it might not be 
> practical especially in a multi-tenant env. This discrepancy does not cause 
> any problem to Hive, Hive returns no rows for a partition with an invalid 
> (e.g. non-existing) path, but it fails the Pig load with HCatLoader, because 
> the HCatBaseInputFormat getSplits throws an error when getting a split for a 
> non-existing path. The error message might look like:
> {code}
> Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does 
> not exist: 
> hdfs://xyz.com:8020/user/hive/warehouse/xyz/date=2016-01-01/country=BR
>   at 
> org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
>   at 
> org.apache.hive.hcatalog.mapreduce.HCatBaseInputFormat.getSplits(HCatBaseInputFormat.java:162)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:274)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-13509) HCatalog getSplits should ignore the partition with invalid path

2016-04-13 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240190#comment-15240190
 ] 

Mithun Radhakrishnan commented on HIVE-13509:
-

+[~daijy], [~rohini].

One possible concern is the disconnect between Hive and Pig:
# When one attempts to consume a *non-existent* directory (i.e. not just an 
empty directory) through Pig, one gets a failure.
# When one attempts to consume a non-existent partition (e.g. 
{{dt='3016-04-13'}}) in Hive, via  an unsatisfied partition-predicate, the 
query runs successfully (and returns nothing).

In ETL jobs using Pig, we might actually prefer a failure when the input data 
isn't available. Wouldn't this fix break those semantics for Pig?

> HCatalog getSplits should ignore the partition with invalid path
> 
>
> Key: HIVE-13509
> URL: https://issues.apache.org/jira/browse/HIVE-13509
> Project: Hive
>  Issue Type: Improvement
>  Components: HCatalog
>Reporter: Chaoyu Tang
>Assignee: Chaoyu Tang
> Attachments: HIVE-13509.patch
>
>
> It is quite common that there is a discrepancy between a partition directory 
> and its HMS metadata, simply because the directory could be added/deleted 
> externally using hdfs shell command. Technically it should be fixed by MSCK 
> and alter table .. add/drop command etc, but sometimes it might not be 
> practical especially in a multi-tenant env. This discrepancy does not cause 
> any problem to Hive, Hive returns no rows for a partition with an invalid 
> (e.g. non-existing) path, but it fails the Pig load with HCatLoader, because 
> the HCatBaseInputFormat getSplits throws an error when getting a split for a 
> non-existing path. The error message might look like:
> {code}
> Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does 
> not exist: 
> hdfs://xyz.com:8020/user/hive/warehouse/xyz/date=2016-01-01/country=BR
>   at 
> org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
>   at 
> org.apache.hive.hcatalog.mapreduce.HCatBaseInputFormat.getSplits(HCatBaseInputFormat.java:162)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:274)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-13509) HCatalog getSplits should ignore the partition with invalid path

2016-04-26 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15258780#comment-15258780
 ] 

Mithun Radhakrishnan commented on HIVE-13509:
-

Yes, sir. +1.

> HCatalog getSplits should ignore the partition with invalid path
> 
>
> Key: HIVE-13509
> URL: https://issues.apache.org/jira/browse/HIVE-13509
> Project: Hive
>  Issue Type: Improvement
>  Components: HCatalog
>Reporter: Chaoyu Tang
>Assignee: Chaoyu Tang
> Attachments: HIVE-13509.1.patch, HIVE-13509.2.patch, HIVE-13509.patch
>
>
> It is quite common that there is a discrepancy between a partition directory 
> and its HMS metadata, simply because the directory could be added/deleted 
> externally using hdfs shell command. Technically it should be fixed by MSCK 
> and alter table .. add/drop command etc, but sometimes it might not be 
> practical especially in a multi-tenant env. This discrepancy does not cause 
> any problem to Hive, Hive returns no rows for a partition with an invalid 
> (e.g. non-existing) path, but it fails the Pig load with HCatLoader, because 
> the HCatBaseInputFormat getSplits throws an error when getting a split for a 
> non-existing path. The error message might look like:
> {code}
> Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does 
> not exist: 
> hdfs://xyz.com:8020/user/hive/warehouse/xyz/date=2016-01-01/country=BR
>   at 
> org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
>   at 
> org.apache.hive.hcatalog.mapreduce.HCatBaseInputFormat.getSplits(HCatBaseInputFormat.java:162)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:274)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-14380) Queries on tables with remote HDFS paths fail in "encryption" checks.

2016-07-28 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-14380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-14380:

Description: 
If a table has table/partition locations set to remote HDFS paths, querying 
them will cause the following IAException:

{noformat}
2016-07-26 01:16:27,471 ERROR parse.CalcitePlanner 
(SemanticAnalyzer.java:getMetaData(1867)) - 
org.apache.hadoop.hive.ql.metadata.HiveException: Unable to determine if 
hdfs://foo.ygrid.yahoo.com:8020/projects/my_db/my_table is encrypted: 
java.lang.IllegalArgumentException: Wrong FS: 
hdfs://foo.ygrid.yahoo.com:8020/projects/my_db/my_table, expected: 
hdfs://bar.ygrid.yahoo.com:8020
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.isPathEncrypted(SemanticAnalyzer.java:2204)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getStrongestEncryptedTablePath(SemanticAnalyzer.java:2274)
...
{noformat}

This is because of the following code in {{SessionState}}:
{code:title=SessionState.java|borderStyle=solid}
 public HadoopShims.HdfsEncryptionShim getHdfsEncryptionShim() throws 
HiveException {
if (hdfsEncryptionShim == null) {
  try {
FileSystem fs = FileSystem.get(sessionConf);
if ("hdfs".equals(fs.getUri().getScheme())) {
  hdfsEncryptionShim = 
ShimLoader.getHadoopShims().createHdfsEncryptionShim(fs, sessionConf);
} else {
  LOG.debug("Could not get hdfsEncryptionShim, it is only applicable to 
hdfs filesystem.");
}
  } catch (Exception e) {
throw new HiveException(e);
  }
}

return hdfsEncryptionShim;
  }
{code}

When the {{FileSystem}} instance is created, using the {{sessionConf}} implies 
that the current HDFS is going to be used. This call should instead fetch the 
{{FileSystem}} instance corresponding to the path being checked.

A fix is forthcoming...

  was:
If a table has table/partition locations set to remote HDFS paths, querying 
them will cause the following IAException:

{noformat}
2016-07-26 01:16:27,471 ERROR parse.CalcitePlanner 
(SemanticAnalyzer.java:getMetaData(1867)) - 
org.apache.hadoop.hive.ql.metadata.HiveException: Unable to deter
mine if hdfs://foo.ygrid.yahoo.com:8020/projects/my_db/my_table is encrypted: 
java.lang.IllegalArgumentException: Wrong FS: 
hdfs://foo.ygrid.yahoo.com:8020/projects/my_db/my_table, expected: 
hdfs://bar.ygrid.yahoo.com:8020
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.isPathEncrypted(SemanticAnalyzer.java:2204)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getStrongestEncryptedTablePath(SemanticAnalyzer.java:2274)
...
{noformat}

This is because of the following code in {{SessionState}}:
{code:title=SessionState.java|borderStyle=solid}
 public HadoopShims.HdfsEncryptionShim getHdfsEncryptionShim() throws 
HiveException {
if (hdfsEncryptionShim == null) {
  try {
FileSystem fs = FileSystem.get(sessionConf);
if ("hdfs".equals(fs.getUri().getScheme())) {
  hdfsEncryptionShim = 
ShimLoader.getHadoopShims().createHdfsEncryptionShim(fs, sessionConf);
} else {
  LOG.debug("Could not get hdfsEncryptionShim, it is only applicable to 
hdfs filesystem.");
}
  } catch (Exception e) {
throw new HiveException(e);
  }
}

return hdfsEncryptionShim;
  }
{code}

When the {{FileSystem}} instance is created, using the {{sessionConf}} implies 
that the current HDFS is going to be used. This call should instead fetch the 
{{FileSystem}} instance corresponding to the path being checked.

A fix is forthcoming...


> Queries on tables with remote HDFS paths fail in "encryption" checks.
> -
>
> Key: HIVE-14380
> URL: https://issues.apache.org/jira/browse/HIVE-14380
> Project: Hive
>  Issue Type: Bug
>  Components: Encryption
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
>
> If a table has table/partition locations set to remote HDFS paths, querying 
> them will cause the following IAException:
> {noformat}
> 2016-07-26 01:16:27,471 ERROR parse.CalcitePlanner 
> (SemanticAnalyzer.java:getMetaData(1867)) - 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to determine if 
> hdfs://foo.ygrid.yahoo.com:8020/projects/my_db/my_table is encrypted: 
> java.lang.IllegalArgumentException: Wrong FS: 
> hdfs://foo.ygrid.yahoo.com:8020/projects/my_db/my_table, expected: 
> hdfs://bar.ygrid.yahoo.com:8020
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.isPathEncrypted(SemanticAnalyzer.java:2204)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getStrongestEncryptedTablePath(SemanticAnalyzer.java:2274)
> ...
> {noformat}
> This is because of the following code in {{SessionState}}:
> 

[jira] [Updated] (HIVE-14380) Queries on tables with remote HDFS paths fail in "encryption" checks.

2016-07-28 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-14380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-14380:

Status: Patch Available  (was: Open)

> Queries on tables with remote HDFS paths fail in "encryption" checks.
> -
>
> Key: HIVE-14380
> URL: https://issues.apache.org/jira/browse/HIVE-14380
> Project: Hive
>  Issue Type: Bug
>  Components: Encryption
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-14380.1.patch
>
>
> If a table has table/partition locations set to remote HDFS paths, querying 
> them will cause the following IAException:
> {noformat}
> 2016-07-26 01:16:27,471 ERROR parse.CalcitePlanner 
> (SemanticAnalyzer.java:getMetaData(1867)) - 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to determine if 
> hdfs://foo.ygrid.yahoo.com:8020/projects/my_db/my_table is encrypted: 
> java.lang.IllegalArgumentException: Wrong FS: 
> hdfs://foo.ygrid.yahoo.com:8020/projects/my_db/my_table, expected: 
> hdfs://bar.ygrid.yahoo.com:8020
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.isPathEncrypted(SemanticAnalyzer.java:2204)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getStrongestEncryptedTablePath(SemanticAnalyzer.java:2274)
> ...
> {noformat}
> This is because of the following code in {{SessionState}}:
> {code:title=SessionState.java|borderStyle=solid}
>  public HadoopShims.HdfsEncryptionShim getHdfsEncryptionShim() throws 
> HiveException {
> if (hdfsEncryptionShim == null) {
>   try {
> FileSystem fs = FileSystem.get(sessionConf);
> if ("hdfs".equals(fs.getUri().getScheme())) {
>   hdfsEncryptionShim = 
> ShimLoader.getHadoopShims().createHdfsEncryptionShim(fs, sessionConf);
> } else {
>   LOG.debug("Could not get hdfsEncryptionShim, it is only applicable 
> to hdfs filesystem.");
> }
>   } catch (Exception e) {
> throw new HiveException(e);
>   }
> }
> return hdfsEncryptionShim;
>   }
> {code}
> When the {{FileSystem}} instance is created, using the {{sessionConf}} 
> implies that the current HDFS is going to be used. This call should instead 
> fetch the {{FileSystem}} instance corresponding to the path being checked.
> A fix is forthcoming...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-14380) Queries on tables with remote HDFS paths fail in "encryption" checks.

2016-07-28 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-14380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-14380:

Attachment: HIVE-14380.1.patch

The tentative fix.
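
Roughly, the idea is to resolve the encryption shim against the {{FileSystem}} that owns the path under examination, rather than the session's default filesystem. A minimal sketch, assuming a path-scoped variant of the method; this is not the attached patch verbatim, and any per-URI caching of shims is omitted:

{code:java}
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.shims.HadoopShims;
import org.apache.hadoop.hive.shims.ShimLoader;

public class PathScopedEncryptionShim {
  // Resolve the shim for the filesystem that actually owns 'path', so that a
  // remote-HDFS location no longer trips the "Wrong FS" check.
  public static HadoopShims.HdfsEncryptionShim getHdfsEncryptionShim(Path path, HiveConf conf)
      throws HiveException {
    try {
      FileSystem fs = path.getFileSystem(conf);
      if ("hdfs".equals(fs.getUri().getScheme())) {
        return ShimLoader.getHadoopShims().createHdfsEncryptionShim(fs, conf);
      }
      return null; // encryption checks only apply to HDFS paths
    } catch (Exception e) {
      throw new HiveException(e);
    }
  }
}
{code}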

> Queries on tables with remote HDFS paths fail in "encryption" checks.
> -
>
> Key: HIVE-14380
> URL: https://issues.apache.org/jira/browse/HIVE-14380
> Project: Hive
>  Issue Type: Bug
>  Components: Encryption
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-14380.1.patch
>
>
> If a table has table/partition locations set to remote HDFS paths, querying 
> them will cause the following IAException:
> {noformat}
> 2016-07-26 01:16:27,471 ERROR parse.CalcitePlanner 
> (SemanticAnalyzer.java:getMetaData(1867)) - 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to determine if 
> hdfs://foo.ygrid.yahoo.com:8020/projects/my_db/my_table is encrypted: 
> java.lang.IllegalArgumentException: Wrong FS: 
> hdfs://foo.ygrid.yahoo.com:8020/projects/my_db/my_table, expected: 
> hdfs://bar.ygrid.yahoo.com:8020
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.isPathEncrypted(SemanticAnalyzer.java:2204)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getStrongestEncryptedTablePath(SemanticAnalyzer.java:2274)
> ...
> {noformat}
> This is because of the following code in {{SessionState}}:
> {code:title=SessionState.java|borderStyle=solid}
>  public HadoopShims.HdfsEncryptionShim getHdfsEncryptionShim() throws 
> HiveException {
> if (hdfsEncryptionShim == null) {
>   try {
> FileSystem fs = FileSystem.get(sessionConf);
> if ("hdfs".equals(fs.getUri().getScheme())) {
>   hdfsEncryptionShim = 
> ShimLoader.getHadoopShims().createHdfsEncryptionShim(fs, sessionConf);
> } else {
>   LOG.debug("Could not get hdfsEncryptionShim, it is only applicable 
> to hdfs filesystem.");
> }
>   } catch (Exception e) {
> throw new HiveException(e);
>   }
> }
> return hdfsEncryptionShim;
>   }
> {code}
> When the {{FileSystem}} instance is created, using the {{sessionConf}} 
> implies that the current HDFS is going to be used. This call should instead 
> fetch the {{FileSystem}} instance corresponding to the path being checked.
> A fix is forthcoming...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-13756) Map failure attempts to delete reducer _temporary directory on multi-query pig query

2016-08-10 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-13756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-13756:

  Resolution: Fixed
   Fix Version/s: 2.0.0
Target Version/s: 2.0.0, 1.2.1  (was: 1.2.1, 2.0.0)
  Status: Resolved  (was: Patch Available)

> Map failure attempts to delete reducer _temporary directory on multi-query 
> pig query
> 
>
> Key: HIVE-13756
> URL: https://issues.apache.org/jira/browse/HIVE-13756
> Project: Hive
>  Issue Type: Bug
>  Components: HCatalog
>Affects Versions: 1.2.1, 2.0.0
>Reporter: Chris Drome
>Assignee: Chris Drome
> Fix For: 2.0.0
>
> Attachments: HIVE-13756-branch-1.patch, HIVE-13756.1-branch-1.patch, 
> HIVE-13756.1.patch, HIVE-13756.patch
>
>
> A pig script, executed with multi-query enabled, that reads the source data 
> and writes it as-is into TABLE_A as well as performing a group-by operation 
> on the data which is written into TABLE_B can produce erroneous results if 
> any map fails. This results in a single MR job that writes the map output to 
> a scratch directory relative to TABLE_A and the reducer output to a scratch 
> directory relative to TABLE_B.
> If one or more maps fail, it will delete the attempt data relative to TABLE_A, 
> but it also deletes the _temporary directory relative to TABLE_B. This has 
> the unintended side-effect of preventing subsequent maps from committing 
> their data. This means that any maps which completed successfully before the 
> first map failure will have their data committed as expected, while other maps 
> will not, resulting in an incomplete result set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-13754) Fix resource leak in HiveClientCache

2016-08-10 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-13754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-13754:

  Resolution: Fixed
   Fix Version/s: 2.0.0
Target Version/s: 2.0.0, 1.2.1  (was: 1.2.1, 2.0.0)
  Status: Resolved  (was: Patch Available)

> Fix resource leak in HiveClientCache
> 
>
> Key: HIVE-13754
> URL: https://issues.apache.org/jira/browse/HIVE-13754
> Project: Hive
>  Issue Type: Bug
>  Components: Clients
>Affects Versions: 1.2.1, 2.0.0
>Reporter: Chris Drome
>Assignee: Chris Drome
> Fix For: 2.0.0
>
> Attachments: HIVE-13754-branch-1.patch, HIVE-13754.1-branch-1.patch, 
> HIVE-13754.1.patch, HIVE-13754.patch
>
>
> Found that the {{users}} reference count can go into negative values, which 
> prevents {{tearDownIfUnused}} from closing the client connection when called.
> This leads to a build up of clients which have been evicted from the cache, 
> are no longer in use, but have not been shutdown.
> GC will eventually call {{finalize}}, which forcibly closes the connection 
> and cleans up the client, but I have seen as many as several hundred open 
> client connections as a result.
> The main cause of this is RetryingMetaStoreClient, which will 
> call {{reconnect}} on acquire, which calls {{close}}. This will decrement 
> {{users}} to -1 on the reconnect, then acquire will increase this to 0 while 
> using it, and back to -1 when it releases it.
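
To illustrate the invariant the fix needs to preserve, here is a minimal, hypothetical sketch (this is not the {{HiveClientCache}} code itself): the user count must never go negative, and the underlying connection should only be torn down once the count genuinely reaches zero.

{code:java}
// Hypothetical reference-counted wrapper around a client connection; not the
// actual HiveClientCache code, just the invariant the fix needs to preserve.
class CountedClient {
  private int users = 0;
  private boolean closed = false;

  synchronized void acquire() {
    users++;
  }

  synchronized void release() {
    if (users <= 0) {
      // An unbalanced release (e.g. a close() triggered during reconnect) would
      // otherwise drive the count negative and keep tearDownIfUnused() from firing.
      return;
    }
    users--;
    tearDownIfUnused();
  }

  private synchronized void tearDownIfUnused() {
    if (users == 0 && !closed) {
      closed = true;
      // close the underlying metastore connection here
    }
  }
}
{code}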



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-13754) Fix resource leak in HiveClientCache

2016-08-10 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15415599#comment-15415599
 ] 

Mithun Radhakrishnan commented on HIVE-13754:
-

Committed to master. Thanks, [~cdrome]!

> Fix resource leak in HiveClientCache
> 
>
> Key: HIVE-13754
> URL: https://issues.apache.org/jira/browse/HIVE-13754
> Project: Hive
>  Issue Type: Bug
>  Components: Clients
>Affects Versions: 1.2.1, 2.0.0
>Reporter: Chris Drome
>Assignee: Chris Drome
> Fix For: 2.0.0
>
> Attachments: HIVE-13754-branch-1.patch, HIVE-13754.1-branch-1.patch, 
> HIVE-13754.1.patch, HIVE-13754.patch
>
>
> Found that the {{users}} reference count can go into negative values, which 
> prevents {{tearDownIfUnused}} from closing the client connection when called.
> This leads to a build up of clients which have been evicted from the cache, 
> are no longer in use, but have not been shutdown.
> GC will eventually call {{finalize}}, which forcibly closes the connection 
> and cleans up the client, but I have seen as many as several hundred open 
> client connections as a result.
> The main cause of this is RetryingMetaStoreClient, which will 
> call {{reconnect}} on acquire, which calls {{close}}. This will decrement 
> {{users}} to -1 on the reconnect, then acquire will increase this to 0 while 
> using it, and back to -1 when it releases it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-13756) Map failure attempts to delete reducer _temporary directory on multi-query pig query

2016-08-10 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15415574#comment-15415574
 ] 

Mithun Radhakrishnan commented on HIVE-13756:
-

Committed to master. Thanks, [~cdrome]!

> Map failure attempts to delete reducer _temporary directory on multi-query 
> pig query
> 
>
> Key: HIVE-13756
> URL: https://issues.apache.org/jira/browse/HIVE-13756
> Project: Hive
>  Issue Type: Bug
>  Components: HCatalog
>Affects Versions: 1.2.1, 2.0.0
>Reporter: Chris Drome
>Assignee: Chris Drome
> Attachments: HIVE-13756-branch-1.patch, HIVE-13756.1-branch-1.patch, 
> HIVE-13756.1.patch, HIVE-13756.patch
>
>
> A pig script, executed with multi-query enabled, that reads the source data 
> and writes it as-is into TABLE_A as well as performing a group-by operation 
> on the data which is written into TABLE_B can produce erroneous results if 
> any map fails. This results in a single MR job that writes the map output to 
> a scratch directory relative to TABLE_A and the reducer output to a scratch 
> directory relative to TABLE_B.
> If one or more maps fail, it will delete the attempt data relative to TABLE_A, 
> but it also deletes the _temporary directory relative to TABLE_B. This has 
> the unintended side-effect of preventing subsequent maps from committing 
> their data. This means that any maps which completed successfully before the 
> first map failure will have their data committed as expected, while other maps 
> will not, resulting in an incomplete result set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-11693) CommonMergeJoinOperator throws exception with tez

2016-08-11 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417699#comment-15417699
 ] 

Mithun Radhakrishnan commented on HIVE-11693:
-

[~selinazh], could we please post our solution to this JIRA?

> CommonMergeJoinOperator throws exception with tez
> -
>
> Key: HIVE-11693
> URL: https://issues.apache.org/jira/browse/HIVE-11693
> Project: Hive
>  Issue Type: Bug
>Reporter: Rajesh Balamohan
>Assignee: Selina Zhang
> Attachments: HIVE-11693.1.patch
>
>
> Got this when executing a simple query with latest hive build + tez latest 
> version.
> {noformat}
> Error: Failure while running task: 
> attempt_1439860407967_0291_2_03_45_0:java.lang.RuntimeException: 
> java.lang.RuntimeException: Hive Runtime Error while closing operators: 
> java.lang.RuntimeException: java.io.IOException: Please check if you are 
> invoking moveToNext() even after it returned false.
> at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:171)
> at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:137)
> at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:349)
> at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:71)
> at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:60)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
> at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:60)
> at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:35)
> at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.RuntimeException: Hive Runtime Error while closing 
> operators: java.lang.RuntimeException: java.io.IOException: Please check if 
> you are invoking moveToNext() even after it returned false.
> at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.close(ReduceRecordProcessor.java:316)
> at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:162)
> ... 14 more
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.lang.RuntimeException: java.io.IOException: Please check if you are 
> invoking moveToNext() even after it returned false.
> at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.fetchOneRow(CommonMergeJoinOperator.java:412)
> at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.fetchNextGroup(CommonMergeJoinOperator.java:375)
> at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.doFirstFetchIfNeeded(CommonMergeJoinOperator.java:482)
> at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.joinFinalLeftData(CommonMergeJoinOperator.java:434)
> at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.closeOp(CommonMergeJoinOperator.java:384)
> at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:616)
> at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.close(ReduceRecordProcessor.java:292)
> ... 15 more
> Caused by: java.lang.RuntimeException: java.io.IOException: Please check if 
> you are invoking moveToNext() even after it returned false.
> at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:291)
> at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.fetchOneRow(CommonMergeJoinOperator.java:400)
> ... 21 more
> Caused by: java.io.IOException: Please check if you are invoking moveToNext() 
> even after it returned false.
> at 
> org.apache.tez.runtime.library.common.ValuesIterator.hasCompletedProcessing(ValuesIterator.java:223)
> at 
> org.apache.tez.runtime.library.common.ValuesIterator.moveToNext(ValuesIterator.java:105)
> at 
> org.apache.tez.runtime.library.input.OrderedGroupedKVInput$OrderedGroupedKeyValuesReader.next(OrderedGroupedKVInput.java:308)
> at 
> org.apache.hadoop.hive.ql.exec.tez.KeyValuesFromKeyValues.next(KeyValuesFromKeyValues.java:46)
> at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:249)
> ... 22 more
> {noformat}
> Not sure if this is related to HIVE-11016. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-14380) Queries on tables with remote HDFS paths fail in "encryption" checks.

2016-08-03 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406177#comment-15406177
 ] 

Mithun Radhakrishnan commented on HIVE-14380:
-

Thank you very much, [~spena]. I have a related fix on the metastore 
server-side. I hope to make time to raise a JIRA for this soon.

> Queries on tables with remote HDFS paths fail in "encryption" checks.
> -
>
> Key: HIVE-14380
> URL: https://issues.apache.org/jira/browse/HIVE-14380
> Project: Hive
>  Issue Type: Bug
>  Components: Encryption
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Fix For: 2.2.0
>
> Attachments: HIVE-14380.1.patch
>
>
> If a table has table/partition locations set to remote HDFS paths, querying 
> them will cause the following IAException:
> {noformat}
> 2016-07-26 01:16:27,471 ERROR parse.CalcitePlanner 
> (SemanticAnalyzer.java:getMetaData(1867)) - 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to determine if 
> hdfs://foo.ygrid.yahoo.com:8020/projects/my_db/my_table is encrypted: 
> java.lang.IllegalArgumentException: Wrong FS: 
> hdfs://foo.ygrid.yahoo.com:8020/projects/my_db/my_table, expected: 
> hdfs://bar.ygrid.yahoo.com:8020
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.isPathEncrypted(SemanticAnalyzer.java:2204)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getStrongestEncryptedTablePath(SemanticAnalyzer.java:2274)
> ...
> {noformat}
> This is because of the following code in {{SessionState}}:
> {code:title=SessionState.java|borderStyle=solid}
>  public HadoopShims.HdfsEncryptionShim getHdfsEncryptionShim() throws 
> HiveException {
> if (hdfsEncryptionShim == null) {
>   try {
> FileSystem fs = FileSystem.get(sessionConf);
> if ("hdfs".equals(fs.getUri().getScheme())) {
>   hdfsEncryptionShim = 
> ShimLoader.getHadoopShims().createHdfsEncryptionShim(fs, sessionConf);
> } else {
>   LOG.debug("Could not get hdfsEncryptionShim, it is only applicable 
> to hdfs filesystem.");
> }
>   } catch (Exception e) {
> throw new HiveException(e);
>   }
> }
> return hdfsEncryptionShim;
>   }
> {code}
> When the {{FileSystem}} instance is created, using the {{sessionConf}} 
> implies that the current HDFS is going to be used. This call should instead 
> fetch the {{FileSystem}} instance corresponding to the path being checked.
> A fix is forthcoming...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-13754) Fix resource leak in HiveClientCache

2016-08-11 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-13754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-13754:

Fix Version/s: (was: 2.0.0)
   2.2.0

> Fix resource leak in HiveClientCache
> 
>
> Key: HIVE-13754
> URL: https://issues.apache.org/jira/browse/HIVE-13754
> Project: Hive
>  Issue Type: Bug
>  Components: Clients
>Affects Versions: 1.2.1, 2.0.0
>Reporter: Chris Drome
>Assignee: Chris Drome
> Fix For: 2.2.0
>
> Attachments: HIVE-13754-branch-1.patch, HIVE-13754.1-branch-1.patch, 
> HIVE-13754.1.patch, HIVE-13754.patch
>
>
> Found that the {{users}} reference count can go into negative values, which 
> prevents {{tearDownIfUnused}} from closing the client connection when called.
> This leads to a build up of clients which have been evicted from the cache, 
> are no longer in use, but have not been shutdown.
> GC will eventually call {{finalize}}, which forcibly closes the connection 
> and cleans up the client, but I have seen as many as several hundred open 
> client connections as a result.
> The main cause of this is RetryingMetaStoreClient, which will 
> call {{reconnect}} on acquire, which calls {{close}}. This will decrement 
> {{users}} to -1 on the reconnect, then acquire will increase this to 0 while 
> using it, and back to -1 when it releases it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-13754) Fix resource leak in HiveClientCache

2016-08-11 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15416684#comment-15416684
 ] 

Mithun Radhakrishnan commented on HIVE-13754:
-

Right you are, [~leftylev]. I've fixed (aha!) the fix version.

> Fix resource leak in HiveClientCache
> 
>
> Key: HIVE-13754
> URL: https://issues.apache.org/jira/browse/HIVE-13754
> Project: Hive
>  Issue Type: Bug
>  Components: Clients
>Affects Versions: 1.2.1, 2.0.0
>Reporter: Chris Drome
>Assignee: Chris Drome
> Fix For: 2.2.0
>
> Attachments: HIVE-13754-branch-1.patch, HIVE-13754.1-branch-1.patch, 
> HIVE-13754.1.patch, HIVE-13754.patch
>
>
> Found that the {{users}} reference count can go into negative values, which 
> prevents {{tearDownIfUnused}} from closing the client connection when called.
> This leads to a build up of clients which have been evicted from the cache, 
> are no longer in use, but have not been shutdown.
> GC will eventually call {{finalize}}, which forcibly closes the connection 
> and cleans up the client, but I have seen as many as several hundred open 
> client connections as a result.
> The main cause of this is RetryingMetaStoreClient, which will 
> call {{reconnect}} on acquire, which calls {{close}}. This will decrement 
> {{users}} to -1 on the reconnect, then acquire will increase this to 0 while 
> using it, and back to -1 when it releases it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-13756) Map failure attempts to delete reducer _temporary directory on multi-query pig query

2016-08-11 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-13756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-13756:

Fix Version/s: (was: 2.0.0)
   2.2.0

> Map failure attempts to delete reducer _temporary directory on multi-query 
> pig query
> 
>
> Key: HIVE-13756
> URL: https://issues.apache.org/jira/browse/HIVE-13756
> Project: Hive
>  Issue Type: Bug
>  Components: HCatalog
>Affects Versions: 1.2.1, 2.0.0
>Reporter: Chris Drome
>Assignee: Chris Drome
> Fix For: 2.2.0
>
> Attachments: HIVE-13756-branch-1.patch, HIVE-13756.1-branch-1.patch, 
> HIVE-13756.1.patch, HIVE-13756.patch
>
>
> A pig script, executed with multi-query enabled, that reads the source data 
> and writes it as-is into TABLE_A as well as performing a group-by operation 
> on the data which is written into TABLE_B can produce erroneous results if 
> any map fails. This results in a single MR job that writes the map output to 
> a scratch directory relative to TABLE_A and the reducer output to a scratch 
> directory relative to TABLE_B.
> If one or more maps fail, it will delete the attempt data relative to TABLE_A, 
> but it also deletes the _temporary directory relative to TABLE_B. This has 
> the unintended side-effect of preventing subsequent maps from committing 
> their data. This means that any maps which completed successfully before the 
> first map failure will have their data committed as expected, while other maps 
> will not, resulting in an incomplete result set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-14274) When columns are added to structs in a Hive table, HCatLoader breaks.

2016-07-18 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-14274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-14274:

Status: Patch Available  (was: Open)

> When columns are added to structs in a Hive table, HCatLoader breaks.
> -
>
> Key: HIVE-14274
> URL: https://issues.apache.org/jira/browse/HIVE-14274
> Project: Hive
>  Issue Type: Bug
>  Components: HCatalog
>Affects Versions: 2.1.0, 1.2.1
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-14274.1.patch
>
>
> Consider this sequence of table/partition creation and schema evolution:
> {code:sql}
> -- Create table.
> CREATE EXTERNAL TABLE `simple_text` (
> foo STRING,
> bar STRUCT<goo:STRING>
> )
> PARTITIONED BY ( dt STRING )
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '\t'
> COLLECTION ITEMS TERMINATED BY ':'
> STORED AS TEXTFILE ;
> -- Add partition.
> ALTER TABLE simple_text ADD PARTITION ( dt='0' );
> -- Alter the struct-column to add a new sub-field.
> ALTER TABLE simple_text CHANGE COLUMN bar bar STRUCT<goo:STRING, zoo:STRING>;
> {code}
> The {{dt='0'}} partition's schema indicates 2 fields in {{bar}}. The data can 
> be read using Hive, but not through HCatLoader. The error looks as follows:
> {noformat}
> org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception 
> while executing (Name: data_raw: 
> Store(hdfs://dilithiumblue-nn1.blue.ygrid.yahoo.com:8020/tmp/temp-643668868/tmp-1639945319:org.apache.pig.impl.io.TFileStorage)
>  - scope-1 Operator Key: scope-1): 
> org.apache.pig.backend.executionengine.ExecException: ERROR 0: 
> org.apache.pig.backend.executionengine.ExecException: ERROR 6018: Error 
> converting read value to tuple
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:314)
>   at 
> org.apache.pig.backend.hadoop.executionengine.tez.plan.operator.POStoreTez.getNextTuple(POStoreTez.java:123)
>   at 
> org.apache.pig.backend.hadoop.executionengine.tez.runtime.PigProcessor.runPipeline(PigProcessor.java:376)
>   at 
> org.apache.pig.backend.hadoop.executionengine.tez.runtime.PigProcessor.run(PigProcessor.java:241)
>   at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:362)
>   at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179)
>   at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679)
>   at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:171)
>   at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:167)
>   at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 0: 
> org.apache.pig.backend.executionengine.ExecException: ERROR 6018: Error 
> converting read value to tuple
>   at 
> org.apache.pig.backend.hadoop.executionengine.tez.plan.operator.POSimpleTezLoad.getNextTuple(POSimpleTezLoad.java:160)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:305)
>   ... 16 more
> Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 6018: 
> Error converting read value to tuple
>   at 
> org.apache.hive.hcatalog.pig.HCatBaseLoader.getNext(HCatBaseLoader.java:76)
>   at org.apache.hive.hcatalog.pig.HCatLoader.getNext(HCatLoader.java:63)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:204)
>   at 
> org.apache.tez.mapreduce.lib.MRReaderMapReduce.next(MRReaderMapReduce.java:118)
>   at 
> org.apache.pig.backend.hadoop.executionengine.tez.plan.operator.POSimpleTezLoad.getNextTuple(POSimpleTezLoad.java:140)
>   ... 17 more
> Caused by: java.lang.IndexOutOfBoundsException: Index: 2, Size: 2
>   at java.util.ArrayList.rangeCheck(ArrayList.java:653)
>   at java.util.ArrayList.get(ArrayList.java:429)
>   at 

[jira] [Updated] (HIVE-14274) When columns are added to structs in a Hive table, HCatLoader breaks.

2016-07-18 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-14274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-14274:

Attachment: HIVE-14274.1.patch

This solution allows for columns to be added to the end of structs. It looks 
like adding support for arbitrary column-schema evolution in structs would be 
very tricky.

(Note: The solution doesn't change {{HCatRecordReader}} at all, since the 
entire struct is projected correctly by the reader.)

> When columns are added to structs in a Hive table, HCatLoader breaks.
> -
>
> Key: HIVE-14274
> URL: https://issues.apache.org/jira/browse/HIVE-14274
> Project: Hive
>  Issue Type: Bug
>  Components: HCatalog
>Affects Versions: 1.2.1, 2.1.0
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-14274.1.patch
>
>
> Consider this sequence of table/partition creation and schema evolution:
> {code:sql}
> -- Create table.
> CREATE EXTERNAL TABLE `simple_text` (
> foo STRING,
> bar STRUCT
> )
> PARTITIONED BY ( dt STRING )
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '\t'
> COLLECTION ITEMS TERMINATED BY ':'
> STORED AS TEXTFILE ;
> -- Add partition.
> ALTER TABLE simple_text ADD PARTITION ( dt='0' );
> -- Alter the struct-column to add a new sub-field.
> ALTER TABLE simple_text CHANGE COLUMN bar bar STRUCT zoo:STRING>;
> {code}
> The {{dt='0'}} partition's schema indicates 2 fields in {{bar}}. The data can 
> be read using Hive, but not through HCatLoader. The error looks as follows:
> {noformat}
> org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception 
> while executing (Name: data_raw: 
> Store(hdfs://dilithiumblue-nn1.blue.ygrid.yahoo.com:8020/tmp/temp-643668868/tmp-1639945319:org.apache.pig.impl.io.TFileStorage)
>  - scope-1 Operator Key: scope-1): 
> org.apache.pig.backend.executionengine.ExecException: ERROR 0: 
> org.apache.pig.backend.executionengine.ExecException: ERROR 6018: Error 
> converting read value to tuple
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:314)
>   at 
> org.apache.pig.backend.hadoop.executionengine.tez.plan.operator.POStoreTez.getNextTuple(POStoreTez.java:123)
>   at 
> org.apache.pig.backend.hadoop.executionengine.tez.runtime.PigProcessor.runPipeline(PigProcessor.java:376)
>   at 
> org.apache.pig.backend.hadoop.executionengine.tez.runtime.PigProcessor.run(PigProcessor.java:241)
>   at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:362)
>   at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179)
>   at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679)
>   at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:171)
>   at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:167)
>   at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 0: 
> org.apache.pig.backend.executionengine.ExecException: ERROR 6018: Error 
> converting read value to tuple
>   at 
> org.apache.pig.backend.hadoop.executionengine.tez.plan.operator.POSimpleTezLoad.getNextTuple(POSimpleTezLoad.java:160)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:305)
>   ... 16 more
> Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 6018: 
> Error converting read value to tuple
>   at 
> org.apache.hive.hcatalog.pig.HCatBaseLoader.getNext(HCatBaseLoader.java:76)
>   at org.apache.hive.hcatalog.pig.HCatLoader.getNext(HCatLoader.java:63)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:204)
>   at 
> org.apache.tez.mapreduce.lib.MRReaderMapReduce.next(MRReaderMapReduce.java:118)
>   at 
> 

[jira] [Commented] (HIVE-13756) Map failure attempts to delete reducer _temporary directory on multi-query pig query

2016-07-15 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15380352#comment-15380352
 ] 

Mithun Radhakrishnan commented on HIVE-13756:
-

IMHO, the qtest failures here are irrelevant. This is a fix in HCat.

> Map failure attempts to delete reducer _temporary directory on multi-query 
> pig query
> 
>
> Key: HIVE-13756
> URL: https://issues.apache.org/jira/browse/HIVE-13756
> Project: Hive
>  Issue Type: Bug
>  Components: HCatalog
>Affects Versions: 1.2.1, 2.0.0
>Reporter: Chris Drome
>Assignee: Chris Drome
> Attachments: HIVE-13756-branch-1.patch, HIVE-13756.1-branch-1.patch, 
> HIVE-13756.1.patch, HIVE-13756.patch
>
>
> A Pig script, executed with multi-query enabled, that reads the source data 
> and writes it as-is into TABLE_A, while also performing a group-by operation 
> whose result is written into TABLE_B, can produce erroneous results if any 
> map fails. This results in a single MR job that writes the map output to 
> a scratch directory relative to TABLE_A and the reducer output to a scratch 
> directory relative to TABLE_B.
> If one or more maps fail, the cleanup deletes the failed attempt's data 
> relative to TABLE_A, but it also deletes the _temporary directory relative to 
> TABLE_B. This has the unintended side-effect of preventing subsequent maps 
> from committing their data. This means that any maps which completed 
> successfully before the first map failure will have their data committed as 
> expected, while the rest will not, resulting in an incomplete result set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-11693) CommonMergeJoinOperator throws exception with tez

2016-07-06 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15364738#comment-15364738
 ] 

Mithun Radhakrishnan commented on HIVE-11693:
-

Kewl. Assigned to [~selinazh]. Should have our patch out soon.

> CommonMergeJoinOperator throws exception with tez
> -
>
> Key: HIVE-11693
> URL: https://issues.apache.org/jira/browse/HIVE-11693
> Project: Hive
>  Issue Type: Bug
>Reporter: Rajesh Balamohan
>Assignee: Selina Zhang
> Attachments: HIVE-11693.1.patch
>
>
> Got this when executing a simple query with latest hive build + tez latest 
> version.
> {noformat}
> Error: Failure while running task: 
> attempt_1439860407967_0291_2_03_45_0:java.lang.RuntimeException: 
> java.lang.RuntimeException: Hive Runtime Error while closing operators: 
> java.lang.RuntimeException: java.io.IOException: Please check if you are 
> invoking moveToNext() even after it returned false.
> at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:171)
> at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:137)
> at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:349)
> at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:71)
> at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:60)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
> at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:60)
> at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:35)
> at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.RuntimeException: Hive Runtime Error while closing 
> operators: java.lang.RuntimeException: java.io.IOException: Please check if 
> you are invoking moveToNext() even after it returned false.
> at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.close(ReduceRecordProcessor.java:316)
> at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:162)
> ... 14 more
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.lang.RuntimeException: java.io.IOException: Please check if you are 
> invoking moveToNext() even after it returned false.
> at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.fetchOneRow(CommonMergeJoinOperator.java:412)
> at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.fetchNextGroup(CommonMergeJoinOperator.java:375)
> at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.doFirstFetchIfNeeded(CommonMergeJoinOperator.java:482)
> at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.joinFinalLeftData(CommonMergeJoinOperator.java:434)
> at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.closeOp(CommonMergeJoinOperator.java:384)
> at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:616)
> at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.close(ReduceRecordProcessor.java:292)
> ... 15 more
> Caused by: java.lang.RuntimeException: java.io.IOException: Please check if 
> you are invoking moveToNext() even after it returned false.
> at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:291)
> at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.fetchOneRow(CommonMergeJoinOperator.java:400)
> ... 21 more
> Caused by: java.io.IOException: Please check if you are invoking moveToNext() 
> even after it returned false.
> at 
> org.apache.tez.runtime.library.common.ValuesIterator.hasCompletedProcessing(ValuesIterator.java:223)
> at 
> org.apache.tez.runtime.library.common.ValuesIterator.moveToNext(ValuesIterator.java:105)
> at 
> org.apache.tez.runtime.library.input.OrderedGroupedKVInput$OrderedGroupedKeyValuesReader.next(OrderedGroupedKVInput.java:308)
> at 
> org.apache.hadoop.hive.ql.exec.tez.KeyValuesFromKeyValues.next(KeyValuesFromKeyValues.java:46)
> at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:249)
> ... 22 more
> {noformat}
> Not sure if this is related to HIVE-11016. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-11693) CommonMergeJoinOperator throws exception with tez

2016-07-06 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-11693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-11693:

Assignee: Selina Zhang

> CommonMergeJoinOperator throws exception with tez
> -
>
> Key: HIVE-11693
> URL: https://issues.apache.org/jira/browse/HIVE-11693
> Project: Hive
>  Issue Type: Bug
>Reporter: Rajesh Balamohan
>Assignee: Selina Zhang
> Attachments: HIVE-11693.1.patch
>
>
> Got this when executing a simple query with latest hive build + tez latest 
> version.
> {noformat}
> Error: Failure while running task: 
> attempt_1439860407967_0291_2_03_45_0:java.lang.RuntimeException: 
> java.lang.RuntimeException: Hive Runtime Error while closing operators: 
> java.lang.RuntimeException: java.io.IOException: Please check if you are 
> invoking moveToNext() even after it returned false.
> at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:171)
> at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:137)
> at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:349)
> at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:71)
> at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:60)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
> at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:60)
> at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:35)
> at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.RuntimeException: Hive Runtime Error while closing 
> operators: java.lang.RuntimeException: java.io.IOException: Please check if 
> you are invoking moveToNext() even after it returned false.
> at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.close(ReduceRecordProcessor.java:316)
> at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:162)
> ... 14 more
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.lang.RuntimeException: java.io.IOException: Please check if you are 
> invoking moveToNext() even after it returned false.
> at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.fetchOneRow(CommonMergeJoinOperator.java:412)
> at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.fetchNextGroup(CommonMergeJoinOperator.java:375)
> at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.doFirstFetchIfNeeded(CommonMergeJoinOperator.java:482)
> at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.joinFinalLeftData(CommonMergeJoinOperator.java:434)
> at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.closeOp(CommonMergeJoinOperator.java:384)
> at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:616)
> at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.close(ReduceRecordProcessor.java:292)
> ... 15 more
> Caused by: java.lang.RuntimeException: java.io.IOException: Please check if 
> you are invoking moveToNext() even after it returned false.
> at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:291)
> at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.fetchOneRow(CommonMergeJoinOperator.java:400)
> ... 21 more
> Caused by: java.io.IOException: Please check if you are invoking moveToNext() 
> even after it returned false.
> at 
> org.apache.tez.runtime.library.common.ValuesIterator.hasCompletedProcessing(ValuesIterator.java:223)
> at 
> org.apache.tez.runtime.library.common.ValuesIterator.moveToNext(ValuesIterator.java:105)
> at 
> org.apache.tez.runtime.library.input.OrderedGroupedKVInput$OrderedGroupedKeyValuesReader.next(OrderedGroupedKVInput.java:308)
> at 
> org.apache.hadoop.hive.ql.exec.tez.KeyValuesFromKeyValues.next(KeyValuesFromKeyValues.java:46)
> at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:249)
> ... 22 more
> {noformat}
> Not sure if this is related to HIVE-11016. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-11693) CommonMergeJoinOperator throws exception with tez

2016-07-06 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15364709#comment-15364709
 ] 

Mithun Radhakrishnan commented on HIVE-11693:
-

[~rajesh.balamohan], et al., [~selinazh]'s analysis here seems accurate. 
Wouldn't her suggestion (i.e. to move {{posBigTable = (byte) 
conf.getBigTablePosition();}} to {{initializeOp()}}) fix the problem?

Would anyone else like to comment?

> CommonMergeJoinOperator throws exception with tez
> -
>
> Key: HIVE-11693
> URL: https://issues.apache.org/jira/browse/HIVE-11693
> Project: Hive
>  Issue Type: Bug
>Reporter: Rajesh Balamohan
> Attachments: HIVE-11693.1.patch
>
>
> Got this when executing a simple query with latest hive build + tez latest 
> version.
> {noformat}
> Error: Failure while running task: 
> attempt_1439860407967_0291_2_03_45_0:java.lang.RuntimeException: 
> java.lang.RuntimeException: Hive Runtime Error while closing operators: 
> java.lang.RuntimeException: java.io.IOException: Please check if you are 
> invoking moveToNext() even after it returned false.
> at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:171)
> at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:137)
> at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:349)
> at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:71)
> at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:60)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
> at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:60)
> at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:35)
> at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.RuntimeException: Hive Runtime Error while closing 
> operators: java.lang.RuntimeException: java.io.IOException: Please check if 
> you are invoking moveToNext() even after it returned false.
> at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.close(ReduceRecordProcessor.java:316)
> at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:162)
> ... 14 more
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.lang.RuntimeException: java.io.IOException: Please check if you are 
> invoking moveToNext() even after it returned false.
> at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.fetchOneRow(CommonMergeJoinOperator.java:412)
> at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.fetchNextGroup(CommonMergeJoinOperator.java:375)
> at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.doFirstFetchIfNeeded(CommonMergeJoinOperator.java:482)
> at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.joinFinalLeftData(CommonMergeJoinOperator.java:434)
> at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.closeOp(CommonMergeJoinOperator.java:384)
> at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:616)
> at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.close(ReduceRecordProcessor.java:292)
> ... 15 more
> Caused by: java.lang.RuntimeException: java.io.IOException: Please check if 
> you are invoking moveToNext() even after it returned false.
> at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:291)
> at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.fetchOneRow(CommonMergeJoinOperator.java:400)
> ... 21 more
> Caused by: java.io.IOException: Please check if you are invoking moveToNext() 
> even after it returned false.
> at 
> org.apache.tez.runtime.library.common.ValuesIterator.hasCompletedProcessing(ValuesIterator.java:223)
> at 
> org.apache.tez.runtime.library.common.ValuesIterator.moveToNext(ValuesIterator.java:105)
> at 
> org.apache.tez.runtime.library.input.OrderedGroupedKVInput$OrderedGroupedKeyValuesReader.next(OrderedGroupedKVInput.java:308)
> at 
> org.apache.hadoop.hive.ql.exec.tez.KeyValuesFromKeyValues.next(KeyValuesFromKeyValues.java:46)
> at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:249)
> ... 22 more
> {noformat}
> Not sure if this is related to HIVE-11016. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (HIVE-13754) Fix resource leak in HiveClientCache

2016-08-08 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15412206#comment-15412206
 ] 

Mithun Radhakrishnan commented on HIVE-13754:
-

+1. The test failures don't seem related to the fix.

> Fix resource leak in HiveClientCache
> 
>
> Key: HIVE-13754
> URL: https://issues.apache.org/jira/browse/HIVE-13754
> Project: Hive
>  Issue Type: Bug
>  Components: Clients
>Affects Versions: 1.2.1, 2.0.0
>Reporter: Chris Drome
>Assignee: Chris Drome
> Attachments: HIVE-13754-branch-1.patch, HIVE-13754.1-branch-1.patch, 
> HIVE-13754.1.patch, HIVE-13754.patch
>
>
> Found that the {{users}} reference count can go into negative values, which 
> prevents {{tearDownIfUnused}} from closing the client connection when called.
> This leads to a build-up of clients which have been evicted from the cache, 
> are no longer in use, but have not been shut down.
> GC will eventually call {{finalize}}, which forcibly closes the connection 
> and cleans up the client, but I have seen as many as several hundred open 
> client connections as a result.
> The main cause of this is {{RetryingMetaStoreClient}}, which calls 
> {{reconnect}} on acquire, which in turn calls {{close}}. This decrements 
> {{users}} to -1 on the reconnect; acquire then increases it to 0 while the 
> client is in use, and it drops back to -1 on release.
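As a simplified illustration of the counting bug described above (stand-in code, not the actual {{HiveClientCache}} classes): once the counter goes negative, the equality check in the teardown path never fires, and the connection stays open until finalization.

{code:java}
import java.util.concurrent.atomic.AtomicInteger;

public class CountedClientSketch {
  private final AtomicInteger users = new AtomicInteger(0);
  private boolean open = true;

  void acquire() { users.incrementAndGet(); }

  void release() {
    users.decrementAndGet();
    tearDownIfUnused();
  }

  // Closes only when the count is exactly zero. A stray extra decrement
  // (e.g. a close() issued during reconnect) drives the count negative, so
  // this check never passes and the connection lingers until GC finalizes it.
  synchronized void tearDownIfUnused() {
    if (users.get() == 0 && open) {
      open = false;
      System.out.println("client connection closed");
    }
  }

  public static void main(String[] args) {
    CountedClientSketch c = new CountedClientSketch();
    c.release();                                  // unbalanced release: users == -1
    c.acquire();                                  // users == 0 while "in use"
    c.release();                                  // back to -1; never closed
    System.out.println("still open: " + c.open);  // prints "still open: true"
  }
}
{code}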



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-14380) Queries on tables with remote HDFS paths fail in "encryption" checks.

2016-08-01 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15403411#comment-15403411
 ] 

Mithun Radhakrishnan commented on HIVE-14380:
-

I'll confirm, but I think these failures might be unrelated.

> Queries on tables with remote HDFS paths fail in "encryption" checks.
> -
>
> Key: HIVE-14380
> URL: https://issues.apache.org/jira/browse/HIVE-14380
> Project: Hive
>  Issue Type: Bug
>  Components: Encryption
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-14380.1.patch
>
>
> If a table has table/partition locations set to remote HDFS paths, querying 
> them will cause the following IAException:
> {noformat}
> 2016-07-26 01:16:27,471 ERROR parse.CalcitePlanner 
> (SemanticAnalyzer.java:getMetaData(1867)) - 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to determine if 
> hdfs://foo.ygrid.yahoo.com:8020/projects/my_db/my_table is encrypted: 
> java.lang.IllegalArgumentException: Wrong FS: 
> hdfs://foo.ygrid.yahoo.com:8020/projects/my_db/my_table, expected: 
> hdfs://bar.ygrid.yahoo.com:8020
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.isPathEncrypted(SemanticAnalyzer.java:2204)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getStrongestEncryptedTablePath(SemanticAnalyzer.java:2274)
> ...
> {noformat}
> This is because of the following code in {{SessionState}}:
> {code:title=SessionState.java|borderStyle=solid}
>  public HadoopShims.HdfsEncryptionShim getHdfsEncryptionShim() throws 
> HiveException {
> if (hdfsEncryptionShim == null) {
>   try {
> FileSystem fs = FileSystem.get(sessionConf);
> if ("hdfs".equals(fs.getUri().getScheme())) {
>   hdfsEncryptionShim = 
> ShimLoader.getHadoopShims().createHdfsEncryptionShim(fs, sessionConf);
> } else {
>   LOG.debug("Could not get hdfsEncryptionShim, it is only applicable 
> to hdfs filesystem.");
> }
>   } catch (Exception e) {
> throw new HiveException(e);
>   }
> }
> return hdfsEncryptionShim;
>   }
> {code}
> When the {{FileSystem}} instance is created, using the {{sessionConf}} 
> implies that the current HDFS is going to be used. This call should instead 
> fetch the {{FileSystem}} instance corresponding to the path being checked.
> A fix is forthcoming...
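A minimal sketch of the direction described above (not the committed patch): resolve the {{FileSystem}} from the path being checked rather than from the session configuration, and apply the encryption shim only when that path is actually on HDFS.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PathScopedEncryptionCheck {
  // Returns true when the given path lives on an HDFS filesystem, resolving
  // the FileSystem from the path itself instead of the session default.
  static boolean isOnHdfs(Path path, Configuration conf) throws Exception {
    FileSystem fs = path.getFileSystem(conf);
    return "hdfs".equals(fs.getUri().getScheme());
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // A remote table location such as hdfs://foo.ygrid.yahoo.com:8020/projects/my_db/my_table
    // would resolve to its own (remote) FileSystem here; a local path does not.
    System.out.println(isOnHdfs(new Path("file:///tmp"), conf));  // false: not HDFS, no shim needed
  }
}
{code}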



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-14380) Queries on tables with remote HDFS paths fail in "encryption" checks.

2016-08-02 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15403482#comment-15403482
 ] 

Mithun Radhakrishnan commented on HIVE-14380:
-

Yeah, looks like these tests are busted on master. :/ Just checked on a fresh 
checkout.

All except {{TestHiveMetaStoreTxns}}. That test seems to run for me (even with 
my patch applied).

> Queries on tables with remote HDFS paths fail in "encryption" checks.
> -
>
> Key: HIVE-14380
> URL: https://issues.apache.org/jira/browse/HIVE-14380
> Project: Hive
>  Issue Type: Bug
>  Components: Encryption
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-14380.1.patch
>
>
> If a table has table/partition locations set to remote HDFS paths, querying 
> them will cause the following IAException:
> {noformat}
> 2016-07-26 01:16:27,471 ERROR parse.CalcitePlanner 
> (SemanticAnalyzer.java:getMetaData(1867)) - 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to determine if 
> hdfs://foo.ygrid.yahoo.com:8020/projects/my_db/my_table is encrypted: 
> java.lang.IllegalArgumentException: Wrong FS: 
> hdfs://foo.ygrid.yahoo.com:8020/projects/my_db/my_table, expected: 
> hdfs://bar.ygrid.yahoo.com:8020
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.isPathEncrypted(SemanticAnalyzer.java:2204)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getStrongestEncryptedTablePath(SemanticAnalyzer.java:2274)
> ...
> {noformat}
> This is because of the following code in {{SessionState}}:
> {code:title=SessionState.java|borderStyle=solid}
>  public HadoopShims.HdfsEncryptionShim getHdfsEncryptionShim() throws 
> HiveException {
> if (hdfsEncryptionShim == null) {
>   try {
> FileSystem fs = FileSystem.get(sessionConf);
> if ("hdfs".equals(fs.getUri().getScheme())) {
>   hdfsEncryptionShim = 
> ShimLoader.getHadoopShims().createHdfsEncryptionShim(fs, sessionConf);
> } else {
>   LOG.debug("Could not get hdfsEncryptionShim, it is only applicable 
> to hdfs filesystem.");
> }
>   } catch (Exception e) {
> throw new HiveException(e);
>   }
> }
> return hdfsEncryptionShim;
>   }
> {code}
> When the {{FileSystem}} instance is created, using the {{sessionConf}} 
> implies that the current HDFS is going to be used. This call should instead 
> fetch the {{FileSystem}} instance corresponding to the path being checked.
> A fix is forthcoming...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-14380) Queries on tables with remote HDFS paths fail in "encryption" checks.

2016-07-29 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15399853#comment-15399853
 ] 

Mithun Radhakrishnan commented on HIVE-14380:
-

Thanks for reviewing, [~spena]. :] Also, yikes. I'm not sure how 2 JIRAs got 
raised. :/

> Queries on tables with remote HDFS paths fail in "encryption" checks.
> -
>
> Key: HIVE-14380
> URL: https://issues.apache.org/jira/browse/HIVE-14380
> Project: Hive
>  Issue Type: Bug
>  Components: Encryption
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-14380.1.patch
>
>
> If a table has table/partition locations set to remote HDFS paths, querying 
> them will cause the following IAException:
> {noformat}
> 2016-07-26 01:16:27,471 ERROR parse.CalcitePlanner 
> (SemanticAnalyzer.java:getMetaData(1867)) - 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to determine if 
> hdfs://foo.ygrid.yahoo.com:8020/projects/my_db/my_table is encrypted: 
> java.lang.IllegalArgumentException: Wrong FS: 
> hdfs://foo.ygrid.yahoo.com:8020/projects/my_db/my_table, expected: 
> hdfs://bar.ygrid.yahoo.com:8020
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.isPathEncrypted(SemanticAnalyzer.java:2204)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getStrongestEncryptedTablePath(SemanticAnalyzer.java:2274)
> ...
> {noformat}
> This is because of the following code in {{SessionState}}:
> {code:title=SessionState.java|borderStyle=solid}
>  public HadoopShims.HdfsEncryptionShim getHdfsEncryptionShim() throws 
> HiveException {
> if (hdfsEncryptionShim == null) {
>   try {
> FileSystem fs = FileSystem.get(sessionConf);
> if ("hdfs".equals(fs.getUri().getScheme())) {
>   hdfsEncryptionShim = 
> ShimLoader.getHadoopShims().createHdfsEncryptionShim(fs, sessionConf);
> } else {
>   LOG.debug("Could not get hdfsEncryptionShim, it is only applicable 
> to hdfs filesystem.");
> }
>   } catch (Exception e) {
> throw new HiveException(e);
>   }
> }
> return hdfsEncryptionShim;
>   }
> {code}
> When the {{FileSystem}} instance is created, using the {{sessionConf}} 
> implies that the current HDFS is going to be used. This call should instead 
> fetch the {{FileSystem}} instance corresponding to the path being checked.
> A fix is forthcoming...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-13756) Map failure attempts to delete reducer _temporary directory on multi-query pig query

2016-08-09 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15414342#comment-15414342
 ] 

Mithun Radhakrishnan commented on HIVE-13756:
-

+1. 

> Map failure attempts to delete reducer _temporary directory on multi-query 
> pig query
> 
>
> Key: HIVE-13756
> URL: https://issues.apache.org/jira/browse/HIVE-13756
> Project: Hive
>  Issue Type: Bug
>  Components: HCatalog
>Affects Versions: 1.2.1, 2.0.0
>Reporter: Chris Drome
>Assignee: Chris Drome
> Attachments: HIVE-13756-branch-1.patch, HIVE-13756.1-branch-1.patch, 
> HIVE-13756.1.patch, HIVE-13756.patch
>
>
> A Pig script, executed with multi-query enabled, that reads the source data 
> and writes it as-is into TABLE_A, while also performing a group-by operation 
> whose result is written into TABLE_B, can produce erroneous results if any 
> map fails. This results in a single MR job that writes the map output to 
> a scratch directory relative to TABLE_A and the reducer output to a scratch 
> directory relative to TABLE_B.
> If one or more maps fail, the cleanup deletes the failed attempt's data 
> relative to TABLE_A, but it also deletes the _temporary directory relative to 
> TABLE_B. This has the unintended side-effect of preventing subsequent maps 
> from committing their data. This means that any maps which completed 
> successfully before the first map failure will have their data committed as 
> expected, while the rest will not, resulting in an incomplete result set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-14274) When columns are added to structs in a Hive table, HCatLoader breaks.

2017-02-21 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15876489#comment-15876489
 ] 

Mithun Radhakrishnan commented on HIVE-14274:
-

bq. Any one any ideas? 

Yes. :] The patch I have up on this JIRA should allow for the addition of 
columns to the end of a struct. Supporting the general case will be a very 
large change to HCatalog.

> When columns are added to structs in a Hive table, HCatLoader breaks.
> -
>
> Key: HIVE-14274
> URL: https://issues.apache.org/jira/browse/HIVE-14274
> Project: Hive
>  Issue Type: Bug
>  Components: HCatalog
>Affects Versions: 1.2.1, 2.1.0
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-14274.1.patch
>
>
> Consider this sequence of table/partition creation and schema evolution:
> {code:sql}
> -- Create table.
> CREATE EXTERNAL TABLE `simple_text` (
> foo STRING,
> bar STRUCT
> )
> PARTITIONED BY ( dt STRING )
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '\t'
> COLLECTION ITEMS TERMINATED BY ':'
> STORED AS TEXTFILE ;
> -- Add partition.
> ALTER TABLE simple_text ADD PARTITION ( dt='0' );
> -- Alter the struct-column to add a new sub-field.
> ALTER TABLE simple_text CHANGE COLUMN bar bar STRUCT zoo:STRING>;
> {code}
> The {{dt='0'}} partition's schema indicates 2 fields in {{bar}}. The data can 
> be read using Hive, but not through HCatLoader. The error looks as follows:
> {noformat}
> org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception 
> while executing (Name: data_raw: 
> Store(hdfs://dilithiumblue-nn1.blue.ygrid.yahoo.com:8020/tmp/temp-643668868/tmp-1639945319:org.apache.pig.impl.io.TFileStorage)
>  - scope-1 Operator Key: scope-1): 
> org.apache.pig.backend.executionengine.ExecException: ERROR 0: 
> org.apache.pig.backend.executionengine.ExecException: ERROR 6018: Error 
> converting read value to tuple
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:314)
>   at 
> org.apache.pig.backend.hadoop.executionengine.tez.plan.operator.POStoreTez.getNextTuple(POStoreTez.java:123)
>   at 
> org.apache.pig.backend.hadoop.executionengine.tez.runtime.PigProcessor.runPipeline(PigProcessor.java:376)
>   at 
> org.apache.pig.backend.hadoop.executionengine.tez.runtime.PigProcessor.run(PigProcessor.java:241)
>   at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:362)
>   at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179)
>   at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679)
>   at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:171)
>   at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:167)
>   at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 0: 
> org.apache.pig.backend.executionengine.ExecException: ERROR 6018: Error 
> converting read value to tuple
>   at 
> org.apache.pig.backend.hadoop.executionengine.tez.plan.operator.POSimpleTezLoad.getNextTuple(POSimpleTezLoad.java:160)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:305)
>   ... 16 more
> Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 6018: 
> Error converting read value to tuple
>   at 
> org.apache.hive.hcatalog.pig.HCatBaseLoader.getNext(HCatBaseLoader.java:76)
>   at org.apache.hive.hcatalog.pig.HCatLoader.getNext(HCatLoader.java:63)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:204)
>   at 
> org.apache.tez.mapreduce.lib.MRReaderMapReduce.next(MRReaderMapReduce.java:118)
>   at 
> org.apache.pig.backend.hadoop.executionengine.tez.plan.operator.POSimpleTezLoad.getNextTuple(POSimpleTezLoad.java:140)
>   ... 

[jira] [Assigned] (HIVE-14789) Avro Table-reads bork when using SerDe-generated table-schema.

2016-09-19 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-14789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan reassigned HIVE-14789:
---

Assignee: Mithun Radhakrishnan

> Avro Table-reads bork when using SerDe-generated table-schema.
> --
>
> Key: HIVE-14789
> URL: https://issues.apache.org/jira/browse/HIVE-14789
> Project: Hive
>  Issue Type: Bug
>  Components: Serializers/Deserializers
>Affects Versions: 1.2.1, 2.0.1
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
>
> AvroSerDe allows one to skip the table-columns in a table-definition when 
> creating a table, as long as the TBLPROPERTIES includes a valid 
> {{avro.schema.url}} or {{avro.schema.literal}}. The table-columns are 
> inferred from processing the Avro schema file/literal.
> The problem is that the inferred schema might not be congruent with the 
> actual schema in the Avro schema file/literal. Consider the following table 
> definition:
> {code:sql}
> CREATE TABLE avro_schema_break_1
> ROW FORMAT
> SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
> STORED AS
> INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
> TBLPROPERTIES ('avro.schema.literal'='{
>   "type": "record",
>   "name": "Messages",
>   "namespace": "net.myth",
>   "fields": [
> {
>   "name": "header",
>   "type": [
> "null",
> {
>   "type": "record",
>   "name": "HeaderInfo",
>   "fields": [
> {
>   "name": "inferred_event_type",
>   "type": [
> "null",
> "string"
>   ],
>   "default": null
> },
> {
>   "name": "event_type",
>   "type": [
> "null",
> "string"
>   ],
>   "default": null
> },
> {
>   "name": "event_version",
>   "type": [
> "null",
> "string"
>   ],
>   "default": null
> }
>   ]
> }
>   ]
> },
> {
>   "name": "messages",
>   "type": {
> "type": "array",
> "items": {
>   "name": "MessageInfo",
>   "type": "record",
>   "fields": [
> {
>   "name": "message_id",
>   "type": [
> "null",
> "string"
>   ],
>   "doc": "Message-ID"
> },
> {
>   "name": "received_date",
>   "type": [
> "null",
> "long"
>   ],
>   "doc": "Received Date"
> },
> {
>   "name": "sent_date",
>   "type": [
> "null",
> "long"
>   ]
> },
> {
>   "name": "from_name",
>   "type": [
> "null",
> "string"
>   ]
> },
> {
>   "name": "flags",
>   "type": [
> "null",
> {
>   "type": "record",
>   "name": "Flags",
>   "fields": [
> {
>   "name": "is_seen",
>   "type": [
> "null",
> "boolean"
>   ],
>   "default": null
> },
> {
>   "name": "is_read",
>   "type": [
> "null",
> "boolean"
>   ],
>   "default": null
> },
> {
>   "name": "is_flagged",
>   "type": [
> "null",
> "boolean"
>   ],
>   "default": null
> }
>   ]
> }
>   ],
>   "default": null
> }
>   ]
> }
>   }
> }
>   ]
> }');
> {code}
> This produces a table with the following schema:
> {noformat}
> 2016-09-19T13:23:42,934 DEBUG [0ce7e586-13ea-4390-ac2a-6dac36e8a216 main] 
> hive.log: DDL: struct avro_schema_break_1 { 
> struct 
> header, 
> list>>
>  messages}
> {noformat}
> Data written to 

[jira] [Updated] (HIVE-14794) HCatalog support to pre-fetch schema for Avro tables that use avro.schema.url.

2016-09-19 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-14794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-14794:

Attachment: HIVE-14794.1.patch

This patch builds on HIVE-14792. It uses {{SpecialCases}} to prefetch Avro 
schema.
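A rough sketch of the pre-fetch idea (illustrative method and parameter names only; this is not the {{SpecialCases}} API): read the file behind {{avro.schema.url}} once at planning time and inline its contents as {{avro.schema.literal}} in the table parameters, so tasks stop re-reading the schema file.

{code:java}
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AvroSchemaPrefetchSketch {
  static void prefetchSchema(Map<String, String> tableParams, Configuration conf) throws Exception {
    String url = tableParams.get("avro.schema.url");
    if (url == null || tableParams.containsKey("avro.schema.literal")) {
      return;  // nothing to fetch, or a literal is already present
    }
    Path schemaPath = new Path(url);
    FileSystem fs = schemaPath.getFileSystem(conf);
    try (FSDataInputStream in = fs.open(schemaPath)) {
      String literal = new String(in.readAllBytes(), StandardCharsets.UTF_8);
      tableParams.put("avro.schema.literal", literal);
      tableParams.remove("avro.schema.url");  // tasks now use the inlined literal
    }
  }

  public static void main(String[] args) throws Exception {
    // Write a tiny schema file locally, then inline it, to show the round trip.
    java.nio.file.Path tmp = java.nio.file.Files.createTempFile("schema", ".avsc");
    java.nio.file.Files.write(tmp,
        "{\"type\":\"record\",\"name\":\"X\",\"fields\":[]}".getBytes(StandardCharsets.UTF_8));
    Map<String, String> params = new HashMap<>();
    params.put("avro.schema.url", tmp.toUri().toString());
    prefetchSchema(params, new Configuration());
    System.out.println(params.get("avro.schema.literal"));
  }
}
{code}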

> HCatalog support to pre-fetch schema for Avro tables that use avro.schema.url.
> --
>
> Key: HIVE-14794
> URL: https://issues.apache.org/jira/browse/HIVE-14794
> Project: Hive
>  Issue Type: Bug
>  Components: HCatalog
>Affects Versions: 1.2.1, 2.1.0
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-14794.1.patch
>
>
> HIVE-14792 introduces support to modify and add properties to 
> table-parameters during query-planning. It prefetches remote Avro-schema 
> information and stores it in TBLPROPERTIES, under {{avro.schema.literal}}.
> We'll need similar support in {{HCatLoader}} to prevent excessive reads of 
> schema-files in Pig queries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


  1   2   3   4   5   6   7   >