[jira] [Commented] (HIVE-15575) ALTER TABLE CONCATENATE and hive.merge.tezfiles seems busted for UNION ALL output

2017-01-11 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818351#comment-15818351
 ] 

Rohini Palaniswamy commented on HIVE-15575:
---

The concept of VertexGroups was added in Tez specifically for the union case, to
support writing to the same directory from different vertices. Including the
vertex id and output id in the part-file names avoids any file-name conflicts
that could cause overwrites, so sub-directories should not be required to
implement union with Tez.
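
A minimal Java sketch of the idea (not Hive's actual DAG-construction code; the
vertex names, output name, and output format here are illustrative assumptions)
showing two union branches grouped into a Tez VertexGroup that shares a single
data sink:

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.tez.dag.api.DAG;
import org.apache.tez.dag.api.DataSinkDescriptor;
import org.apache.tez.dag.api.Vertex;
import org.apache.tez.dag.api.VertexGroup;
import org.apache.tez.mapreduce.output.MROutput;

public class UnionVertexGroupSketch {

  // Hypothetical helper: 'branch1' and 'branch2' stand for the two UNION ALL
  // branches, already built as Tez vertices elsewhere.
  static void addUnionSink(DAG dag, Vertex branch1, Vertex branch2,
                           Configuration conf, String outputDir) {
    // Group the union branches so they can share one logical output.
    VertexGroup union = dag.createVertexGroup("union_group", branch1, branch2);

    // A single data sink for the whole group; Tez keys the committed part-file
    // names by vertex/output identity, so the branches cannot overwrite each
    // other even though they write into the same directory.
    DataSinkDescriptor sink =
        MROutput.createConfigBuilder(conf, TextOutputFormat.class, outputDir).build();
    union.addDataSink("union_output", sink);
  }
}
{code}

With this wiring both branches commit into the same output directory, which is
the mechanism described above and why no sub-directories should be needed.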

> ALTER TABLE CONCATENATE and hive.merge.tezfiles seems busted for UNION ALL 
> output
> -
>
> Key: HIVE-15575
> URL: https://issues.apache.org/jira/browse/HIVE-15575
> Project: Hive
>  Issue Type: Bug
>Reporter: Mithun Radhakrishnan
>Priority: Critical
>
> Hive {{UNION ALL}} produces data in sub-directories under the table/partition 
> directories. E.g.
> {noformat}
> hive (mythdb_hadooppf_17544)> create table source ( foo string, bar string, 
> goo string ) stored as textfile;
> OK
> Time taken: 0.322 seconds
> hive (mythdb_hadooppf_17544)> create table results_partitioned( foo string, 
> bar string, goo string ) partitioned by ( dt string ) stored as orcfile;
> OK
> Time taken: 0.322 seconds
> hive (mythdb_hadooppf_17544)> set hive.merge.tezfiles=false; insert overwrite 
> table results_partitioned partition( dt ) select 'goo', 'bar', 'foo', '1' 
> from source UNION ALL select 'go', 'far', 'moo', '1' from source;
> ...
> Loading data to table mythdb_hadooppf_17544.results_partitioned partition 
> (dt=null)
>  Time taken for load dynamic partitions : 311
> Loading partition {dt=1}
>  Time taken for adding to write entity : 3
> OK
> Time taken: 27.659 seconds
> hive (mythdb_hadooppf_17544)> dfs -ls -R 
> /tmp/mythdb_hadooppf_17544/results_partitioned;
> drwxrwxrwt   - dfsload hdfs  0 2017-01-10 23:13 
> /tmp/mythdb_hadooppf_17544/results_partitioned/dt=1
> drwxrwxrwt   - dfsload hdfs  0 2017-01-10 23:13 
> /tmp/mythdb_hadooppf_17544/results_partitioned/dt=1/1
> -rwxrwxrwt   3 dfsload hdfs349 2017-01-10 23:13 
> /tmp/mythdb_hadooppf_17544/results_partitioned/dt=1/1/00_0
> drwxrwxrwt   - dfsload hdfs  0 2017-01-10 23:13 
> /tmp/mythdb_hadooppf_17544/results_partitioned/dt=1/2
> -rwxrwxrwt   3 dfsload hdfs368 2017-01-10 23:13 
> /tmp/mythdb_hadooppf_17544/results_partitioned/dt=1/2/00_0
> {noformat}
> These results can only be read if {{mapred.input.dir.recursive=true}}, as 
> {{TezCompiler::init()}} seems to do. But the Hadoop default for this is 
> {{false}}. This leads to the following errors:
> 1. Running {{CONCATENATE}} on the partition causes data loss.
> {noformat}
> hive --database mythdb_hadooppf_17544 -e " set mapred.input.dir.recursive; 
> alter table results_partitioned partition ( dt='1' ) concatenate ; set 
> mapred.input.dir.recursive; "
> ...
> OK
> Time taken: 2.151 seconds
> mapred.input.dir.recursive=false
> Status: Running (Executing on YARN cluster with App id 
> application_1481756273279_5088754)
> 
> VERTICES      STATUS      TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
> File Merge    SUCCEEDED       0          0        0        0       0       0
> VERTICES: 01/01  [>>--] 0%  ELAPSED TIME: 0.35 s
> 
> Loading data to table mythdb_hadooppf_17544.results_partitioned partition 
> (dt=1)
> Moved: 
> 'hdfs://cluster-nn1.mygrid.myth.net:8020/tmp/mythdb_hadooppf_17544/results_partitioned/dt=1/1'
>  to trash at: 
> hdfs://cluster-nn1.mygrid.myth.net:8020/user/dfsload/.Trash/Current
> Moved: 
> 'hdfs://cluster-nn1.mygrid.myth.net:8020/tmp/mythdb_hadooppf_17544/results_partitioned/dt=1/2'
>  to trash at: 
> hdfs://cluster-nn1.mygrid.myth.net:8020/user/dfsload/.Trash/Current
> OK
> Time taken: 25.873 seconds
> $ hdfs dfs -count -h /tmp/mythdb_hadooppf_17544/results_partitioned/dt=1
> 1            0                  0 /tmp/mythdb_hadooppf_17544/results_partitioned/dt=1
> {noformat}
> 2. hive.merge.tezfiles is busted, because the merge-task attempts to merge 
> files across {{results_partitioned/dt=1/1}} and 
> {{results_partitioned/dt=1/2}}:
> {noformat}
> $ hive --database mythdb_hadooppf_17544 -e " set hive.merge.tezfiles=true; 
> insert overwrite table results_partitioned partition( dt ) select 'goo', 
> 'bar', 'foo', '1' from source UNION ALL select 'go', 'far', 'moo', '1' from 
> source; "
> ...
> Query ID = 

[jira] [Commented] (HIVE-13509) HCatalog getSplits should ignore the partition with invalid path

2016-04-14 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15242128#comment-15242128
 ] 

Rohini Palaniswamy commented on HIVE-13509:
---

bq. hcat.input.ignore.invalid.path but set it to default true which returns 
nothing for invalid (or empty) partition?
  The default will have to be false, similar to the other failure.percent
setting examples I gave. Only someone whose data and metadata are already out of
sync, and who is OK with that, should have to turn it on.
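
A rough sketch, assuming a boolean property named hcat.input.ignore.invalid.path
that defaults to false as argued above, of how split generation could skip
partitions whose directories are missing. This is only an illustration of the
proposed behaviour, not the actual HCatBaseInputFormat code:

{code}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class InvalidPartitionPathSketch {

  // Assumed property name and default: false, so missing data fails loudly.
  static final String IGNORE_INVALID_PATH = "hcat.input.ignore.invalid.path";

  // Returns only the partition directories that exist; fails on the first
  // missing one unless the user has explicitly opted in to ignoring them.
  static List<Path> filterPartitionDirs(Configuration conf, List<Path> partitionDirs)
      throws IOException {
    boolean ignoreInvalid = conf.getBoolean(IGNORE_INVALID_PATH, false);
    List<Path> valid = new ArrayList<>();
    for (Path dir : partitionDirs) {
      if (dir.getFileSystem(conf).exists(dir)) {
        valid.add(dir);
      } else if (!ignoreInvalid) {
        // Default behaviour: surface the discrepancy instead of silently dropping data.
        throw new IOException("Partition directory does not exist: " + dir);
      }
      // If the flag is enabled, the missing partition simply contributes no splits.
    }
    return valid;
  }
}
{code}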

> HCatalog getSplits should ignore the partition with invalid path
> 
>
> Key: HIVE-13509
> URL: https://issues.apache.org/jira/browse/HIVE-13509
> Project: Hive
>  Issue Type: Improvement
>  Components: HCatalog
>Reporter: Chaoyu Tang
>Assignee: Chaoyu Tang
> Attachments: HIVE-13509.patch
>
>
> It is quite common for a partition directory and its HMS metadata to be out of
> sync, simply because the directory can be added or deleted externally with hdfs
> shell commands. Technically this should be fixed with MSCK, alter table ..
> add/drop commands, etc., but sometimes that is not practical, especially in a
> multi-tenant environment. The discrepancy does not cause any problem for Hive,
> which simply returns no rows for a partition with an invalid (e.g. non-existing)
> path, but it fails a Pig load with HCatLoader, because HCatBaseInputFormat
> getSplits throws an error when building a split for the non-existing path. The
> error message might look like:
> {code}
> Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does 
> not exist: 
> hdfs://xyz.com:8020/user/hive/warehouse/xyz/date=2016-01-01/country=BR
>   at 
> org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
>   at 
> org.apache.hive.hcatalog.mapreduce.HCatBaseInputFormat.getSplits(HCatBaseInputFormat.java:162)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:274)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-13509) HCatalog getSplits should ignore the partition with invalid path

2016-04-14 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15241989#comment-15241989
 ] 

Rohini Palaniswamy commented on HIVE-13509:
---

IMHO, Hive should also be throwing an error if data does not exist, because the
results returned are incomplete and wrong. Data integrity is important. If some
users are OK with that, it can be a configurable option for them, but it cannot
be the default (at least with Pig). For example, mapred.max.map.failures.percent
and mapred.max.reduce.failures.percent are useful for users who can tolerate
some amount of failure, but the default is 0. The same goes for
pig.error.threshold.percent.
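
For comparison, a small hypothetical sketch of how a job that can tolerate
partial failure has to opt in explicitly (the 5% values are made up; the point
is that the defaults stay at 0):

{code}
import org.apache.hadoop.conf.Configuration;

public class FailureToleranceSketch {

  static Configuration tolerantJobConf() {
    Configuration conf = new Configuration();
    // Both settings default to 0: any task failure fails the job. A user who
    // is OK with partial results must raise the thresholds explicitly.
    conf.setInt("mapred.max.map.failures.percent", 5);
    conf.setInt("mapred.max.reduce.failures.percent", 5);
    return conf;
  }
}
{code}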




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-13509) HCatalog getSplits should ignore the partition with invalid path

2016-04-13 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240242#comment-15240242
 ] 

Rohini Palaniswamy commented on HIVE-13509:
---

bq. In ETL jobs using Pig, we might actually prefer a failure when the input 
data isn't available. Wouldn't this fix break those semantics for Pig?
  Yes. Missing data in output is not acceptable.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)