[ https://issues.apache.org/jira/browse/HIVE-15575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15818351#comment-15818351 ]

Rohini Palaniswamy commented on HIVE-15575:
-------------------------------------------

The concept of VertexGroups was added in Tez specifically for the union case, to 
support writing to the same directory from different vertices. Including the 
vertex id and output id in the part-file name avoids any file-name conflicts 
that would cause overwrites. So sub-directories should not be required to 
implement union on Tez.
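
To illustrate the point about the naming scheme (a hypothetical sketch for 
clarity, not actual Tez code): if each vertex qualifies its part-file name with 
its own id, two vertices writing to the same directory cannot overwrite each 
other's output.

```python
# Illustrative sketch (NOT Tez's real implementation): why embedding a
# vertex/output id in the part-file name prevents collisions when two
# vertices in a VertexGroup write to the same output directory.

def part_file_name(task_index, vertex_id=None):
    """Build a part-file name; optionally qualify it with a vertex id."""
    base = f"{task_index:06d}_0"
    return f"{base}_{vertex_id}" if vertex_id is not None else base

# Without a vertex id, task 0 of each vertex produces the same name,
# so the second writer would overwrite the first.
assert part_file_name(0) == "000000_0"

# With the vertex id included, the names are distinct and both files
# can coexist in one directory -- no sub-directories needed.
names = {part_file_name(0, vertex_id=1), part_file_name(0, vertex_id=2)}
assert names == {"000000_0_1", "000000_0_2"}
```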

> ALTER TABLE CONCATENATE and hive.merge.tezfiles seems busted for UNION ALL 
> output
> ---------------------------------------------------------------------------------
>
>                 Key: HIVE-15575
>                 URL: https://issues.apache.org/jira/browse/HIVE-15575
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Mithun Radhakrishnan
>            Priority: Critical
>
> Hive {{UNION ALL}} produces data in sub-directories under the table/partition 
> directories. E.g.
> {noformat}
> hive (mythdb_hadooppf_17544)> create table source ( foo string, bar string, 
> goo string ) stored as textfile;
> OK
> Time taken: 0.322 seconds
> hive (mythdb_hadooppf_17544)> create table results_partitioned( foo string, 
> bar string, goo string ) partitioned by ( dt string ) stored as orcfile;
> OK
> Time taken: 0.322 seconds
> hive (mythdb_hadooppf_17544)> set hive.merge.tezfiles=false; insert overwrite 
> table results_partitioned partition( dt ) select 'goo', 'bar', 'foo', '1' 
> from source UNION ALL select 'go', 'far', 'moo', '1' from source;
> ...
> Loading data to table mythdb_hadooppf_17544.results_partitioned partition 
> (dt=null)
>          Time taken for load dynamic partitions : 311
>         Loading partition {dt=1}
>          Time taken for adding to write entity : 3
> OK
> Time taken: 27.659 seconds
> hive (mythdb_hadooppf_17544)> dfs -ls -R 
> /tmp/mythdb_hadooppf_17544/results_partitioned;
> drwxrwxrwt   - dfsload hdfs          0 2017-01-10 23:13 
> /tmp/mythdb_hadooppf_17544/results_partitioned/dt=1
> drwxrwxrwt   - dfsload hdfs          0 2017-01-10 23:13 
> /tmp/mythdb_hadooppf_17544/results_partitioned/dt=1/1
> -rwxrwxrwt   3 dfsload hdfs        349 2017-01-10 23:13 
> /tmp/mythdb_hadooppf_17544/results_partitioned/dt=1/1/000000_0
> drwxrwxrwt   - dfsload hdfs          0 2017-01-10 23:13 
> /tmp/mythdb_hadooppf_17544/results_partitioned/dt=1/2
> -rwxrwxrwt   3 dfsload hdfs        368 2017-01-10 23:13 
> /tmp/mythdb_hadooppf_17544/results_partitioned/dt=1/2/000000_0
> {noformat}
> These results can only be read if {{mapred.input.dir.recursive=true}}, as 
> {{TezCompiler::init()}} seems to do. But the Hadoop default for this is 
> {{false}}. This leads to the following errors:
> 1. Running {{CONCATENATE}} on the partition causes data loss:
> {noformat}
> hive --database mythdb_hadooppf_17544 -e " set mapred.input.dir.recursive; 
> alter table results_partitioned partition ( dt='1' ) concatenate ; set 
> mapred.input.dir.recursive; "
> ...
> OK
> Time taken: 2.151 seconds
> mapred.input.dir.recursive=false
> Status: Running (Executing on YARN cluster with App id 
> application_1481756273279_5088754)
> --------------------------------------------------------------------------------
>         VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  
> KILLED
> --------------------------------------------------------------------------------
> File Merge         SUCCEEDED      0          0        0        0       0      
>  0
> --------------------------------------------------------------------------------
> VERTICES: 01/01  [>>--------------------------] 0%    ELAPSED TIME: 0.35 s
> --------------------------------------------------------------------------------
> Loading data to table mythdb_hadooppf_17544.results_partitioned partition 
> (dt=1)
> Moved: 
> 'hdfs://cluster-nn1.mygrid.myth.net:8020/tmp/mythdb_hadooppf_17544/results_partitioned/dt=1/1'
>  to trash at: 
> hdfs://cluster-nn1.mygrid.myth.net:8020/user/dfsload/.Trash/Current
> Moved: 
> 'hdfs://cluster-nn1.mygrid.myth.net:8020/tmp/mythdb_hadooppf_17544/results_partitioned/dt=1/2'
>  to trash at: 
> hdfs://cluster-nn1.mygrid.myth.net:8020/user/dfsload/.Trash/Current
> OK
> Time taken: 25.873 seconds
> $ hdfs dfs -count -h /tmp/mythdb_hadooppf_17544/results_partitioned/dt=1
>            1            0                  0 
> /tmp/mythdb_hadooppf_17544/results_partitioned/dt=1
> {noformat}
> 2. {{hive.merge.tezfiles}} is busted, because the merge task attempts to merge 
> files across {{results_partitioned/dt=1/1}} and 
> {{results_partitioned/dt=1/2}}:
> {noformat}
> $ hive --database mythdb_hadooppf_17544 -e " set hive.merge.tezfiles=true; 
> insert overwrite table results_partitioned partition( dt ) select 'goo', 
> 'bar', 'foo', '1' from source UNION ALL select 'go', 'far', 'moo', '1' from 
> source; "
> ...
> Query ID = dfsload_20170110233558_51289333-d9da-4851-8671-bfe653d26e45
> Total jobs = 3
> Launching Job 1 out of 3
> Status: Running (Executing on YARN cluster with App id 
> application_1481756273279_5089989)
> --------------------------------------------------------------------------------
>         VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  
> KILLED
> --------------------------------------------------------------------------------
> Map 1 ..........   SUCCEEDED      1          1        0        0       0      
>  0
> Map 3 ..........   SUCCEEDED      1          1        0        0       0      
>  0
> --------------------------------------------------------------------------------
> VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 13.07 s
> --------------------------------------------------------------------------------
> Stage-4 is filtered out by condition resolver.
> Stage-3 is selected by condition resolver.
> Stage-5 is filtered out by condition resolver.
> Launching Job 3 out of 3
> Status: Running (Executing on YARN cluster with App id 
> application_1481756273279_5089989)
> --------------------------------------------------------------------------------
>         VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  
> KILLED
> --------------------------------------------------------------------------------
> File Merge           RUNNING      1          0        1        0       2      
>  0
> --------------------------------------------------------------------------------
> VERTICES: 00/01  [>>--------------------------] 0%    ELAPSED TIME: 3.06 s
> --------------------------------------------------------------------------------
> ...
> {noformat}
> The {{File Merge}} fails with the following:
> {noformat}
> TaskAttempt 3 failed, info=[Error: Failure while running 
> task:java.lang.RuntimeException: java.lang.RuntimeException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: 
> Multiple partitions for one merge mapper: 
> hdfs://cluster-nn1.mygrid.myth.net:8020/tmp/mythdb_hadooppf_17544/results_partitioned/.hive-staging_hive_2017-01-10_23-35-58_881_4062579557908207136-1/-ext-10002/dt=1/2
>  NOT EQUAL TO 
> hdfs://cluster-nn1.mygrid.myth.net:8020/tmp/mythdb_hadooppf_17544/results_partitioned/.hive-staging_hive_2017-01-10_23-35-58_881_4062579557908207136-1/-ext-10002/dt=1/1
>         at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:171)
>         at 
> org.apache.hadoop.hive.ql.exec.tez.MergeFileTezProcessor.run(MergeFileTezProcessor.java:42)
>         at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:362)
>         at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:192)
>         at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:184)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1738)
>         at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:184)
>         at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:180)
>         at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.RuntimeException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: 
> Multiple partitions for one merge mapper: 
> hdfs://cluster-nn1.mygrid.myth.net:8020/tmp/mythdb_hadooppf_17544/results_partitioned/.hive-staging_hive_2017-01-10_23-35-58_881_4062579557908207136-1/-ext-10002/dt=1/2
>  NOT EQUAL TO 
> hdfs://cluster-nn1.mygrid.myth.net:8020/tmp/mythdb_hadooppf_17544/results_partitioned/.hive-staging_hive_2017-01-10_23-35-58_881_4062579557908207136-1/-ext-10002/dt=1/1
>         at 
> org.apache.hadoop.hive.ql.exec.tez.MergeFileRecordProcessor.processRow(MergeFileRecordProcessor.java:217)
>         at 
> org.apache.hadoop.hive.ql.exec.tez.MergeFileRecordProcessor.run(MergeFileRecordProcessor.java:151)
>         at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:148)
>         ... 14 more
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.io.IOException: Multiple partitions for one merge mapper: 
> hdfs://cluster-nn1.mygrid.myth.net:8020/tmp/mythdb_hadooppf_17544/results_partitioned/.hive-staging_hive_2017-01-10_23-35-58_881_4062579557908207136-1/-ext-10002/dt=1/2
>  NOT EQUAL TO 
> hdfs://cluster-nn1.mygrid.myth.net:8020/tmp/mythdb_hadooppf_17544/results_partitioned/.hive-staging_hive_2017-01-10_23-35-58_881_4062579557908207136-1/-ext-10002/dt=1/1
>         at 
> org.apache.hadoop.hive.ql.exec.OrcFileMergeOperator.processKeyValuePairs(OrcFileMergeOperator.java:159)
>         at 
> org.apache.hadoop.hive.ql.exec.OrcFileMergeOperator.process(OrcFileMergeOperator.java:62)
>         at 
> org.apache.hadoop.hive.ql.exec.tez.MergeFileRecordProcessor.processRow(MergeFileRecordProcessor.java:208)
>         ... 16 more
> Caused by: java.io.IOException: Multiple partitions for one merge mapper: 
> hdfs://cluster-nn1.mygrid.myth.net:8020/tmp/mythdb_hadooppf_17544/results_partitioned/.hive-staging_hive_2017-01-10_23-35-58_881_4062579557908207136-1/-ext-10002/dt=1/2
>  NOT EQUAL TO 
> hdfs://cluster-nn1.mygrid.myth.net:8020/tmp/mythdb_hadooppf_17544/results_partitioned/.hive-staging_hive_2017-01-10_23-35-58_881_4062579557908207136-1/-ext-10002/dt=1/1
>         at 
> org.apache.hadoop.hive.ql.exec.AbstractFileMergeOperator.checkPartitionsMatch(AbstractFileMergeOperator.java:174)
>         at 
> org.apache.hadoop.hive.ql.exec.AbstractFileMergeOperator.fixTmpPath(AbstractFileMergeOperator.java:191)
>         at 
> org.apache.hadoop.hive.ql.exec.OrcFileMergeOperator.processKeyValuePairs(OrcFileMergeOperator.java:86)
>         ... 18 more
> ]], Vertex did not succeed due to OWN_TASK_FAILURE, failedTasks:1 
> killedTasks:0, Vertex vertex_1481756273279_5089989_2_00 [File Merge] 
> killed/failed due to:OWN_TASK_FAILURE]DAG did not succeed due to 
> VERTEX_FAILURE. failedVertices:1 killedVertices:0
> {noformat}
> 3. Data produced with Hive {{UNION ALL}} will not be readable by 
> Pig/HCatalog without {{mapred.input.dir.recursive}}.
> Setting {{mapred.input.dir.recursive=true}} in {{hive-site.xml}} should 
> resolve the first and third problems. But is that the recommendation? It is 
> intrusive, and it doesn't solve #2. Pig's {{UNION}} doesn't work this way, 
> as per my limited understanding.
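> For reference, the global override mentioned above would look like this in 
> {{hive-site.xml}} (shown for illustration only; as noted, whether such a 
> global override is advisable is exactly the open question):
> {noformat}
> <property>
>   <name>mapred.input.dir.recursive</name>
>   <value>true</value>
> </property>
> {noformat}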



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)