[jira] [Commented] (HIVE-17113) Duplicate bucket files can get written to table by runaway task
[ https://issues.apache.org/jira/browse/HIVE-17113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16112279#comment-16112279 ]

Lefty Leverenz commented on HIVE-17113:
---

Doc note: This adds *hive.exec.move.files.from.source.dir* to HiveConf.java, so it needs to be documented in the wiki.
* [Configuration Properties -- Query and DDL Execution | https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-QueryandDDLExecution]

Added a TODOC3.0 label.

> Duplicate bucket files can get written to table by runaway task
> ---
>
> Key: HIVE-17113
> URL: https://issues.apache.org/jira/browse/HIVE-17113
> Project: Hive
> Issue Type: Bug
> Components: Query Processor
> Reporter: Jason Dere
> Assignee: Jason Dere
> Labels: TODOC3.0
> Fix For: 3.0.0
>
> Attachments: HIVE-17113.1.patch, HIVE-17113.2.patch,
> HIVE-17113.3.patch
>
>
> Saw a table get a duplicate bucket file from a Hive query. It looks like the
> following happened:
> 1. Task attempt A_0 starts, but then stops making progress
> 2. The job was running with speculative execution on, and task attempt A_1 is
> started
> 3. Task attempt A_1 finishes execution and saves its output to the temp
> directory.
> 5. A task kill is sent to A_0, though this does not appear to actually kill A_0
> 6. The job for the query finishes and Utilities.mvFileToFinalPath() calls
> Utilities.removeTempOrDuplicateFiles() to check for duplicate bucket files
> 7. A_0 (still running) finally finishes and saves its file to the temp
> directory. At this point we now have duplicate bucket files - oops!
> 8. Utilities.removeTempOrDuplicateFiles() moves the temp directory to the
> final location, where it is later moved to the partition directory.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
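The sequence in the quoted description can be replayed in miniature on a local filesystem. The sketch below is illustrative only and is not Hive's code: the class `RunawayTaskRace`, the helpers `bucketOf` and `removeDuplicates`, and the `<bucket>_<attempt>` file naming are assumptions made for the example. It shows how a file written after the dedup pass, but before the directory rename, survives into the final location.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Illustrative only: mimics the order of events in the report, not Hive's real code. */
final class RunawayTaskRace {

    /** Assumed naming scheme "<bucket>_<attempt>", e.g. "000000_1". */
    static String bucketOf(Path file) {
        String name = file.getFileName().toString();
        return name.substring(0, name.lastIndexOf('_'));
    }

    /** Stand-in for the dedup pass: keep one file per bucket, delete the rest. */
    static void removeDuplicates(Path dir) throws IOException {
        Set<String> seen = new HashSet<>();
        try (Stream<Path> files = Files.list(dir)) {
            for (Path f : files.sorted().collect(Collectors.toList())) {
                if (!seen.add(bucketOf(f))) {
                    Files.delete(f);
                }
            }
        }
    }

    /** Replays steps 3, 6, 7 and 8; returns the file names found in the final directory. */
    static List<String> replay(Path root) throws IOException {
        Path tmp = Files.createDirectory(root.resolve("_tmp"));
        Path dest = root.resolve("final");
        Files.write(tmp.resolve("000000_1"), "rows from A_1".getBytes()); // step 3: A_1 finishes
        removeDuplicates(tmp);                                            // step 6: nothing to remove yet
        Files.write(tmp.resolve("000000_0"), "rows from A_0".getBytes()); // step 7: runaway A_0 finishes late
        Files.move(tmp, dest);                                            // step 8: whole directory is renamed
        try (Stream<Path> files = Files.list(dest)) {
            return files.map(p -> p.getFileName().toString()).sorted().collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("hive17113");
        System.out.println(replay(root)); // prints [000000_0, 000000_1]: two files for one bucket
    }
}
```

Because the rename moves whatever is in the directory at that instant, any check that ran earlier says nothing about files that arrive in between.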
[jira] [Commented] (HIVE-17113) Duplicate bucket files can get written to table by runaway task
[ https://issues.apache.org/jira/browse/HIVE-17113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16107703#comment-16107703 ]

Ashutosh Chauhan commented on HIVE-17113:
---

+1
[jira] [Commented] (HIVE-17113) Duplicate bucket files can get written to table by runaway task
[ https://issues.apache.org/jira/browse/HIVE-17113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16107690#comment-16107690 ]

Jason Dere commented on HIVE-17113:
---

[~ashutoshc] can you review this one?
[jira] [Commented] (HIVE-17113) Duplicate bucket files can get written to table by runaway task
[ https://issues.apache.org/jira/browse/HIVE-17113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102718#comment-16102718 ]

Hive QA commented on HIVE-17113:
---

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12879072/HIVE-17113.3.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 9 failed/errored test(s), 11012 tests executed
*Failed tests:*
{noformat}
TestPerfCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=235)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[llap_smb] (batchId=144)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_dynamic_partition_pruning] (batchId=168)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_vectorized_dynamic_partition_pruning] (batchId=168)
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainuser_3] (batchId=99)
org.apache.hadoop.hive.metastore.TestHiveMetaStoreStatsMerge.testStatsMerge (batchId=206)
org.apache.hive.hcatalog.api.TestHCatClient.testPartitionRegistrationWithCustomSchema (batchId=179)
org.apache.hive.hcatalog.api.TestHCatClient.testPartitionSpecRegistrationWithCustomSchema (batchId=179)
org.apache.hive.hcatalog.api.TestHCatClient.testTableSchemaPropagation (batchId=179)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/6142/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/6142/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-6142/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 9 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12879072 - PreCommit-HIVE-Build
[jira] [Commented] (HIVE-17113) Duplicate bucket files can get written to table by runaway task
[ https://issues.apache.org/jira/browse/HIVE-17113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16100951#comment-16100951 ]

Jason Dere commented on HIVE-17113:
---

Spoke offline to [~ashutoshc], who recommended the following approach:
- During Utilities.removeTempOrDuplicateFiles(), maintain a list of files found/deduped. This list of files will be used to determine which files are moved to the destination directory.
- A configurable setting will be added here to control whether this file list will be used to control which files will be moved, or if the existing behavior will be used.
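A rough sketch of the recommended approach, under stated assumptions: the dedup pass returns the list of surviving files, and the move step transfers only those files instead of renaming the whole directory. Everything here is hypothetical and written against java.nio.file rather than Hive's filesystem layer: the class `MoveListedFiles`, the helpers `dedupe`, `moveToFinal` and `names`, the `moveListedOnly` flag standing in for the configurable setting, and the rule of keeping the first name in sort order (Hive's real selection rule may differ).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Illustrative only: move exactly the files that survived the dedup pass. */
final class MoveListedFiles {

    /** Assumed naming scheme "<bucket>_<attempt>", e.g. "000000_1". */
    static String bucketOf(Path file) {
        String name = file.getFileName().toString();
        return name.substring(0, name.lastIndexOf('_'));
    }

    /** Deletes duplicates and returns the surviving files, one per bucket. */
    static List<Path> dedupe(Path dir) throws IOException {
        Map<String, Path> kept = new TreeMap<>();
        try (Stream<Path> files = Files.list(dir)) {
            for (Path f : files.sorted().collect(Collectors.toList())) {
                if (kept.putIfAbsent(bucketOf(f), f) != null) {
                    Files.delete(f); // a file for this bucket was already kept
                }
            }
        }
        return new ArrayList<>(kept.values());
    }

    /** moveListedOnly plays the role of the configurable setting (hypothetical name). */
    static void moveToFinal(Path tmp, Path dest, List<Path> keep, boolean moveListedOnly)
            throws IOException {
        if (!moveListedOnly) {
            Files.move(tmp, dest); // existing behavior: rename the whole directory
            return;
        }
        Files.createDirectories(dest);
        for (Path f : keep) {
            Files.move(f, dest.resolve(f.getFileName())); // late arrivals are never listed
        }
    }

    /** Sorted file names in a directory, for inspection. */
    static List<String> names(Path dir) throws IOException {
        try (Stream<Path> files = Files.list(dir)) {
            return files.map(p -> p.getFileName().toString()).sorted().collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("hive17113");
        Path tmp = Files.createDirectory(root.resolve("_tmp"));
        Files.write(tmp.resolve("000000_1"), "rows from A_1".getBytes());
        List<Path> keep = dedupe(tmp);
        Files.write(tmp.resolve("000000_0"), "late rows from A_0".getBytes()); // runaway attempt
        moveToFinal(tmp, root.resolve("final"), keep, true);
        System.out.println(names(root.resolve("final"))); // prints [000000_1]
    }
}
```

The trade-off is one move per file rather than a single directory rename, which is presumably why the comment proposes keeping the existing behavior available behind a setting.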
[jira] [Commented] (HIVE-17113) Duplicate bucket files can get written to table by runaway task
[ https://issues.apache.org/jira/browse/HIVE-17113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16100564#comment-16100564 ]

Jason Dere commented on HIVE-17113:
---

Looks like in the case of skewjoin in Spark, there can be multiple jobs which copy files into the same temp directory. When this happens, there can be name collisions - in the test there are collisions on files 00_0 and 01_0, which get renamed to 00_0_1 and 01_0_1. Since the removeTempOrDuplicateFiles() is now being called on the destination directory, it's not able to correctly disambiguate the 00_0_1, 01_0_1 files.

Since it looks like the destination directory can potentially hold results from more than one job, it does not seem to be correct to simply run removeTempOrDuplicateFiles() on the destination directory. Maybe we have to change the logic to the following:
1) Move the temp directory to a new directory name, to prevent additional files from being added by any runaway processes.
2) Run removeTempOrDuplicateFiles() on this renamed temp directory
3) Run renameOrMoveFiles() to move the renamed temp directory to the final location.

Though step 1 might be expensive for cloud storage (basically means performing twice the file moves right?) .. [~ashutoshc] should doing step 1 be a configurable setting?
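The three-step logic proposed in this comment can be sketched as follows. This is a local-filesystem illustration under assumptions, not the patch itself: the class `RenameThenDedupe`, the `.sealed` suffix, and the helpers `bucketOf`, `removeDuplicates`, `commit` and `names` are invented for the example, and whether a rename actually blocks a late writer depends on the storage system's semantics.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Illustrative only: seal the temp directory by renaming it before deduplicating. */
final class RenameThenDedupe {

    /** Assumed naming scheme "<bucket>_<attempt>", e.g. "000000_1". */
    static String bucketOf(Path file) {
        String name = file.getFileName().toString();
        return name.substring(0, name.lastIndexOf('_'));
    }

    /** Stand-in for the dedup pass: keep one file per bucket, delete the rest. */
    static void removeDuplicates(Path dir) throws IOException {
        Set<String> seen = new HashSet<>();
        try (Stream<Path> files = Files.list(dir)) {
            for (Path f : files.sorted().collect(Collectors.toList())) {
                if (!seen.add(bucketOf(f))) {
                    Files.delete(f);
                }
            }
        }
    }

    static void commit(Path tmp, Path dest) throws IOException {
        // 1) Rename first, so writers that still target the old path cannot add files here.
        Path sealed = tmp.resolveSibling(tmp.getFileName() + ".sealed");
        Files.move(tmp, sealed);
        // 2) Deduplicate the sealed directory, whose contents should no longer change.
        removeDuplicates(sealed);
        // 3) Move the sealed directory to the final location.
        Files.move(sealed, dest);
    }

    /** Sorted file names in a directory, for inspection. */
    static List<String> names(Path dir) throws IOException {
        try (Stream<Path> files = Files.list(dir)) {
            return files.map(p -> p.getFileName().toString()).sorted().collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("hive17113");
        Path tmp = Files.createDirectory(root.resolve("_tmp"));
        Files.write(tmp.resolve("000000_0"), "attempt 0".getBytes());
        Files.write(tmp.resolve("000000_1"), "attempt 1".getBytes());
        Files.write(tmp.resolve("000001_0"), "other bucket".getBytes());
        commit(tmp, root.resolve("final"));
        System.out.println(names(root.resolve("final"))); // prints [000000_0, 000001_0]
    }
}
```

On a local filesystem both moves are cheap renames. On object stores a directory rename is typically implemented as per-file copies, which is the cost concern raised about step 1.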
[jira] [Commented] (HIVE-17113) Duplicate bucket files can get written to table by runaway task
[ https://issues.apache.org/jira/browse/HIVE-17113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16092149#comment-16092149 ]

Jason Dere commented on HIVE-17113:
---

Seems to be causing a failure in TestSparkCliDriver skewjoin.q
[jira] [Commented] (HIVE-17113) Duplicate bucket files can get written to table by runaway task
[ https://issues.apache.org/jira/browse/HIVE-17113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16091048#comment-16091048 ]

Hive QA commented on HIVE-17113:
---

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12877705/HIVE-17113.1.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 15 failed/errored test(s), 11065 tests executed
*Failed tests:*
{noformat}
TestSSL - did not produce a TEST-*.xml file (likely timed out) (batchId=224)
org.apache.hadoop.hive.cli.TestBeeLineDriver.testCliDriver[materialized_view_create_rewrite] (batchId=238)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[llap_smb] (batchId=143)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_dynamic_partition_pruning] (batchId=167)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_dynamic_partition_pruning_2] (batchId=169)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_explainuser_1] (batchId=168)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_use_op_stats] (batchId=167)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_use_ts_stats_for_mapjoin] (batchId=168)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_vectorized_dynamic_partition_pruning] (batchId=167)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query14] (batchId=233)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query23] (batchId=233)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[skewjoin] (batchId=110)
org.apache.hive.hcatalog.api.TestHCatClient.testPartitionRegistrationWithCustomSchema (batchId=178)
org.apache.hive.hcatalog.api.TestHCatClient.testPartitionSpecRegistrationWithCustomSchema (batchId=178)
org.apache.hive.hcatalog.api.TestHCatClient.testTableSchemaPropagation (batchId=178)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/6070/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/6070/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-6070/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 15 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12877705 - PreCommit-HIVE-Build
[jira] [Commented] (HIVE-17113) Duplicate bucket files can get written to table by runaway task
[ https://issues.apache.org/jira/browse/HIVE-17113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16090666#comment-16090666 ]

Jason Dere commented on HIVE-17113:
---

Talked to [~ashutoshc] and [~sseth] about this. According to Sid this is normally handled in MR using the OutputCommitter. However, Ashutosh mentioned that Hive does not use the Hadoop OutputCommitter functionality and instead tries to handle duplicate task attempts by itself - thus the call to Utilities.removeTempOrDuplicateFiles().

A couple of solutions to this on the Hive side:
1) Changing Hive to properly use the OutputCommitter
2) Utilities.mvFileToFinalPath() should call Utilities.removeTempOrDuplicateFiles() after renaming the temp directory rather than before renaming. This is basically swapping the order of steps 6 and 8 in the Jira description, within Utilities.mvFileToFinalPath().

Gonna try to do option 2 as it looks like a simpler fix.