[jira] [Commented] (HIVE-13858) LLAP: A preempted task can end up waiting on completeInitialization if some part of the executing code suppressed the interrupt
[ https://issues.apache.org/jira/browse/HIVE-13858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15310790#comment-15310790 ]

Siddharth Seth commented on HIVE-13858:
---

The test failures are unrelated. This particular run took over 4 hours, compared to other runs which take 2 hours. However, from the logs it looks like the mvn build started at 23:00 and the test run completed at 01:19, which is 2h20m and consistent with other runs. Looks to be a delay in the run actually starting.

Committing.

> LLAP: A preempted task can end up waiting on completeInitialization if some part of the executing code suppressed the interrupt
> ---
>
> Key: HIVE-13858
> URL: https://issues.apache.org/jira/browse/HIVE-13858
> Project: Hive
> Issue Type: Bug
> Affects Versions: 2.0.0
> Reporter: Siddharth Seth
> Assignee: Siddharth Seth
> Priority: Blocker
> Labels: llap
> Attachments: HIVE-13858.01.patch, HIVE-13858.02.patch, HIVE-13858.03.patch
>
> An interrupt along with a HiveProcessor.abort call is made when attempting to preempt a task.
> In this specific case, the task was in the middle of HDFS IO - which 'handled' the interrupt by retrying. As a result the interrupt status on the thread was reset - so instead of skipping the future.get in completeInitialization - the task ended up blocking there.
> End result - a single executor slot permanently blocked in LLAP. Depending on what else is running - this can cause a cluster level deadlock.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
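For readers following the description above, here is a minimal Java sketch of the failure mode and of one possible guard. It is not the committed patch: the class SwallowedInterruptSketch, the aborted flag, and the helper ioThatSwallowsInterrupt are hypothetical names chosen for illustration, and only the method names abort and completeInitialization echo the issue text. The assumption is that preemption sets an explicit flag in addition to interrupting the thread, so the wait can be skipped even when some library code has cleared the interrupt status.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch; names are illustrative and not taken from Hive.
final class SwallowedInterruptSketch {

    // Explicit abort marker, set alongside Thread.interrupt() on preemption.
    private final AtomicBoolean aborted = new AtomicBoolean(false);

    // Preemption as described in the issue: an abort call plus an interrupt.
    void abort(Thread worker) {
        aborted.set(true);
        worker.interrupt();
    }

    // Stand-in for IO code that "handles" an interrupt by retrying.
    // Thread.interrupted() reads AND clears the interrupt status.
    static void ioThatSwallowsInterrupt() {
        if (Thread.interrupted()) {
            // a real client would retry the IO here
        }
    }

    // Returns true if the wait was skipped or cut short because of an abort.
    boolean completeInitialization(Future<?> init) {
        if (aborted.get()) {
            return true; // do not block: the task has been preempted
        }
        try {
            init.get();
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // restore status for callers
            return true;
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    // Single-threaded walk-through of the scenario from the description.
    static boolean runScenario() {
        SwallowedInterruptSketch task = new SwallowedInterruptSketch();
        CompletableFuture<Void> neverCompletes = new CompletableFuture<>();

        task.abort(Thread.currentThread()); // preempt: flag + interrupt
        ioThatSwallowsInterrupt();          // interrupt status is now cleared

        boolean interruptLost = !Thread.currentThread().isInterrupted();
        boolean waitSkipped = task.completeInitialization(neverCompletes);
        return interruptLost && waitSkipped;
    }

    public static void main(String[] args) {
        System.out.println("wait skipped despite lost interrupt: " + runScenario());
    }
}
```

A flag check alone still leaves a window in which abort arrives after the check but before get() blocks. Closing that window would need something more, for example cancelling the future from abort or using a timed get(); the sketch does not attempt it.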
[jira] [Commented] (HIVE-13858) LLAP: A preempted task can end up waiting on completeInitialization if some part of the executing code suppressed the interrupt
[ https://issues.apache.org/jira/browse/HIVE-13858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15309008#comment-15309008 ]

Hive QA commented on HIVE-13858:

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12806662/HIVE-13858.03.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 12 failed/errored test(s), 10195 tests executed
*Failed tests:*
{noformat}
TestHWISessionManager - did not produce a TEST-*.xml file
TestJdbcWithMiniHA - did not produce a TEST-*.xml file
TestJdbcWithMiniMr - did not produce a TEST-*.xml file
TestOperationLoggingAPIWithTez - did not produce a TEST-*.xml file
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_ivyDownload
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_list_bucket_dml_12
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_stats_list_bucket
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_subquery_multiinsert
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver_constprog_partitioner
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver_index_bitmap3
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_vector_complex_all
org.apache.hadoop.hive.ql.lockmgr.TestDbTxnManager2.testLocksInSubquery
{noformat}

Test results: http://ec2-54-177-240-2.us-west-1.compute.amazonaws.com/job/PreCommit-HIVE-MASTER-Build/472/testReport
Console output: http://ec2-54-177-240-2.us-west-1.compute.amazonaws.com/job/PreCommit-HIVE-MASTER-Build/472/console
Test logs: http://ec2-50-18-27-0.us-west-1.compute.amazonaws.com/logs/PreCommit-HIVE-MASTER-Build-472/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 12 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12806662 - PreCommit-HIVE-MASTER-Build
[jira] [Commented] (HIVE-13858) LLAP: A preempted task can end up waiting on completeInitialization if some part of the executing code suppressed the interrupt
[ https://issues.apache.org/jira/browse/HIVE-13858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15308553#comment-15308553 ]

Siddharth Seth commented on HIVE-13858:
---

[~jcamachorodriguez] - this is pending a jenkins run - which is next in line. If it comes back clean, I'll commit it. I would ideally like to include it in 2.1 - given that preemption in LLAP can run into issues without this. If you're creating the RC tomorrow - I guess it gets documented as a known issue and will get fixed in 2.1.1 or 2.2
[jira] [Commented] (HIVE-13858) LLAP: A preempted task can end up waiting on completeInitialization if some part of the executing code suppressed the interrupt
[ https://issues.apache.org/jira/browse/HIVE-13858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15308453#comment-15308453 ]

Jesus Camacho Rodriguez commented on HIVE-13858:

[~sseth], what is the status on this one? I plan to create the first 2.1.0 RC tomorrow and this is marked as Critical. Should it go in, or can it be deferred? Thanks
[jira] [Commented] (HIVE-13858) LLAP: A preempted task can end up waiting on completeInitialization if some part of the executing code suppressed the interrupt
[ https://issues.apache.org/jira/browse/HIVE-13858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15304781#comment-15304781 ]

Prasanth Jayachandran commented on HIVE-13858:
--

LGTM, +1. Pending tests
[jira] [Commented] (HIVE-13858) LLAP: A preempted task can end up waiting on completeInitialization if some part of the executing code suppressed the interrupt
[ https://issues.apache.org/jira/browse/HIVE-13858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303670#comment-15303670 ]

Prasanth Jayachandran commented on HIVE-13858:
--

Looks like RB went down when I was commenting. Adding the comments here:
1) CancellationException does not seem to be caught. Is it not expected?
2) I think we can remove the TODO for throwing HiveException and replace it with InterruptedException. IIRC throwing InterruptedException will also clear the interrupt status flag, so the Thread.interrupted() call is also not required. TezProcessor catches Throwable anyway, so it should be safe to throw InterruptedException.
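On point 2 of the review comment, the interrupt-flag behaviour may be worth double-checking against the JDK. As far as the Thread Javadoc describes it, blocking methods such as Thread.sleep clear the interrupt status when they throw InterruptedException, while a `throw new InterruptedException()` written by hand is an ordinary throw and leaves the flag untouched. The small Java sketch below illustrates the difference. InterruptFlagSketch and its method names are hypothetical and say nothing about what the patch itself does.

```java
// Hypothetical sketch; class and method names are illustrative only.
final class InterruptFlagSketch {

    // Throwing InterruptedException by hand does not modify the flag.
    static boolean flagSurvivesExplicitThrow() {
        Thread.currentThread().interrupt(); // set the flag on this thread
        try {
            throw new InterruptedException("thrown by hand");
        } catch (InterruptedException e) {
            // Thread.interrupted() returns the current status and clears it,
            // so this also leaves the thread in a clean state afterwards.
            return Thread.interrupted();
        }
    }

    // A blocking JDK call clears the flag before it throws.
    static boolean flagClearedByBlockingCall() {
        Thread.currentThread().interrupt(); // set the flag on this thread
        try {
            Thread.sleep(1000); // throws at once because the flag is set
            return false;       // not expected to be reached
        } catch (InterruptedException e) {
            return !Thread.currentThread().isInterrupted();
        }
    }

    public static void main(String[] args) {
        System.out.println("flag survives explicit throw: " + flagSurvivesExplicitThrow());
        System.out.println("flag cleared by blocking call: " + flagClearedByBlockingCall());
    }
}
```

If the flag has to be clear after an explicit throw, an explicit Thread.interrupted() call would still be needed under this reading.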
[jira] [Commented] (HIVE-13858) LLAP: A preempted task can end up waiting on completeInitialization if some part of the executing code suppressed the interrupt
[ https://issues.apache.org/jira/browse/HIVE-13858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15301307#comment-15301307 ]

Prasanth Jayachandran commented on HIVE-13858:
--

Left some comments in RB