[jira] [Commented] (HIVE-9976) LLAP: Possible race condition in DynamicPartitionPruner for 200ms tasks
[ https://issues.apache.org/jira/browse/HIVE-9976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14363909#comment-14363909 ] Siddharth Seth commented on HIVE-9976: -- I'll take a look. Assuming this was run with Tez 0.7 snapshot ? LLAP: Possible race condition in DynamicPartitionPruner for 200ms tasks Key: HIVE-9976 URL: https://issues.apache.org/jira/browse/HIVE-9976 Project: Hive Issue Type: Sub-task Components: Tez Affects Versions: llap Reporter: Gopal V Assignee: Gunther Hagleitner Attachments: llap_vertex_200ms.png Race condition in the DynamicPartitionPruner between DynamicPartitionPruner::processVertex() and DynamicPartitionpruner::addEvent() for tasks which respond with both the result and success in a single heartbeat sequence. {code} 2015-03-16 07:05:01,589 ERROR [InputInitializer [Map 1] #0] tez.DynamicPartitionPruner: Expecting: 1, received: 0 2015-03-16 07:05:01,590 ERROR [Dispatcher thread: Central] impl.VertexImpl: Vertex Input: store_sales initializer failed, vertex=vertex_1424502260528_1113_4_04 [Map 1] org.apache.tez.dag.app.dag.impl.AMUserCodeException: org.apache.hadoop.hive.ql.metadata.HiveException: Incorrect event count in dynamic parition pruning {code} !llap_vertex_200ms.png! All 4 upstream vertices of Map 1 need to finish within ~200ms to trigger this, which seems to be consistently happening with LLAP. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
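Illustrative only (not the Hive code): a minimal Java sketch of how the two paths can disagree when a source task delivers its pruning event and its success notification in the same heartbeat - if the success path runs first, it observes fewer events than expected and fails exactly like the log above.
{code}
import java.util.concurrent.atomic.AtomicInteger;

public class PrunerRaceSketch {
  private final AtomicInteger expectedEvents = new AtomicInteger(1); // one upstream task
  private final AtomicInteger receivedEvents = new AtomicInteger(0);

  // Event-delivery path (plays the role of addEvent()).
  public void onPruningEvent() {
    receivedEvents.incrementAndGet();
  }

  // Vertex-success path (plays the role of processVertex()).
  public void onSourceVertexSuccess() {
    // If this runs before onPruningEvent() for the same heartbeat, the counts
    // disagree and the check fails with "Expecting: 1, received: 0".
    if (receivedEvents.get() < expectedEvents.get()) {
      throw new IllegalStateException(
          "Expecting: " + expectedEvents.get() + ", received: " + receivedEvents.get());
    }
  }
}
{code}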
[jira] [Commented] (HIVE-9756) LLAP: use log4j 2 for llap
[ https://issues.apache.org/jira/browse/HIVE-9756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14363919#comment-14363919 ] Siddharth Seth commented on HIVE-9756: -- [~gopalv] - Tez is moving to slf4j in the 0.7 release (TEZ-2176). Unfortunately, Hadoop provides log4j as well - so this may be problematic anyway. We'll find out once the Tez patch goes in. LLAP: use log4j 2 for llap -- Key: HIVE-9756 URL: https://issues.apache.org/jira/browse/HIVE-9756 Project: Hive Issue Type: Sub-task Reporter: Gunther Hagleitner Assignee: Gopal V For the INFO logging, we'll need to use the log4j-jcl 2.x upgrade-path to get throughput-friendly logging. http://logging.apache.org/log4j/2.0/manual/async.html#Performance -- This message was sent by Atlassian JIRA (v6.3.4#6332)
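For reference, a minimal sketch of the all-async mode the linked manual describes, assuming log4j 2 (log4j-api and log4j-core) is on the classpath; the context-selector system property is the standard log4j 2 switch, not something specific to this patch, and it must be set before the first logger is created.
{code}
public class AsyncLoggingBootstrap {
  public static void main(String[] args) {
    // Route all loggers through the async (Disruptor-backed) logger context.
    System.setProperty("Log4jContextSelector",
        "org.apache.logging.log4j.core.async.AsyncLoggerContextSelector");
    org.apache.logging.log4j.Logger log =
        org.apache.logging.log4j.LogManager.getLogger(AsyncLoggingBootstrap.class);
    log.info("async logging enabled");
  }
}
{code}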
[jira] [Updated] (HIVE-9999) LLAP: Handle task rejection from daemons in the AM
[ https://issues.apache.org/jira/browse/HIVE-9999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated HIVE-9999: - Attachment: HIVE-9999.1.patch LLAP: Handle task rejection from daemons in the AM -- Key: HIVE-9999 URL: https://issues.apache.org/jira/browse/HIVE-9999 Project: Hive Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Siddharth Seth Fix For: llap Attachments: HIVE-9999.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (HIVE-9999) LLAP: Handle task rejection from daemons in the AM
[ https://issues.apache.org/jira/browse/HIVE-9999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth resolved HIVE-9999. -- Resolution: Fixed LLAP: Handle task rejection from daemons in the AM -- Key: HIVE-9999 URL: https://issues.apache.org/jira/browse/HIVE-9999 Project: Hive Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Siddharth Seth Fix For: llap Attachments: HIVE-9999.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9912) LLAP: Improvements to the Shuffle handler to avoid unnecessary disk scans
[ https://issues.apache.org/jira/browse/HIVE-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated HIVE-9912: - Attachment: HIVE-9912.1.txt Patch to cache files which have been previously scanned, and add a watcher for files being created. Also reduces the logging on new work submissions. LLAP: Improvements to the Shuffle handler to avoid unnecessary disk scans - Key: HIVE-9912 URL: https://issues.apache.org/jira/browse/HIVE-9912 Project: Hive Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Siddharth Seth Fix For: llap -- This message was sent by Atlassian JIRA (v6.3.4#6332)
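A rough sketch of the caching-plus-watcher idea described in the comment above (names and structure assumed, not the HIVE-9912 patch): remember paths already discovered so repeated shuffle requests skip a directory scan, and rely on a filesystem watcher to learn about newly created files.
{code}
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class ScannedFileCacheSketch {
  private final Set<Path> knownFiles = ConcurrentHashMap.newKeySet();

  public boolean isKnown(Path p) {
    return knownFiles.contains(p);
  }

  // Registers newly created files as they appear instead of rescanning the directory.
  public void watch(Path dir) throws IOException, InterruptedException {
    WatchService ws = dir.getFileSystem().newWatchService();
    dir.register(ws, StandardWatchEventKinds.ENTRY_CREATE);
    WatchKey key;
    while ((key = ws.take()) != null) {
      for (WatchEvent<?> ev : key.pollEvents()) {
        knownFiles.add(dir.resolve((Path) ev.context()));
      }
      key.reset();
    }
  }
}
{code}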
[jira] [Resolved] (HIVE-9912) LLAP: Improvements to the Shuffle handler to avoid unnecessary disk scans
[ https://issues.apache.org/jira/browse/HIVE-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth resolved HIVE-9912. -- Resolution: Fixed LLAP: Improvements to the Shuffle handler to avoid unnecessary disk scans - Key: HIVE-9912 URL: https://issues.apache.org/jira/browse/HIVE-9912 Project: Hive Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Siddharth Seth Fix For: llap Attachments: HIVE-9912.1.txt -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-10025) LLAP: Queued work times out
[ https://issues.apache.org/jira/browse/HIVE-10025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated HIVE-10025: -- Issue Type: Sub-task (was: Improvement) Parent: HIVE-7926 LLAP: Queued work times out --- Key: HIVE-10025 URL: https://issues.apache.org/jira/browse/HIVE-10025 Project: Hive Issue Type: Sub-task Reporter: Siddharth Seth If a daemon holds a task in queue for a long time, it'll eventually time out - but isn't removed from the queue. Ideally, it shouldn't be allowed to time out. Otherwise, handle the timeout so that the task doesn't run - or starts and fails - likely a change in the TaskCommunicator. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9912) LLAP: Improvements to the Shuffle handler to avoid unnecessary disk scans
[ https://issues.apache.org/jira/browse/HIVE-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated HIVE-9912: - Attachment: (was: HIVE-9912.1.txt) LLAP: Improvements to the Shuffle handler to avoid unnecessary disk scans - Key: HIVE-9912 URL: https://issues.apache.org/jira/browse/HIVE-9912 Project: Hive Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Siddharth Seth Fix For: llap -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9912) LLAP: Improvements to the Shuffle handler to avoid unnecessary disk scans
[ https://issues.apache.org/jira/browse/HIVE-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated HIVE-9912: - Attachment: HIVE-9912.1.txt LLAP: Improvements to the Shuffle handler to avoid unnecessary disk scans - Key: HIVE-9912 URL: https://issues.apache.org/jira/browse/HIVE-9912 Project: Hive Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Siddharth Seth Fix For: llap Attachments: HIVE-9912.1.txt -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-10026) LLAP: AM should get notifications on daemons going down or restarting
[ https://issues.apache.org/jira/browse/HIVE-10026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated HIVE-10026: -- Fix Version/s: llap LLAP: AM should get notifications on daemons going down or restarting - Key: HIVE-10026 URL: https://issues.apache.org/jira/browse/HIVE-10026 Project: Hive Issue Type: Sub-task Reporter: Siddharth Seth Fix For: llap There's lost state otherwise, which can cause queries to hang. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-10025) LLAP: Queued work times out
[ https://issues.apache.org/jira/browse/HIVE-10025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated HIVE-10025: -- Fix Version/s: llap LLAP: Queued work times out --- Key: HIVE-10025 URL: https://issues.apache.org/jira/browse/HIVE-10025 Project: Hive Issue Type: Sub-task Reporter: Siddharth Seth Fix For: llap If a daemon holds a task in queue for a long time, it'll eventually time out - but isn't removed from the queue. Ideally, it shouldn't be allowed to time out. Otherwise, handle the timeout so that the task doesn't run - or starts and fails - likely a change in the TaskCommunicator. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9807) LLAP: Add event logging for execution elements
[ https://issues.apache.org/jira/browse/HIVE-9807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14370599#comment-14370599 ] Siddharth Seth commented on HIVE-9807: -- This doesn't need any additional user documentation. It's meant for consumption by tools. LLAP: Add event logging for execution elements -- Key: HIVE-9807 URL: https://issues.apache.org/jira/browse/HIVE-9807 Project: Hive Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Siddharth Seth Fix For: llap Attachments: HIVE-9807.1.patch, HIVE-9807.2.patch, llap-executors.png For analysis of runtimes, submit/start delays, interleaving etc. !llap-executors.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9808) LLAP: Push work into daemons instead of the current pull
[ https://issues.apache.org/jira/browse/HIVE-9808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated HIVE-9808: - Attachment: HIVE-9808.2.txt Rebased patch. Will commit shortly; this one was painful to rebase. There are some UGI / closeAllForFileSystem changes which will need to be worked on in a follow-up. LLAP: Push work into daemons instead of the current pull Key: HIVE-9808 URL: https://issues.apache.org/jira/browse/HIVE-9808 Project: Hive Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Siddharth Seth Fix For: llap Attachments: HIVE-9808.1.txt, HIVE-9808.2.txt -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (HIVE-9808) LLAP: Push work into daemons instead of the current pull
[ https://issues.apache.org/jira/browse/HIVE-9808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth resolved HIVE-9808. -- Resolution: Fixed LLAP: Push work into daemons instead of the current pull Key: HIVE-9808 URL: https://issues.apache.org/jira/browse/HIVE-9808 Project: Hive Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Siddharth Seth Fix For: llap Attachments: HIVE-9808.1.txt, HIVE-9808.2.txt -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9775) LLAP: Add a MiniLLAPCluster for tests
[ https://issues.apache.org/jira/browse/HIVE-9775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated HIVE-9775: - Attachment: HIVE-9775.2.patch Re-based patch. LLAP: Add a MiniLLAPCluster for tests - Key: HIVE-9775 URL: https://issues.apache.org/jira/browse/HIVE-9775 Project: Hive Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Siddharth Seth Fix For: llap Attachments: HIVE-9775.1.patch, HIVE-9775.2.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (HIVE-9910) LLAP: Update usage of APIs changed by TEZ-2175 and TEZ-2187
[ https://issues.apache.org/jira/browse/HIVE-9910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth resolved HIVE-9910. -- Resolution: Fixed Already committed. LLAP: Update usage of APIs changed by TEZ-2175 and TEZ-2187 --- Key: HIVE-9910 URL: https://issues.apache.org/jira/browse/HIVE-9910 Project: Hive Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Siddharth Seth Fix For: llap Attachments: HIVE-9910.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9910) LLAP: Update usage of APIs changed by TEZ-2175 and TEZ-2187
[ https://issues.apache.org/jira/browse/HIVE-9910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated HIVE-9910: - Attachment: HIVE-9910.1.patch Trivial patch. LLAP: Update usage of APIs changed by TEZ-2175 and TEZ-2187 --- Key: HIVE-9910 URL: https://issues.apache.org/jira/browse/HIVE-9910 Project: Hive Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Siddharth Seth Fix For: llap Attachments: HIVE-9910.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9891) LLAP: disable plan caching
[ https://issues.apache.org/jira/browse/HIVE-9891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352337#comment-14352337 ] Siddharth Seth commented on HIVE-9891: -- Would be nice if the plan were immutable - I'm guessing that's a big change and is an item for later. Caching the plan and cloning it for each execution, rather than deserializing it each time may be another option. LLAP: disable plan caching -- Key: HIVE-9891 URL: https://issues.apache.org/jira/browse/HIVE-9891 Project: Hive Issue Type: Sub-task Reporter: Gunther Hagleitner Assignee: Gunther Hagleitner Attachments: HIVE-9891.1.patch Can't share the same plan objects in LLAP as they are used concurrently. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
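A minimal sketch of the cache-and-clone alternative mentioned in the comment, using plain Java serialization for the deep copy purely for illustration; the real plan objects would need a faster mechanism, and this is not how Hive does it today.
{code}
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public final class PlanCloneSketch {
  // Each execution gets its own mutable copy of the cached plan.
  @SuppressWarnings("unchecked")
  public static <T extends Serializable> T deepCopy(T cachedPlan)
      throws IOException, ClassNotFoundException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
      oos.writeObject(cachedPlan);
    }
    try (ObjectInputStream ois =
             new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()))) {
      return (T) ois.readObject();
    }
  }
}
{code}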
[jira] [Updated] (HIVE-9976) Possible race condition in DynamicPartitionPruner for 200ms tasks
[ https://issues.apache.org/jira/browse/HIVE-9976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated HIVE-9976: - Affects Version/s: (was: 1.1.0) 1.0.0 Fix Version/s: (was: 1.1.1) (was: 1.2.0) 1.0.1 Assignee: Siddharth Seth (was: Gunther Hagleitner) This is not limited to LLAP. Assigning to myself - to change the handling of vertex success / init events. Possible race condition in DynamicPartitionPruner for 200ms tasks -- Key: HIVE-9976 URL: https://issues.apache.org/jira/browse/HIVE-9976 Project: Hive Issue Type: Sub-task Components: Tez Affects Versions: 1.0.0 Reporter: Gopal V Assignee: Siddharth Seth Fix For: 1.0.1 Attachments: llap_vertex_200ms.png Race condition in the DynamicPartitionPruner between DynamicPartitionPruner::processVertex() and DynamicPartitionpruner::addEvent() for tasks which respond with both the result and success in a single heartbeat sequence. {code} 2015-03-16 07:05:01,589 ERROR [InputInitializer [Map 1] #0] tez.DynamicPartitionPruner: Expecting: 1, received: 0 2015-03-16 07:05:01,590 ERROR [Dispatcher thread: Central] impl.VertexImpl: Vertex Input: store_sales initializer failed, vertex=vertex_1424502260528_1113_4_04 [Map 1] org.apache.tez.dag.app.dag.impl.AMUserCodeException: org.apache.hadoop.hive.ql.metadata.HiveException: Incorrect event count in dynamic parition pruning {code} !llap_vertex_200ms.png! All 4 upstream vertices of Map 1 need to finish within ~200ms to trigger this, which seems to be consistently happening with LLAP. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9976) Possible race condition in DynamicPartitionPruner for 200ms tasks
[ https://issues.apache.org/jira/browse/HIVE-9976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated HIVE-9976: - Issue Type: Bug (was: Sub-task) Parent: (was: HIVE-7926) Possible race condition in DynamicPartitionPruner for 200ms tasks -- Key: HIVE-9976 URL: https://issues.apache.org/jira/browse/HIVE-9976 Project: Hive Issue Type: Bug Components: Tez Affects Versions: 1.0.0 Reporter: Gopal V Assignee: Siddharth Seth Fix For: 1.0.1 Attachments: llap_vertex_200ms.png Race condition in the DynamicPartitionPruner between DynamicPartitionPruner::processVertex() and DynamicPartitionpruner::addEvent() for tasks which respond with both the result and success in a single heartbeat sequence. {code} 2015-03-16 07:05:01,589 ERROR [InputInitializer [Map 1] #0] tez.DynamicPartitionPruner: Expecting: 1, received: 0 2015-03-16 07:05:01,590 ERROR [Dispatcher thread: Central] impl.VertexImpl: Vertex Input: store_sales initializer failed, vertex=vertex_1424502260528_1113_4_04 [Map 1] org.apache.tez.dag.app.dag.impl.AMUserCodeException: org.apache.hadoop.hive.ql.metadata.HiveException: Incorrect event count in dynamic parition pruning {code} !llap_vertex_200ms.png! All 4 upstream vertices of Map 1 need to finish within ~200ms to trigger this, which seems to be consistently happening with LLAP. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9976) Possible race condition in DynamicPartitionPruner for 200ms tasks
[ https://issues.apache.org/jira/browse/HIVE-9976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated HIVE-9976: - Fix Version/s: 1.1.1 1.2.0 Possible race condition in DynamicPartitionPruner for 200ms tasks -- Key: HIVE-9976 URL: https://issues.apache.org/jira/browse/HIVE-9976 Project: Hive Issue Type: Sub-task Components: Tez Affects Versions: 1.1.0 Reporter: Gopal V Assignee: Gunther Hagleitner Fix For: 1.2.0, 1.1.1 Attachments: llap_vertex_200ms.png Race condition in the DynamicPartitionPruner between DynamicPartitionPruner::processVertex() and DynamicPartitionpruner::addEvent() for tasks which respond with both the result and success in a single heartbeat sequence. {code} 2015-03-16 07:05:01,589 ERROR [InputInitializer [Map 1] #0] tez.DynamicPartitionPruner: Expecting: 1, received: 0 2015-03-16 07:05:01,590 ERROR [Dispatcher thread: Central] impl.VertexImpl: Vertex Input: store_sales initializer failed, vertex=vertex_1424502260528_1113_4_04 [Map 1] org.apache.tez.dag.app.dag.impl.AMUserCodeException: org.apache.hadoop.hive.ql.metadata.HiveException: Incorrect event count in dynamic parition pruning {code} !llap_vertex_200ms.png! All 4 upstream vertices of Map 1 need to finish within ~200ms to trigger this, which seems to be consistently happening with LLAP. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9976) Possible race condition in DynamicPartitionPruner for 200ms tasks
[ https://issues.apache.org/jira/browse/HIVE-9976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated HIVE-9976: - Summary: Possible race condition in DynamicPartitionPruner for 200ms tasks (was: LLAP: Possible race condition in DynamicPartitionPruner for 200ms tasks) Possible race condition in DynamicPartitionPruner for 200ms tasks -- Key: HIVE-9976 URL: https://issues.apache.org/jira/browse/HIVE-9976 Project: Hive Issue Type: Sub-task Components: Tez Affects Versions: 1.1.0 Reporter: Gopal V Assignee: Gunther Hagleitner Fix For: 1.2.0, 1.1.1 Attachments: llap_vertex_200ms.png Race condition in the DynamicPartitionPruner between DynamicPartitionPruner::processVertex() and DynamicPartitionpruner::addEvent() for tasks which respond with both the result and success in a single heartbeat sequence. {code} 2015-03-16 07:05:01,589 ERROR [InputInitializer [Map 1] #0] tez.DynamicPartitionPruner: Expecting: 1, received: 0 2015-03-16 07:05:01,590 ERROR [Dispatcher thread: Central] impl.VertexImpl: Vertex Input: store_sales initializer failed, vertex=vertex_1424502260528_1113_4_04 [Map 1] org.apache.tez.dag.app.dag.impl.AMUserCodeException: org.apache.hadoop.hive.ql.metadata.HiveException: Incorrect event count in dynamic parition pruning {code} !llap_vertex_200ms.png! All 4 upstream vertices of Map 1 need to finish within ~200ms to trigger this, which seems to be consistently happening with LLAP. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9976) Possible race condition in DynamicPartitionPruner for 200ms tasks
[ https://issues.apache.org/jira/browse/HIVE-9976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated HIVE-9976: - Affects Version/s: (was: llap) 1.1.0 Possible race condition in DynamicPartitionPruner for 200ms tasks -- Key: HIVE-9976 URL: https://issues.apache.org/jira/browse/HIVE-9976 Project: Hive Issue Type: Sub-task Components: Tez Affects Versions: 1.1.0 Reporter: Gopal V Assignee: Gunther Hagleitner Fix For: 1.2.0, 1.1.1 Attachments: llap_vertex_200ms.png Race condition in the DynamicPartitionPruner between DynamicPartitionPruner::processVertex() and DynamicPartitionpruner::addEvent() for tasks which respond with both the result and success in a single heartbeat sequence. {code} 2015-03-16 07:05:01,589 ERROR [InputInitializer [Map 1] #0] tez.DynamicPartitionPruner: Expecting: 1, received: 0 2015-03-16 07:05:01,590 ERROR [Dispatcher thread: Central] impl.VertexImpl: Vertex Input: store_sales initializer failed, vertex=vertex_1424502260528_1113_4_04 [Map 1] org.apache.tez.dag.app.dag.impl.AMUserCodeException: org.apache.hadoop.hive.ql.metadata.HiveException: Incorrect event count in dynamic parition pruning {code} !llap_vertex_200ms.png! All 4 upstream vertices of Map 1 need to finish within ~200ms to trigger this, which seems to be consistently happening with LLAP. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9976) Possible race condition in DynamicPartitionPruner for 200ms tasks
[ https://issues.apache.org/jira/browse/HIVE-9976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated HIVE-9976: - Attachment: HIVE-9976.1.patch Patch to handle out of order events. Also initializes the pruner during Input construction - so that events don't show up before the pruner is initialized. Adds a bunch of tests. [~hagleitn], [~vikram.dixit] - please review. Possible race condition in DynamicPartitionPruner for 200ms tasks -- Key: HIVE-9976 URL: https://issues.apache.org/jira/browse/HIVE-9976 Project: Hive Issue Type: Bug Components: Tez Affects Versions: 1.0.0 Reporter: Gopal V Assignee: Siddharth Seth Fix For: 1.0.1 Attachments: HIVE-9976.1.patch, llap_vertex_200ms.png Race condition in the DynamicPartitionPruner between DynamicPartitionPruner::processVertex() and DynamicPartitionpruner::addEvent() for tasks which respond with both the result and success in a single heartbeat sequence. {code} 2015-03-16 07:05:01,589 ERROR [InputInitializer [Map 1] #0] tez.DynamicPartitionPruner: Expecting: 1, received: 0 2015-03-16 07:05:01,590 ERROR [Dispatcher thread: Central] impl.VertexImpl: Vertex Input: store_sales initializer failed, vertex=vertex_1424502260528_1113_4_04 [Map 1] org.apache.tez.dag.app.dag.impl.AMUserCodeException: org.apache.hadoop.hive.ql.metadata.HiveException: Incorrect event count in dynamic parition pruning {code} !llap_vertex_200ms.png! All 4 upstream vertices of Map 1 need to finish within ~200ms to trigger this, which seems to be consistently happening with LLAP. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9976) Possible race condition in DynamicPartitionPruner for 200ms tasks
[ https://issues.apache.org/jira/browse/HIVE-9976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated HIVE-9976: - Attachment: HIVE-9976.2.patch Thanks for the review. Updated patch with comments addressed, and some more changes. bq. Not your fault - but there are 2 paths through HiveSplitGenerator. Moved the methods into SplitGrouper. There's a static cache in there which seems a little strange. Will create a follow up jira to investigate this. For now I've changed that to a ConcurrentMap since split generation can run in parallel. bq. i see you've fixed calling close consistently on the data input stream. maybe use try{}finally there? Fixed. There was a bug with some of the other conditions which I'd changed. Fixed that as well. bq. it seems you're setting numexpectedevents to 0 first and then turn around and call decrement. Why not just set to -1? Also - why atomic integers? as far as i can tell all access to these maps is synchronized. numExpectedEvents is decremented for each column for which a source will send events. That's used to track total number of expected events from that source. Added a comment for this. Moved from AtomicIntegers to MutableInt - this was just to avoid re-inserting the Integer into the map, and not for thread safety. bq. does it make sense to make initialize in the pruner private now? (can't be used to init anymore - only from the constr). Also, the parameters aren't used anymore, right? Done, along with some other methods. Possible race condition in DynamicPartitionPruner for 200ms tasks -- Key: HIVE-9976 URL: https://issues.apache.org/jira/browse/HIVE-9976 Project: Hive Issue Type: Bug Components: Tez Affects Versions: 1.0.0 Reporter: Gopal V Assignee: Siddharth Seth Attachments: HIVE-9976.1.patch, HIVE-9976.2.patch, llap_vertex_200ms.png Race condition in the DynamicPartitionPruner between DynamicPartitionPruner::processVertex() and DynamicPartitionpruner::addEvent() for tasks which respond with both the result and success in a single heartbeat sequence. {code} 2015-03-16 07:05:01,589 ERROR [InputInitializer [Map 1] #0] tez.DynamicPartitionPruner: Expecting: 1, received: 0 2015-03-16 07:05:01,590 ERROR [Dispatcher thread: Central] impl.VertexImpl: Vertex Input: store_sales initializer failed, vertex=vertex_1424502260528_1113_4_04 [Map 1] org.apache.tez.dag.app.dag.impl.AMUserCodeException: org.apache.hadoop.hive.ql.metadata.HiveException: Incorrect event count in dynamic parition pruning {code} !llap_vertex_200ms.png! All 4 upstream vertices of Map 1 need to finish within ~200ms to trigger this, which seems to be consistently happening with LLAP. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
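A sketch of the per-source bookkeeping described in the review exchange above (shape assumed, not the committed patch): a synchronized map of small mutable counters avoids re-boxing and re-inserting Integers, with the map's monitor - not atomics - providing thread safety.
{code}
import java.util.HashMap;
import java.util.Map;

public class ExpectedEventCounters {
  // Simple mutable holder, in the spirit of commons-lang MutableInt.
  static final class Counter { int value; }

  private final Map<String, Counter> expectedBySource = new HashMap<>();

  // Called once per column for which the source vertex will send pruning events.
  public synchronized void addExpected(String sourceVertex, int numColumns) {
    expectedBySource.computeIfAbsent(sourceVertex, s -> new Counter()).value += numColumns;
  }

  public synchronized void eventReceived(String sourceVertex) {
    Counter c = expectedBySource.get(sourceVertex);
    if (c != null) {
      c.value--;  // one fewer event still outstanding from this source
    }
  }

  public synchronized boolean allReceived(String sourceVertex) {
    Counter c = expectedBySource.get(sourceVertex);
    return c == null || c.value <= 0;
  }
}
{code}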
[jira] [Resolved] (HIVE-10104) LLAP: Generate consistent splits and locations for the same split across jobs
[ https://issues.apache.org/jira/browse/HIVE-10104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth resolved HIVE-10104. --- Resolution: Fixed LLAP: Generate consistent splits and locations for the same split across jobs - Key: HIVE-10104 URL: https://issues.apache.org/jira/browse/HIVE-10104 Project: Hive Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Siddharth Seth Fix For: llap Attachments: HIVE-10104.1.txt, HIVE-10104.2.txt Locations for splits are currently randomized. Also, the order of splits is random - depending on how threads end up generating the splits. Add an option to sort the splits, and generate repeatable locations - assuming all other factors are the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-10106) Regression : Dynamic partition pruning not working after HIVE-9976
[ https://issues.apache.org/jira/browse/HIVE-10106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated HIVE-10106: -- Attachment: HIVE-10106.1.patch I believe this is caused by the way mapWork is set up. It's now created in the constructor of HiveSplitGenerator. The constructor and the initialize method may not be invoked in the same thread. As a result, the initialize method ends up seeing a different copy of mapWork from the one modified in the pruner. Attaching a patch to fix this - by setting the mapWork in the initialize method. [~hagleitn] - please review, and validate the theory. [~mmokhtar] - I wasn't able to reproduce this. Seeing pruning work as it should for the simple query that you'd sent me offline. May need help reproducing the issue and validating the patch. Thanks Regression : Dynamic partition pruning not working after HIVE-9976 -- Key: HIVE-10106 URL: https://issues.apache.org/jira/browse/HIVE-10106 Project: Hive Issue Type: Bug Components: Hive Affects Versions: 1.2.0 Reporter: Mostafa Mokhtar Assignee: Siddharth Seth Fix For: 1.2.0 Attachments: HIVE-10106.1.patch After HIVE-9976 got checked in, dynamic partition pruning doesn't work. Partitions are pruned and later show up in splits. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
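A minimal sketch of the fix's reasoning (hypothetical names, not HiveSplitGenerator itself): state that the initializer thread reads and mutates should be set up in initialize(), which runs on that thread, rather than in the constructor, which may run on a different one with no happens-before between the two.
{code}
public class SplitGeneratorSketch {
  private Object mapWork;  // stands in for the real MapWork plan object

  public SplitGeneratorSketch() {
    // Intentionally not loading mapWork here: the constructor may run on a
    // different thread than initialize().
  }

  public void initialize() {
    // Loading here guarantees the copy the pruner modifies is the same copy
    // used later for split generation.
    this.mapWork = loadMapWorkFromConf();
  }

  private Object loadMapWorkFromConf() {
    return new Object();  // placeholder for deserializing the plan from the job conf
  }
}
{code}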
[jira] [Updated] (HIVE-9775) LLAP: Add a MiniLLAPCluster for tests
[ https://issues.apache.org/jira/browse/HIVE-9775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated HIVE-9775: - Attachment: HIVE-9775.1.patch Patch to add a MiniLLAPCluster. This isn't wired into the tests and shims just yet - that needs some more work with circular dependencies and such. Will figure that out in a separate jira. Applies on top of HIVE-9808. LLAP: Add a MiniLLAPCluster for tests - Key: HIVE-9775 URL: https://issues.apache.org/jira/browse/HIVE-9775 Project: Hive Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Siddharth Seth Fix For: llap Attachments: HIVE-9775.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9807) LLAP: Add event logging for execution elements
[ https://issues.apache.org/jira/browse/HIVE-9807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated HIVE-9807: - Attachment: HIVE-9807.1.patch Sample log lines - additional details will be populated in a later patch, when available. {code} Event=FRAGMENT_START, HostName=hw10890, ApplicationId=application_1425008147866_0006, ContainerId=container_1_0006_01_01, DagName=null, VertexName=null, TaskId=-1, TaskAttemptId=-1, SubmitTime=1425020533780 Event=FRAGMENT_END, HostName=hw10890, ApplicationId=application_1425008147866_0006, ContainerId=container_1_0006_01_01, DagName=null, VertexName=null, TaskId=-1, TaskAttemptId=-1, Succeeded=true, StartTime=1425020533779, EndTime=1425020535678 {code} cc [~gopalv] LLAP: Add event logging for execution elements -- Key: HIVE-9807 URL: https://issues.apache.org/jira/browse/HIVE-9807 Project: Hive Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Siddharth Seth Attachments: HIVE-9807.1.patch For analysis of runtimes, interleaving etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9808) LLAP: Push work into daemons instead of the current pull
[ https://issues.apache.org/jira/browse/HIVE-9808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14340878#comment-14340878 ] Siddharth Seth commented on HIVE-9808: -- Applies on top of TEZ-9807. LLAP: Push work into daemons instead of the current pull Key: HIVE-9808 URL: https://issues.apache.org/jira/browse/HIVE-9808 Project: Hive Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Siddharth Seth Fix For: llap Attachments: HIVE-9808.1.txt -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-10113) LLAP: reducers running in LLAP starve out map retries
[ https://issues.apache.org/jira/browse/HIVE-10113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14384247#comment-14384247 ] Siddharth Seth commented on HIVE-10113: --- Related: https://issues.apache.org/jira/browse/HIVE-10029 This is expected at the moment, until we support pre-empting tasks / removal of tasks from queues. LLAP: reducers running in LLAP starve out map retries - Key: HIVE-10113 URL: https://issues.apache.org/jira/browse/HIVE-10113 Project: Hive Issue Type: Sub-task Reporter: Sergey Shelukhin Assignee: Gunther Hagleitner When query 17 is run, some mappers from Map 1 currently fail (due to unwrap issue, and also due to HIVE-10112). This query has 1000+ reducers; if they are run in llap, they all queue up, and the query locks up. If only mappers run in LLAP, the query completes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-10117) LLAP: Use task number, attempt number to cache plans
[ https://issues.apache.org/jira/browse/HIVE-10117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated HIVE-10117: -- Fix Version/s: llap LLAP: Use task number, attempt number to cache plans Key: HIVE-10117 URL: https://issues.apache.org/jira/browse/HIVE-10117 Project: Hive Issue Type: Sub-task Reporter: Siddharth Seth Fix For: llap Instead of relying on thread locals only. This can be used to share the work between Inputs / Processor / Outputs in Tez. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-10104) LLAP: Generate consistent splits and locations for the same split across jobs
[ https://issues.apache.org/jira/browse/HIVE-10104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated HIVE-10104: -- Attachment: HIVE-10104.1.txt Patch to order the original splits by size and name. Location is based on a hash of the filename and start position. [~hagleitn] - could you please take a quick look for sanity. Will commit after I'm able to test it a bit on a cluster larger than 1 node. LLAP: Generate consistent splits and locations for the same split across jobs - Key: HIVE-10104 URL: https://issues.apache.org/jira/browse/HIVE-10104 Project: Hive Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Siddharth Seth Fix For: llap Attachments: HIVE-10104.1.txt Locations for splits are currently randomized. Also, the order of splits is random - depending on how threads end up generating the splits. Add an option to sort the splits, and generate repeatable locations - assuming all other factors are the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
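A compact sketch of the location scheme described above (assumed shape, not the actual patch): derive a stable location from the file name and start offset so the same split lands on the same daemon across jobs, as long as the set of candidate locations is unchanged.
{code}
import java.util.List;

public final class ConsistentSplitLocationSketch {
  public static String locationFor(String path, long startOffset, List<String> activeLocations) {
    int hash = (path + ":" + startOffset).hashCode();
    int idx = Math.floorMod(hash, activeLocations.size());
    return activeLocations.get(idx);
  }
}
{code}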
[jira] [Updated] (HIVE-10104) LLAP: Generate consistent splits and locations for the same split across jobs
[ https://issues.apache.org/jira/browse/HIVE-10104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated HIVE-10104: -- Attachment: HIVE-10104.2.txt Updated patch with the sort removed from the scheduler. Tested on a multi-node cluster. Will commit after the next rebase of the LLAP branch. LLAP: Generate consistent splits and locations for the same split across jobs - Key: HIVE-10104 URL: https://issues.apache.org/jira/browse/HIVE-10104 Project: Hive Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Siddharth Seth Fix For: llap Attachments: HIVE-10104.1.txt, HIVE-10104.2.txt Locations for splits are currently randomized. Also, the order of splits is random - depending on how threads end up generating the splits. Add an option to sort the splits, and generate repeatable locations - assuming all other factors are the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (HIVE-9913) LLAP: Avoid fetching data multiple times in case of broadcast
[ https://issues.apache.org/jira/browse/HIVE-9913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth resolved HIVE-9913. -- Resolution: Fixed LLAP: Avoid fetching data multiple times in case of broadcast - Key: HIVE-9913 URL: https://issues.apache.org/jira/browse/HIVE-9913 Project: Hive Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Siddharth Seth Fix For: llap Attachments: HIVE-9913.1.txt -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9913) LLAP: Avoid fetching data multiple times in case of broadcast
[ https://issues.apache.org/jira/browse/HIVE-9913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated HIVE-9913: - Attachment: HIVE-9913.1.txt Patch delays the start until the Input is actually used, for Unordered cases (broadcast and non-broadcast for now), which is soon after the Processor starts running. LLAP: Avoid fetching data multiple times in case of broadcast - Key: HIVE-9913 URL: https://issues.apache.org/jira/browse/HIVE-9913 Project: Hive Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Siddharth Seth Fix For: llap Attachments: HIVE-9913.1.txt -- This message was sent by Atlassian JIRA (v6.3.4#6332)
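A sketch of the start-on-first-use idea the patch describes (structure assumed, not the Tez Input API): kick off the fetchers lazily when the reader is first requested, so data shared via broadcast is not pulled before the processor actually needs it.
{code}
import java.util.concurrent.atomic.AtomicBoolean;

public class LazyStartInputSketch {
  private final AtomicBoolean started = new AtomicBoolean(false);

  public Object getReader() {
    if (started.compareAndSet(false, true)) {
      startFetchers();  // only the first caller pays the startup cost
    }
    return new Object();  // placeholder for the actual key-value reader
  }

  private void startFetchers() {
    // Placeholder: begin pulling shuffle data for this input.
  }
}
{code}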
[jira] [Updated] (HIVE-10408) LLAP: NPE in scheduler in case of rejected tasks
[ https://issues.apache.org/jira/browse/HIVE-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated HIVE-10408: -- Summary: LLAP: NPE in scheduler in case of rejected tasks (was: LLAP: query fails - NPE (old exception I posted was bogus)) LLAP: NPE in scheduler in case of rejected tasks Key: HIVE-10408 URL: https://issues.apache.org/jira/browse/HIVE-10408 Project: Hive Issue Type: Sub-task Reporter: Sergey Shelukhin Assignee: Siddharth Seth Fix For: llap Attachments: HIVE-10408.1.txt {noformat} java.lang.NullPointerException at org.apache.tez.dag.app.rm.LlapTaskSchedulerService.deallocateTask(LlapTaskSchedulerService.java:388) at org.apache.tez.dag.app.rm.TaskSchedulerEventHandler.handleTASucceeded(TaskSchedulerEventHandler.java:339) at org.apache.tez.dag.app.rm.TaskSchedulerEventHandler.handleEvent(TaskSchedulerEventHandler.java:224) at org.apache.tez.dag.app.rm.TaskSchedulerEventHandler$1.run(TaskSchedulerEventHandler.java:493) {noformat} The query, running alone on 10-node cluster, dumped 1000 mappers into running; with 3 completed it failed with that. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-10475) LLAP: Minor fixes after tez api enhancements for dag completion
[ https://issues.apache.org/jira/browse/HIVE-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated HIVE-10475: -- Attachment: HIVE-10475.1.txt LLAP: Minor fixes after tez api enhancements for dag completion --- Key: HIVE-10475 URL: https://issues.apache.org/jira/browse/HIVE-10475 Project: Hive Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Siddharth Seth Fix For: llap Attachments: HIVE-10475.1.txt TEZ-2212 and TEZ-2361 add APIs to propagate dag completion information to the TaskCommunicator plugin. This jira is for minor fixes to get the llap branch to compile against these changes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-10405) LLAP: Provide runtime information to daemons to decide on preemption order
[ https://issues.apache.org/jira/browse/HIVE-10405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated HIVE-10405: -- Attachment: HIVE-10405.1.txt The following information is sent into daemons at fragment submission time - start time of the dag - start time of the first attempt of a specific fragment - The priority of a fragment within an executing dag - determined by the topological order in the DAG (this is irrelevant across DAGs) - number of tasks in the current vertex + upstream to the current vertex - number of completed tasks in the current vertex + upstream to the current vertex. LLAP: Provide runtime information to daemons to decide on preemption order -- Key: HIVE-10405 URL: https://issues.apache.org/jira/browse/HIVE-10405 Project: Hive Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Siddharth Seth Fix For: llap Attachments: HIVE-10405.1.txt -- This message was sent by Atlassian JIRA (v6.3.4#6332)
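To make the intent concrete, a sketch of how a daemon might order waiting fragments using the fields listed above; the field names and the exact ordering are illustrative, not the LLAP scheduler implementation.
{code}
import java.util.Comparator;

class FragmentInfo {
  long dagStartTime;
  long firstAttemptStartTime;
  int withinDagPriority;            // topological order within the DAG; lower runs first
  int selfAndUpstreamTasks;
  int selfAndUpstreamCompletedTasks;
}

class PreemptionOrderSketch {
  // Older DAGs first, then earlier-started attempts, then upstream-most vertices.
  static final Comparator<FragmentInfo> ORDER =
      Comparator.<FragmentInfo>comparingLong(f -> f.dagStartTime)
          .thenComparingLong(f -> f.firstAttemptStartTime)
          .thenComparingInt(f -> f.withinDagPriority);
}
{code}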
[jira] [Updated] (HIVE-10424) LLAP: Factor known capacity into scheduling decisions
[ https://issues.apache.org/jira/browse/HIVE-10424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated HIVE-10424: -- Attachment: HIVE-10424.1.txt Patch to factor in running queue + wait queue capacity per node. Also moves all scheduling onto a single thread - requests go onto a queue and are taken off whenever a node becomes available or has capacity. Can run with the old 'unlimited' capacity by setting llap.task.scheduler.num.schedulable.tasks.per.node to -1. LLAP: Factor known capacity into scheduling decisions - Key: HIVE-10424 URL: https://issues.apache.org/jira/browse/HIVE-10424 Project: Hive Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Siddharth Seth Fix For: llap Attachments: HIVE-10424.1.txt -- This message was sent by Atlassian JIRA (v6.3.4#6332)
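A rough sketch of the single-threaded, capacity-aware loop described above (structure assumed, not LlapTaskSchedulerService): submissions go onto a queue, and one thread decides when a node has room for the next fragment.
{code}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class CapacityAwareSchedulerSketch implements Runnable {
  private final BlockingQueue<Runnable> pendingTasks = new LinkedBlockingQueue<>();
  private final int tasksPerNode;  // -1 keeps the old 'unlimited' behaviour

  public CapacityAwareSchedulerSketch(int tasksPerNode) {
    this.tasksPerNode = tasksPerNode;
  }

  public void submit(Runnable task) {
    pendingTasks.add(task);  // callers never assign work directly; one thread decides
  }

  @Override
  public void run() {
    boolean unlimited = tasksPerNode == -1;
    try {
      while (!Thread.currentThread().isInterrupted()) {
        Runnable task = pendingTasks.take();
        // Placeholder decision: when not unlimited, only dispatch if some node's
        // running + queued count is below tasksPerNode; otherwise requeue and wait.
        if (unlimited || nodeWithSpareCapacityExists()) {
          task.run();
        } else {
          pendingTasks.add(task);  // a real implementation would wait on a completion signal
        }
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  private boolean nodeWithSpareCapacityExists() {
    return true;  // placeholder for per-node capacity tracking
  }
}
{code}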
[jira] [Resolved] (HIVE-10424) LLAP: Factor known capacity into scheduling decisions
[ https://issues.apache.org/jira/browse/HIVE-10424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth resolved HIVE-10424. --- Resolution: Fixed LLAP: Factor known capacity into scheduling decisions - Key: HIVE-10424 URL: https://issues.apache.org/jira/browse/HIVE-10424 Project: Hive Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Siddharth Seth Fix For: llap Attachments: HIVE-10424.1.txt -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-10394) LLAP: Notify AM of pre-emption
[ https://issues.apache.org/jira/browse/HIVE-10394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14503373#comment-14503373 ] Siddharth Seth commented on HIVE-10394: --- The information isn't actually being sent across to the AM. What's handled right now is a response to the submitWork request. However, once a request moves onto the scheduler queue for execution at a later point, an RPC invocation will be required to inform the AM about the status of the task. This would be an addition to LlapTaskUmbilicalProtocol. LLAP: Notify AM of pre-emption -- Key: HIVE-10394 URL: https://issues.apache.org/jira/browse/HIVE-10394 Project: Hive Issue Type: Sub-task Affects Versions: llap Reporter: Prasanth Jayachandran Assignee: Prasanth Jayachandran Attachments: HIVE-10394.1.patch Pre-empted tasks should be notified to AM as killed/interrupted by system. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
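Purely as a sketch of the kind of addition described - the method name and signature below are hypothetical, not part of the actual LlapTaskUmbilicalProtocol - the daemon would need a daemon-to-AM callback for fragments that were preempted or rejected after landing in the wait queue.
{code}
interface UmbilicalPreemptionSketch {
  // Daemon -> AM: the named task attempt will not run (e.g. preempted while queued),
  // so the AM can reschedule it instead of waiting for a heartbeat timeout.
  void taskKilled(String taskAttemptId, String diagnostics);
}
{code}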
[jira] [Commented] (HIVE-10480) LLAP: Tez task is interrupted for unknown reason after an IPC exception and then fails to report completion
[ https://issues.apache.org/jira/browse/HIVE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14511738#comment-14511738 ] Siddharth Seth commented on HIVE-10480: --- This is what is happening here. The heartbeat being sent out for a task (from the LLAP daemon) to the AM is corrupt - TEZ-2367. This causes an error to be reported and an Interrupt on the task. The NPE and Ignoring exception can be ignored - that's caused by the task being unregistered as Prasanth pointed out. It's not the root cause of failure, and logging it always causes confusion. The log line has already been pruned in Tez (yesterday). Since the daemon considers the task to be dead - it won't send another heartbeat to the AM. The AM has no idea that the task is dead - since the last heartbeat was corrupt. The regular timeout mechanism kicks in, and the task is considered dead after 5 minutes (the default timeout). A new attempt of the same task is setup and runs to completion. LLAP: Tez task is interrupted for unknown reason after an IPC exception and then fails to report completion --- Key: HIVE-10480 URL: https://issues.apache.org/jira/browse/HIVE-10480 Project: Hive Issue Type: Sub-task Reporter: Sergey Shelukhin No idea if this is LLAP bug, Tez bug, Hadoop IPC bug (due to patch on the cluster), or all 3. So for now I will just dump all I have here. TPCH Q1 started running for a long time for me on large number of runs today (didn't happen yesterday). It would always be one Map task timing out. Example attempt (logs from am): {noformat} 2015-04-24 11:11:01,073 INFO [TaskCommunicator # 0] tezplugins.LlapTaskCommunicator: Successfully launched task: attempt_1429683757595_0321_9_00_000928_0 2015-04-24 11:16:25,498 INFO [Dispatcher thread: Central] history.HistoryEventHandler: [HISTORY][DAG:dag_1429683757595_0321_9][Event:TASK_ATTEMPT_FINISHED]: vertexName=Map 1, taskAttemptId=attempt_1429683757595_0321_9_00_000928_0, startTime=1429899061071, finishTime=1429899385498, timeTaken=324427, status=FAILED, errorEnum=TASK_HEARTBEAT_ERROR, diagnostics=AttemptID:attempt_1429683757595_0321_9_00_000928_0 Timed out after 300 secs, counters=Counters: 1, org.apache.tez.common.counters.DAGCounter, RACK_LOCAL_TASKS=1 {noformat} No other lines for this attempt in between. 
However there's this: {noformat} 2015-04-24 11:11:01,074 WARN [Socket Reader #1 for port 59446] ipc.Server: Unable to read call parameters for client 172.19.128.56on connection protocol org.apache.hadoop.hive.llap.protocol.LlapTaskUmbilicalProtocol for rpcKind RPC_WRITABLE java.lang.ArrayIndexOutOfBoundsException 2015-04-24 11:11:01,075 INFO [Socket Reader #1 for port 59446] ipc.Server: Socket Reader #1 for port 59446: readAndProcess from client 172.19.128.56 threw exception [org.apache.hadoop.ipc.RpcServerException: IPC server unable to read call parameters: null] {noformat} On LLAP, the following is logged {noformat} 2015-04-24 11:11:01,142 [TaskHeartbeatThread()] ERROR org.apache.tez.runtime.task.TezTaskRunner: TaskReporter reported error org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RpcServerException): IPC server unable to read call parameters: null at org.apache.hadoop.ipc.Client.call(Client.java:1492) at org.apache.hadoop.ipc.Client.call(Client.java:1423) at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:242) at com.sun.proxy.$Proxy19.heartbeat(Unknown Source) at org.apache.hadoop.hive.llap.daemon.impl.LlapTaskReporter$HeartbeatCallable.heartbeat(LlapTaskReporter.java:258) at org.apache.hadoop.hive.llap.daemon.impl.LlapTaskReporter$HeartbeatCallable.call(LlapTaskReporter.java:186) at org.apache.hadoop.hive.llap.daemon.impl.LlapTaskReporter$HeartbeatCallable.call(LlapTaskReporter.java:128) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {noformat} The attempt starts but is then interrupted (not clear by whom) {noformat} 2015-04-24 11:11:01,144 [Initializer 0(container_1_0321_01_008943_sershe_20150424110948_86ce1f6f-7cd2-4a40-b9a6-4a6854f010f6:9_Map 1_928_0)] INFO org.apache.tez.runtime.LogicalIOProcessorRuntimeTask: Initialized Input with src edge: lineitem 2015-04-24 11:11:01,145 [TezTaskRunner_attempt_1429683757595_0321_9_00_000928_0(container_1_0321_01_008943_sershe_20150424110948_86ce1f6f-7cd2-4a40-b9a6-4a6854f010f6:9_Map 1_928_0)] INFO
[jira] [Resolved] (HIVE-10475) LLAP: Minor fixes after tez api enhancements for dag completion
[ https://issues.apache.org/jira/browse/HIVE-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth resolved HIVE-10475. --- Resolution: Fixed LLAP: Minor fixes after tez api enhancements for dag completion --- Key: HIVE-10475 URL: https://issues.apache.org/jira/browse/HIVE-10475 Project: Hive Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Siddharth Seth Fix For: llap Attachments: HIVE-10475.1.txt TEZ-2212 and TEZ-2361 add APIs to propagate dag completion information to the TaskCommunicator plugin. This jira is for minor fixes to get the llap branch to compile against these changes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (HIVE-10480) LLAP: Tez task is interrupted for unknown reason after an IPC exception and then fails to report completion
[ https://issues.apache.org/jira/browse/HIVE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth reassigned HIVE-10480: - Assignee: Siddharth Seth LLAP: Tez task is interrupted for unknown reason after an IPC exception and then fails to report completion --- Key: HIVE-10480 URL: https://issues.apache.org/jira/browse/HIVE-10480 Project: Hive Issue Type: Sub-task Reporter: Sergey Shelukhin Assignee: Siddharth Seth Attachments: HIVE-10480.1.txt No idea if this is LLAP bug, Tez bug, Hadoop IPC bug (due to patch on the cluster), or all 3. So for now I will just dump all I have here. TPCH Q1 started running for a long time for me on large number of runs today (didn't happen yesterday). It would always be one Map task timing out. Example attempt (logs from am): {noformat} 2015-04-24 11:11:01,073 INFO [TaskCommunicator # 0] tezplugins.LlapTaskCommunicator: Successfully launched task: attempt_1429683757595_0321_9_00_000928_0 2015-04-24 11:16:25,498 INFO [Dispatcher thread: Central] history.HistoryEventHandler: [HISTORY][DAG:dag_1429683757595_0321_9][Event:TASK_ATTEMPT_FINISHED]: vertexName=Map 1, taskAttemptId=attempt_1429683757595_0321_9_00_000928_0, startTime=1429899061071, finishTime=1429899385498, timeTaken=324427, status=FAILED, errorEnum=TASK_HEARTBEAT_ERROR, diagnostics=AttemptID:attempt_1429683757595_0321_9_00_000928_0 Timed out after 300 secs, counters=Counters: 1, org.apache.tez.common.counters.DAGCounter, RACK_LOCAL_TASKS=1 {noformat} No other lines for this attempt in between. However there's this: {noformat} 2015-04-24 11:11:01,074 WARN [Socket Reader #1 for port 59446] ipc.Server: Unable to read call parameters for client 172.19.128.56on connection protocol org.apache.hadoop.hive.llap.protocol.LlapTaskUmbilicalProtocol for rpcKind RPC_WRITABLE java.lang.ArrayIndexOutOfBoundsException 2015-04-24 11:11:01,075 INFO [Socket Reader #1 for port 59446] ipc.Server: Socket Reader #1 for port 59446: readAndProcess from client 172.19.128.56 threw exception [org.apache.hadoop.ipc.RpcServerException: IPC server unable to read call parameters: null] {noformat} On LLAP, the following is logged {noformat} 2015-04-24 11:11:01,142 [TaskHeartbeatThread()] ERROR org.apache.tez.runtime.task.TezTaskRunner: TaskReporter reported error org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RpcServerException): IPC server unable to read call parameters: null at org.apache.hadoop.ipc.Client.call(Client.java:1492) at org.apache.hadoop.ipc.Client.call(Client.java:1423) at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:242) at com.sun.proxy.$Proxy19.heartbeat(Unknown Source) at org.apache.hadoop.hive.llap.daemon.impl.LlapTaskReporter$HeartbeatCallable.heartbeat(LlapTaskReporter.java:258) at org.apache.hadoop.hive.llap.daemon.impl.LlapTaskReporter$HeartbeatCallable.call(LlapTaskReporter.java:186) at org.apache.hadoop.hive.llap.daemon.impl.LlapTaskReporter$HeartbeatCallable.call(LlapTaskReporter.java:128) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {noformat} The attempt starts but is then interrupted (not clear by whom) {noformat} 2015-04-24 11:11:01,144 [Initializer 0(container_1_0321_01_008943_sershe_20150424110948_86ce1f6f-7cd2-4a40-b9a6-4a6854f010f6:9_Map 1_928_0)] INFO 
org.apache.tez.runtime.LogicalIOProcessorRuntimeTask: Initialized Input with src edge: lineitem 2015-04-24 11:11:01,145 [TezTaskRunner_attempt_1429683757595_0321_9_00_000928_0(container_1_0321_01_008943_sershe_20150424110948_86ce1f6f-7cd2-4a40-b9a6-4a6854f010f6:9_Map 1_928_0)] INFO org.apache.tez.runtime.task.TezTaskRunner: Encounted an error while executing task: attempt_1429683757595_0321_9_00_000928_0 java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1220) at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335) at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:439) at java.util.concurrent.ExecutorCompletionService.take(ExecutorCompletionService.java:193) at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.initialize(LogicalIOProcessorRuntimeTask.java:218) at
[jira] [Commented] (HIVE-10482) LLAP: AssertionError cannot allocate when reading from orc
[ https://issues.apache.org/jira/browse/HIVE-10482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14522282#comment-14522282 ] Siddharth Seth commented on HIVE-10482: --- The default. 1 GB I believe. LLAP: AsertionError cannot allocate when reading from orc - Key: HIVE-10482 URL: https://issues.apache.org/jira/browse/HIVE-10482 Project: Hive Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Sergey Shelukhin Fix For: llap This was from a run of tpch query 1. [~sershe] - not sure if you've already seen this. Creating a jira so that it doesn't get lost. {code} 2015-04-24 13:11:54,180 [TezTaskRunner_attempt_1429683757595_0326_4_00_000199_0(container_1_0326_01_003216_sseth_20150424131137_8ec6200c-77c8-43ea-a6a3-a0ab1da6e1ac:4_Map 1_199_0)] ERROR org.apache.hadoop.hive.ql.exec.tez.TezProcessor: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: java.io.IOException: java.lang.AssertionError: Cannot allocate at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:74) at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:314) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:148) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:137) at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:329) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:180) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:172) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:168) at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.IOException: java.io.IOException: java.lang.AssertionError: Cannot allocate at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121) at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77) at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:355) at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:79) at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:33) at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:116) at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.next(TezGroupedSplitsInputFormat.java:137) at org.apache.tez.mapreduce.lib.MRReaderMapred.next(MRReaderMapred.java:113) at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:62) ... 
16 more Caused by: java.io.IOException: java.lang.AssertionError: Cannot allocate at org.apache.hadoop.hive.llap.io.api.impl.LlapInputFormat$LlapRecordReader.rethrowErrorIfAny(LlapInputFormat.java:257) at org.apache.hadoop.hive.llap.io.api.impl.LlapInputFormat$LlapRecordReader.nextCvb(LlapInputFormat.java:209) at org.apache.hadoop.hive.llap.io.api.impl.LlapInputFormat$LlapRecordReader.next(LlapInputFormat.java:147) at org.apache.hadoop.hive.llap.io.api.impl.LlapInputFormat$LlapRecordReader.next(LlapInputFormat.java:97) at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:350) ... 22 more Caused by: java.lang.AssertionError: Cannot allocate at org.apache.hadoop.hive.ql.io.orc.InStream.readEncodedStream(InStream.java:761) at org.apache.hadoop.hive.ql.io.orc.EncodedReaderImpl.readEncodedColumns(EncodedReaderImpl.java:441) at org.apache.hadoop.hive.llap.io.encoded.OrcEncodedDataReader.callInternal(OrcEncodedDataReader.java:294)
[jira] [Resolved] (HIVE-10560) LLAP: a different NPE in shuffle
[ https://issues.apache.org/jira/browse/HIVE-10560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth resolved HIVE-10560. --- Resolution: Fixed Fix Version/s: llap LLAP: a different NPE in shuffle Key: HIVE-10560 URL: https://issues.apache.org/jira/browse/HIVE-10560 Project: Hive Issue Type: Sub-task Reporter: Sergey Shelukhin Assignee: Siddharth Seth Fix For: llap Attachments: HIVE-10560.1.txt Lots of those in Query 1 logs; ran just now on 8 daemons on recent version. {noformat} java.lang.NullPointerException at org.apache.hadoop.hive.llap.shufflehandler.ShuffleHandler.unregisterDag(ShuffleHandler.java:437) at org.apache.hadoop.hive.llap.daemon.impl.QueryTracker.queryComplete(QueryTracker.java:81) at org.apache.hadoop.hive.llap.daemon.impl.ContainerRunnerImpl.queryComplete(ContainerRunnerImpl.java:214) at org.apache.hadoop.hive.llap.daemon.impl.LlapDaemon.queryComplete(LlapDaemon.java:271) at org.apache.hadoop.hive.llap.daemon.impl.LlapDaemonProtocolServerImpl.queryComplete(LlapDaemonProtocolServerImpl.java:94) at org.apache.hadoop.hive.llap.daemon.rpc.LlapDaemonProtocolProtos$LlapDaemonProtocol$2.callBlockingMethod(LlapDaemonProtocolProtos.java:12278) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:972) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2088) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2084) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2082) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-10560) LLAP: a different NPE in shuffle
[ https://issues.apache.org/jira/browse/HIVE-10560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated HIVE-10560: -- Attachment: HIVE-10560.1.txt Forgot to add a check for the dirWatcher in case it's disabled. This should fix it. LLAP: a different NPE in shuffle Key: HIVE-10560 URL: https://issues.apache.org/jira/browse/HIVE-10560 Project: Hive Issue Type: Sub-task Reporter: Sergey Shelukhin Assignee: Siddharth Seth Attachments: HIVE-10560.1.txt Lots of those in Query 1 logs; ran just now on 8 daemons on recent version. {noformat} java.lang.NullPointerException at org.apache.hadoop.hive.llap.shufflehandler.ShuffleHandler.unregisterDag(ShuffleHandler.java:437) at org.apache.hadoop.hive.llap.daemon.impl.QueryTracker.queryComplete(QueryTracker.java:81) at org.apache.hadoop.hive.llap.daemon.impl.ContainerRunnerImpl.queryComplete(ContainerRunnerImpl.java:214) at org.apache.hadoop.hive.llap.daemon.impl.LlapDaemon.queryComplete(LlapDaemon.java:271) at org.apache.hadoop.hive.llap.daemon.impl.LlapDaemonProtocolServerImpl.queryComplete(LlapDaemonProtocolServerImpl.java:94) at org.apache.hadoop.hive.llap.daemon.rpc.LlapDaemonProtocolProtos$LlapDaemonProtocol$2.callBlockingMethod(LlapDaemonProtocolProtos.java:12278) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:972) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2088) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2084) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2082) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
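The fix described above amounts to guarding the dirWatcher access before use. A minimal sketch of that guard follows, assuming a nullable dirWatcher field; the class, field, and method names are simplified illustrations, not the actual ShuffleHandler source.
{code}
// Hedged sketch only: the missing null check the comment refers to.
// Names are assumptions, not the real Hive classes.
class ShuffleHandlerSketch {
  private final DirWatcher dirWatcher; // null when directory watching is disabled

  ShuffleHandlerSketch(DirWatcher dirWatcher) {
    this.dirWatcher = dirWatcher;
  }

  void unregisterDag(String dagDir) {
    // Guard against the disabled-dirWatcher case that caused the NPE.
    if (dirWatcher != null) {
      dirWatcher.unregisterDag(dagDir);
    }
  }

  interface DirWatcher {
    void unregisterDag(String dagDir);
  }
}
{code}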
[jira] [Resolved] (HIVE-9911) LLAP: Clean up structures and intermediate data when a query completes
[ https://issues.apache.org/jira/browse/HIVE-9911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth resolved HIVE-9911. -- Resolution: Fixed Committed to branch. LLAP: Clean up structures and intermediate data when a query completes -- Key: HIVE-9911 URL: https://issues.apache.org/jira/browse/HIVE-9911 Project: Hive Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Siddharth Seth Fix For: llap Attachments: HIVE-9911.1.txt -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-10480) LLAP: Tez task is interrupted for unknown reason after an IPC exception and then fails to report completion
[ https://issues.apache.org/jira/browse/HIVE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated HIVE-10480: -- Attachment: HIVE-10480.1.txt TEZ-2367 fixes this in a way which requires a minor update to the llap reporter. Uploading a patch for this. LLAP: Tez task is interrupted for unknown reason after an IPC exception and then fails to report completion --- Key: HIVE-10480 URL: https://issues.apache.org/jira/browse/HIVE-10480 Project: Hive Issue Type: Sub-task Reporter: Sergey Shelukhin Attachments: HIVE-10480.1.txt No idea if this is LLAP bug, Tez bug, Hadoop IPC bug (due to patch on the cluster), or all 3. So for now I will just dump all I have here. TPCH Q1 started running for a long time for me on large number of runs today (didn't happen yesterday). It would always be one Map task timing out. Example attempt (logs from am): {noformat} 2015-04-24 11:11:01,073 INFO [TaskCommunicator # 0] tezplugins.LlapTaskCommunicator: Successfully launched task: attempt_1429683757595_0321_9_00_000928_0 2015-04-24 11:16:25,498 INFO [Dispatcher thread: Central] history.HistoryEventHandler: [HISTORY][DAG:dag_1429683757595_0321_9][Event:TASK_ATTEMPT_FINISHED]: vertexName=Map 1, taskAttemptId=attempt_1429683757595_0321_9_00_000928_0, startTime=1429899061071, finishTime=1429899385498, timeTaken=324427, status=FAILED, errorEnum=TASK_HEARTBEAT_ERROR, diagnostics=AttemptID:attempt_1429683757595_0321_9_00_000928_0 Timed out after 300 secs, counters=Counters: 1, org.apache.tez.common.counters.DAGCounter, RACK_LOCAL_TASKS=1 {noformat} No other lines for this attempt in between. However there's this: {noformat} 2015-04-24 11:11:01,074 WARN [Socket Reader #1 for port 59446] ipc.Server: Unable to read call parameters for client 172.19.128.56on connection protocol org.apache.hadoop.hive.llap.protocol.LlapTaskUmbilicalProtocol for rpcKind RPC_WRITABLE java.lang.ArrayIndexOutOfBoundsException 2015-04-24 11:11:01,075 INFO [Socket Reader #1 for port 59446] ipc.Server: Socket Reader #1 for port 59446: readAndProcess from client 172.19.128.56 threw exception [org.apache.hadoop.ipc.RpcServerException: IPC server unable to read call parameters: null] {noformat} On LLAP, the following is logged {noformat} 2015-04-24 11:11:01,142 [TaskHeartbeatThread()] ERROR org.apache.tez.runtime.task.TezTaskRunner: TaskReporter reported error org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RpcServerException): IPC server unable to read call parameters: null at org.apache.hadoop.ipc.Client.call(Client.java:1492) at org.apache.hadoop.ipc.Client.call(Client.java:1423) at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:242) at com.sun.proxy.$Proxy19.heartbeat(Unknown Source) at org.apache.hadoop.hive.llap.daemon.impl.LlapTaskReporter$HeartbeatCallable.heartbeat(LlapTaskReporter.java:258) at org.apache.hadoop.hive.llap.daemon.impl.LlapTaskReporter$HeartbeatCallable.call(LlapTaskReporter.java:186) at org.apache.hadoop.hive.llap.daemon.impl.LlapTaskReporter$HeartbeatCallable.call(LlapTaskReporter.java:128) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {noformat} The attempt starts but is then interrupted (not clear by whom) {noformat} 2015-04-24 11:11:01,144 [Initializer 
0(container_1_0321_01_008943_sershe_20150424110948_86ce1f6f-7cd2-4a40-b9a6-4a6854f010f6:9_Map 1_928_0)] INFO org.apache.tez.runtime.LogicalIOProcessorRuntimeTask: Initialized Input with src edge: lineitem 2015-04-24 11:11:01,145 [TezTaskRunner_attempt_1429683757595_0321_9_00_000928_0(container_1_0321_01_008943_sershe_20150424110948_86ce1f6f-7cd2-4a40-b9a6-4a6854f010f6:9_Map 1_928_0)] INFO org.apache.tez.runtime.task.TezTaskRunner: Encounted an error while executing task: attempt_1429683757595_0321_9_00_000928_0 java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1220) at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335) at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:439) at java.util.concurrent.ExecutorCompletionService.take(ExecutorCompletionService.java:193) at
[jira] [Updated] (HIVE-9911) LLAP: Clean up structures and intermediate data when a query completes
[ https://issues.apache.org/jira/browse/HIVE-9911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated HIVE-9911: - Attachment: HIVE-9911.1.txt The patch changes intermediate data to be written to a dag specific directory, which gets cleaned up when a dag completes. LLAP: Clean up structures and intermediate data when a query completes -- Key: HIVE-9911 URL: https://issues.apache.org/jira/browse/HIVE-9911 Project: Hive Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Siddharth Seth Fix For: llap Attachments: HIVE-9911.1.txt -- This message was sent by Atlassian JIRA (v6.3.4#6332)
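As a rough illustration of the approach the patch describes (per-DAG subdirectories for intermediate data that can be removed wholesale when the DAG completes), here is a sketch under assumed names; the real directory layout and APIs in HIVE-9911 may differ.
{code}
// Hedged sketch only: intermediate data written under a DAG-scoped directory,
// deleted recursively on DAG completion. Paths and helper names are assumptions.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

class DagScratchDirs {
  private final Path baseDir;

  DagScratchDirs(Path baseDir) {
    this.baseDir = baseDir;
  }

  // All intermediate output for a DAG goes under <base>/<appId>/<dagId>/
  Path dirForDag(String appId, String dagId) throws IOException {
    return Files.createDirectories(baseDir.resolve(appId).resolve(dagId));
  }

  // Called when the DAG completes: delete the whole DAG directory recursively.
  void cleanupDag(String appId, String dagId) throws IOException {
    Path dagDir = baseDir.resolve(appId).resolve(dagId);
    if (!Files.exists(dagDir)) {
      return;
    }
    try (Stream<Path> paths = Files.walk(dagDir)) {
      paths.sorted(Comparator.reverseOrder())
           .forEach(p -> p.toFile().delete());
    }
  }
}
{code}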
[jira] [Assigned] (HIVE-9911) LLAP: Clean up structures and intermediate data when a query completes
[ https://issues.apache.org/jira/browse/HIVE-9911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth reassigned HIVE-9911: Assignee: Siddharth Seth LLAP: Clean up structures and intermediate data when a query completes -- Key: HIVE-9911 URL: https://issues.apache.org/jira/browse/HIVE-9911 Project: Hive Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Siddharth Seth Fix For: llap -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-10682) LLAP: Make use of the task runner which allows killing tasks
[ https://issues.apache.org/jira/browse/HIVE-10682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated HIVE-10682: -- Attachment: HIVE-10682.1.txt LLAP: Make use of the task runner which allows killing tasks Key: HIVE-10682 URL: https://issues.apache.org/jira/browse/HIVE-10682 Project: Hive Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Siddharth Seth Fix For: llap Attachments: HIVE-10682.1.txt TEZ-2434 adds a runner which allows tasks to be killed. Jira to integrate with that without the actual kill functionality. That will follow. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Reopened] (HIVE-10682) LLAP: Make use of the task runner which allows killing tasks
[ https://issues.apache.org/jira/browse/HIVE-10682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth reopened HIVE-10682: --- LLAP: Make use of the task runner which allows killing tasks Key: HIVE-10682 URL: https://issues.apache.org/jira/browse/HIVE-10682 Project: Hive Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Siddharth Seth Fix For: llap Attachments: HIVE-10682.1.txt TEZ-2434 adds a runner which allows tasks to be killed. Jira to integrate with that without the actual kill functionality. That will follow. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (HIVE-10682) LLAP: Make use of the task runner which allows killing tasks
[ https://issues.apache.org/jira/browse/HIVE-10682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth resolved HIVE-10682. --- Resolution: Fixed LLAP: Make use of the task runner which allows killing tasks Key: HIVE-10682 URL: https://issues.apache.org/jira/browse/HIVE-10682 Project: Hive Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Siddharth Seth Fix For: llap Attachments: HIVE-10682.1.txt TEZ-2434 adds a runner which allows tasks to be killed. Jira to integrate with that without the actual kill functionality. That will follow. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (HIVE-10682) LLAP: Make use of the task runner which allows killing tasks
[ https://issues.apache.org/jira/browse/HIVE-10682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth resolved HIVE-10682. --- Resolution: Pending Closed LLAP: Make use of the task runner which allows killing tasks Key: HIVE-10682 URL: https://issues.apache.org/jira/browse/HIVE-10682 Project: Hive Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Siddharth Seth Fix For: llap Attachments: HIVE-10682.1.txt TEZ-2434 adds a runner which allows tasks to be killed. Jira to integrate with that without the actual kill functionality. That will follow. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-10700) LLAP: Log additional debug information in the scheduler
[ https://issues.apache.org/jira/browse/HIVE-10700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated HIVE-10700: -- Attachment: HIVE-10700.1.txt LLAP: Log additional debug information in the scheduler --- Key: HIVE-10700 URL: https://issues.apache.org/jira/browse/HIVE-10700 Project: Hive Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Siddharth Seth Fix For: llap Attachments: HIVE-10700.1.txt Temporarily, while we're debugging issues. Changing to the DEBUG log level is too verbose. -- This message was sent by Atlassian JIRA (v6.3.4#6332)

[jira] [Resolved] (HIVE-10700) LLAP: Log additional debug information in the scheduler
[ https://issues.apache.org/jira/browse/HIVE-10700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth resolved HIVE-10700. --- Resolution: Fixed LLAP: Log additional debug information in the scheduler --- Key: HIVE-10700 URL: https://issues.apache.org/jira/browse/HIVE-10700 Project: Hive Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Siddharth Seth Fix For: llap Attachments: HIVE-10700.1.txt Temporarily, while we're debugging issues. Changing to the DEBUG log level is too verbose. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (HIVE-10652) LLAP: AM task communication retry is too long
[ https://issues.apache.org/jira/browse/HIVE-10652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth resolved HIVE-10652. --- Resolution: Fixed Fix Version/s: llap LLAP: AM task communication retry is too long - Key: HIVE-10652 URL: https://issues.apache.org/jira/browse/HIVE-10652 Project: Hive Issue Type: Sub-task Reporter: Sergey Shelukhin Assignee: Siddharth Seth Fix For: llap Attachments: HIVE-10652.1.txt Mentioned by [~sseth] while discussing HIVE-10648. 45sec (or whatever) is a bit too long. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (HIVE-10649) LLAP: AM gets stuck completely if one node is dead
[ https://issues.apache.org/jira/browse/HIVE-10649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth resolved HIVE-10649. --- Resolution: Duplicate LLAP: AM gets stuck completely if one node is dead -- Key: HIVE-10649 URL: https://issues.apache.org/jira/browse/HIVE-10649 Project: Hive Issue Type: Sub-task Reporter: Sergey Shelukhin Assignee: Siddharth Seth See HIVE-10648. When AM cannot connect to a node, that appears to cause it to stall; example log, there are no other interleaving logs even though this is happening in the middle of Map 1 on TPCH q1, i.e. there are plenty of tasks scheduled. From Assigning messages I can also see tasks are scheduled to all the nodes before and after the pause, not just to the problematic node. LLAP daemons have corresponding gaps where between two fragments nothing is ran for a long time on any daemon. {noformat} 2015-05-07 12:13:46,679 INFO [Dispatcher thread: Central] impl.TaskImpl: task_1429683757595_0784_1_00_000276 Task Transitioned from SCHEDULED to RUNNING due to event T_ATTEMPT_LAUNCHED 2015-05-07 12:13:46,811 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 10 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS) 2015-05-07 12:13:46,955 INFO [LlapSchedulerNodeEnabler] impl.LlapYarnRegistryImpl: Starting to refresh ServiceInstanceSet 1611673583 2015-05-07 12:13:47,811 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 11 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS) 2015-05-07 12:13:48,812 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 12 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS) 2015-05-07 12:13:49,813 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 13 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS) 2015-05-07 12:13:50,813 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 14 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS) 2015-05-07 12:13:51,814 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 15 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS) 2015-05-07 12:13:52,814 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 16 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS) 2015-05-07 12:13:53,815 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 17 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS) 2015-05-07 12:13:54,816 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
Already tried 18 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS) 2015-05-07 12:13:55,816 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 19 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS) 2015-05-07 12:13:56,817 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 20 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS) 2015-05-07 12:13:56,971 INFO [LlapSchedulerNodeEnabler] impl.LlapYarnRegistryImpl: Starting to refresh ServiceInstanceSet 1611673583 2015-05-07 12:13:57,817 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 21 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS) 2015-05-07 12:13:58,818 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server:
[jira] [Updated] (HIVE-10683) LLAP: Add a mechanism for daemons to inform the AM about killed tasks
[ https://issues.apache.org/jira/browse/HIVE-10683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated HIVE-10683: -- Attachment: HIVE-10683.1.txt LLAP: Add a mechanism for daemons to inform the AM about killed tasks - Key: HIVE-10683 URL: https://issues.apache.org/jira/browse/HIVE-10683 Project: Hive Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Siddharth Seth Fix For: llap Attachments: HIVE-10683.1.txt -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-10652) LLAP: AM task communication retry is too long
[ https://issues.apache.org/jira/browse/HIVE-10652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated HIVE-10652: -- Attachment: HIVE-10652.1.addendum.txt Addendum patch to remove some unused imports. LLAP: AM task communication retry is too long - Key: HIVE-10652 URL: https://issues.apache.org/jira/browse/HIVE-10652 Project: Hive Issue Type: Sub-task Reporter: Sergey Shelukhin Assignee: Siddharth Seth Fix For: llap Attachments: HIVE-10652.1.addendum.txt, HIVE-10652.1.txt Mentioned by [~sseth] while discussing HIVE-10648. 45sec (or whatever) is a bit too long. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (HIVE-10683) LLAP: Add a mechanism for daemons to inform the AM about killed tasks
[ https://issues.apache.org/jira/browse/HIVE-10683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth resolved HIVE-10683. --- Resolution: Fixed LLAP: Add a mechanism for daemons to inform the AM about killed tasks - Key: HIVE-10683 URL: https://issues.apache.org/jira/browse/HIVE-10683 Project: Hive Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Siddharth Seth Fix For: llap Attachments: HIVE-10683.1.txt -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-10652) LLAP: AM task communication retry is too long
[ https://issues.apache.org/jira/browse/HIVE-10652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated HIVE-10652: -- Attachment: HIVE-10652.1.txt Made configurable, and defaults to 16s. LLAP: AM task communication retry is too long - Key: HIVE-10652 URL: https://issues.apache.org/jira/browse/HIVE-10652 Project: Hive Issue Type: Sub-task Reporter: Sergey Shelukhin Assignee: Siddharth Seth Attachments: HIVE-10652.1.txt Mentioned by [~sseth] while discussing HIVE-10648. 45sec (or whatever) is a bit too long. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
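For reference, the retry window seen in the HIVE-10648/HIVE-10649 logs comes from a Hadoop IPC retry policy of the RetryUpToMaximumCountWithFixedSleep form. The sketch below shows one way a shorter, configurable window defaulting to 16 seconds could be expressed; the configuration key used here is illustrative only and is not necessarily the property the patch introduces.
{code}
// Hedged sketch: a retry policy bounded by a configurable total wait (default 16s).
// The "llap.task.communicator.connection.timeout-ms" key is a placeholder, not
// the actual property name from HIVE-10652.
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.retry.RetryPolicies;
import org.apache.hadoop.io.retry.RetryPolicy;

final class AmCommRetry {
  static RetryPolicy buildPolicy(Configuration conf) {
    long maxWaitMs = conf.getLong("llap.task.communicator.connection.timeout-ms", 16000L);
    // Retry once a second until the configured window is exhausted,
    // instead of a fixed 45-50 retries.
    return RetryPolicies.retryUpToMaximumTimeWithFixedSleep(
        maxWaitMs, 1000L, TimeUnit.MILLISECONDS);
  }
}
{code}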
[jira] [Updated] (HIVE-10649) LLAP: AM gets stuck completely if one node is dead
[ https://issues.apache.org/jira/browse/HIVE-10649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated HIVE-10649: -- Assignee: (was: Siddharth Seth) LLAP: AM gets stuck completely if one node is dead -- Key: HIVE-10649 URL: https://issues.apache.org/jira/browse/HIVE-10649 Project: Hive Issue Type: Sub-task Reporter: Sergey Shelukhin See HIVE-10648. When AM cannot connect to a node, that appears to cause it to stall; example log, there are no other interleaving logs even though this is happening in the middle of Map 1 on TPCH q1, i.e. there are plenty of tasks scheduled. From Assigning messages I can also see tasks are scheduled to all the nodes before and after the pause, not just to the problematic node. LLAP daemons have corresponding gaps where between two fragments nothing is ran for a long time on any daemon. {noformat} 2015-05-07 12:13:46,679 INFO [Dispatcher thread: Central] impl.TaskImpl: task_1429683757595_0784_1_00_000276 Task Transitioned from SCHEDULED to RUNNING due to event T_ATTEMPT_LAUNCHED 2015-05-07 12:13:46,811 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 10 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS) 2015-05-07 12:13:46,955 INFO [LlapSchedulerNodeEnabler] impl.LlapYarnRegistryImpl: Starting to refresh ServiceInstanceSet 1611673583 2015-05-07 12:13:47,811 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 11 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS) 2015-05-07 12:13:48,812 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 12 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS) 2015-05-07 12:13:49,813 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 13 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS) 2015-05-07 12:13:50,813 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 14 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS) 2015-05-07 12:13:51,814 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 15 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS) 2015-05-07 12:13:52,814 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 16 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS) 2015-05-07 12:13:53,815 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 17 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS) 2015-05-07 12:13:54,816 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
Already tried 18 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS) 2015-05-07 12:13:55,816 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 19 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS) 2015-05-07 12:13:56,817 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 20 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS) 2015-05-07 12:13:56,971 INFO [LlapSchedulerNodeEnabler] impl.LlapYarnRegistryImpl: Starting to refresh ServiceInstanceSet 1611673583 2015-05-07 12:13:57,817 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 21 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS) 2015-05-07 12:13:58,818 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already
[jira] [Resolved] (HIVE-10730) LLAP: fix guava stopwatch conflict
[ https://issues.apache.org/jira/browse/HIVE-10730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth resolved HIVE-10730. --- Resolution: Fixed LLAP: fix guava stopwatch conflict -- Key: HIVE-10730 URL: https://issues.apache.org/jira/browse/HIVE-10730 Project: Hive Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Siddharth Seth Fix For: llap Attachments: HIVE-10730.1.txt -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-10737) LLAP: task scheduler thread-count keeps growing
[ https://issues.apache.org/jira/browse/HIVE-10737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14549200#comment-14549200 ] Siddharth Seth commented on HIVE-10737: --- There's other OOMs on the TaskCommunicator threads before this specific OOM. The AM is obviously seeing memory pressure. This could be a result of the Orc cache, events being stored (large jobs), uberized fetchers - where data will be fetched into memory - and the AM size may not have accounted for this. LLAP: task scheduler thread-count keeps growing --- Key: HIVE-10737 URL: https://issues.apache.org/jira/browse/HIVE-10737 Project: Hive Issue Type: Sub-task Reporter: Gopal V Assignee: Siddharth Seth LLAP AppMasters die with {code} 2015-05-17 20:22:44,513 FATAL [Thread-97] yarn.YarnUncaughtExceptionHandler: Thread Thread[Thread-97,5,main] threw an Error. Shutting down now... java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:714) at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:950) at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1357) at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:134) at java.util.concurrent.Executors$DelegatedExecutorService.submit(Executors.java:681) at org.apache.tez.dag.app.rm.TaskSchedulerAppCallbackWrapper.taskAllocated(TaskSchedulerAppCallbackWrapper.java:60) at org.apache.tez.dag.app.rm.LocalTaskSchedulerService$AsyncDelegateRequestHandler.allocateTask(LocalTaskSchedulerService.java:410) at org.apache.tez.dag.app.rm.LocalTaskSchedulerService$AsyncDelegateRequestHandler.processRequest(LocalTaskSchedulerService.java:394) at org.apache.tez.dag.app.rm.LocalTaskSchedulerService$AsyncDelegateRequestHandler.run(LocalTaskSchedulerService.java:386) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-10737) LLAP: task scheduler thread-count keeps growing
[ https://issues.apache.org/jira/browse/HIVE-10737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14547624#comment-14547624 ] Siddharth Seth commented on HIVE-10737: --- [~gopalv] - do you happen to have a stack trace from the AM before it went OOM ? LLAP: task scheduler thread-count keeps growing --- Key: HIVE-10737 URL: https://issues.apache.org/jira/browse/HIVE-10737 Project: Hive Issue Type: Sub-task Reporter: Gopal V Assignee: Siddharth Seth LLAP AppMasters die with {code} 2015-05-17 20:22:44,513 FATAL [Thread-97] yarn.YarnUncaughtExceptionHandler: Thread Thread[Thread-97,5,main] threw an Error. Shutting down now... java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:714) at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:950) at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1357) at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:134) at java.util.concurrent.Executors$DelegatedExecutorService.submit(Executors.java:681) at org.apache.tez.dag.app.rm.TaskSchedulerAppCallbackWrapper.taskAllocated(TaskSchedulerAppCallbackWrapper.java:60) at org.apache.tez.dag.app.rm.LocalTaskSchedulerService$AsyncDelegateRequestHandler.allocateTask(LocalTaskSchedulerService.java:410) at org.apache.tez.dag.app.rm.LocalTaskSchedulerService$AsyncDelegateRequestHandler.processRequest(LocalTaskSchedulerService.java:394) at org.apache.tez.dag.app.rm.LocalTaskSchedulerService$AsyncDelegateRequestHandler.run(LocalTaskSchedulerService.java:386) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-10233) Hive on LLAP: Memory manager
[ https://issues.apache.org/jira/browse/HIVE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497034#comment-14497034 ] Siddharth Seth commented on HIVE-10233: --- Looked at just the Tez Configuration changes. - Since Hive will be setting the memory explicitly, disabling the Tez scaling makes sense. That's done by setting tez.task.scale.memory.enabled = false (TezConfiguration.TEZ_TASK_SCALE_MEMORY_ENABLED). This needs to be set before creating the AM, and applies to all DAGs running in the AM. - TezRuntimeConfiguration.TEZ_RUNTIME_IO_SORT_MB, TezRuntimeConfiguration.TEZ_RUNTIME_UNORDERED_OUTPUT_BUFFER_SIZE_MB - need to convert the memory from bytes to MB before setting these properties - edgeProp.getInputMemoryNeededPercent - this needs to be a fraction (0-1) (rather than an actual percentage (0-100)). Not sure what the method gives back right now. - Missed mentioning this in the offline discussions about the properties involved, one more needs to be set for the Ordered case. (TEZ_RUNTIME_INPUT_POST_MERGE_BUFFER_PERCENT). This is a measure of how much memory will be used after the merge is complete to avoid spilling to disk. This defaults to 0, but is typically a lower value than the MergeMemory. Given that this memory is always reserved for the Input, it can just be set to the Input merge memory. There's explicit APIs which can be used to configure these properties. {code} .setValueSerializationClass(TezBytesWritableSerialization.class.getName(), null) .configureOutput().setSortBufferSize([OUT_SIZE]).done() .configureInput().setShuffleBufferFraction(IN_FRACTION).setPostMergeBufferFraction(IN_FRACTION).done() {code} Similarly for the UnorderedCase. Hive on LLAP: Memory manager Key: HIVE-10233 URL: https://issues.apache.org/jira/browse/HIVE-10233 Project: Hive Issue Type: Bug Components: Tez Affects Versions: llap Reporter: Vikram Dixit K Assignee: Vikram Dixit K Attachments: HIVE-10233-WIP-2.patch, HIVE-10233-WIP.patch We need a memory manager in llap/tez to manage the usage of memory across threads. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
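A sketch of the property settings described in the comment above, using the constants it names; the byte-valued inputs and the surrounding helper are assumptions for illustration, and the actual Hive patch may use the builder-style APIs shown in the comment instead.
{code}
// Hedged sketch of the conversions called out above; helper name and inputs are
// illustrative, not the actual HIVE-10233 patch.
import org.apache.hadoop.conf.Configuration;
import org.apache.tez.dag.api.TezConfiguration;
import org.apache.tez.runtime.library.api.TezRuntimeConfiguration;

final class LlapEdgeMemoryConf {
  static void apply(Configuration conf,
                    long sortBufferBytes,          // ordered output buffer, in bytes
                    long unorderedBufferBytes,     // unordered output buffer, in bytes
                    double inputMemoryFraction) {  // must already be in [0, 1], not 0-100
    // Hive sets memory explicitly, so Tez's own scaling is disabled.
    // This has to be in the AM configuration, before any DAG is submitted.
    conf.setBoolean(TezConfiguration.TEZ_TASK_SCALE_MEMORY_ENABLED, false);

    // Both output-buffer properties are in MB, so convert from bytes first.
    conf.setInt(TezRuntimeConfiguration.TEZ_RUNTIME_IO_SORT_MB,
        (int) (sortBufferBytes / (1024 * 1024)));
    conf.setInt(TezRuntimeConfiguration.TEZ_RUNTIME_UNORDERED_OUTPUT_BUFFER_SIZE_MB,
        (int) (unorderedBufferBytes / (1024 * 1024)));

    // Ordered inputs also need the post-merge fraction; reusing the input merge
    // memory, as suggested above, keeps the reservation consistent.
    conf.setFloat(TezRuntimeConfiguration.TEZ_RUNTIME_INPUT_POST_MERGE_BUFFER_PERCENT,
        (float) inputMemoryFraction);
  }
}
{code}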
[jira] [Commented] (HIVE-10029) LLAP: Scheduling of work from different queries within the daemon
[ https://issues.apache.org/jira/browse/HIVE-10029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498716#comment-14498716 ] Siddharth Seth commented on HIVE-10029: --- Yes, for the most part. We'll likely need a follow-up to provide data to the pre-emption queue. LLAP: Scheduling of work from different queries within the daemon - Key: HIVE-10029 URL: https://issues.apache.org/jira/browse/HIVE-10029 Project: Hive Issue Type: Sub-task Reporter: Siddharth Seth Fix For: llap The current implementation is a simple queue - whichever query wins the race to submit work to a daemon will execute first. A policy around this may be useful - potentially a fair share, or a "first query in gets all slots" approach. Also, priority associated with work within a query should be considered. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
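Purely as an illustration of the kind of policy discussed above (fair share across queries plus per-query priority), here is a hedged sketch of an ordering for queued work; none of these names come from the actual LLAP scheduler.
{code}
// Illustrative only: one way to order queued fragments so a single query cannot
// monopolize a daemon and more urgent fragments within a query run first.
// All names here are hypothetical, not the LLAP scheduler's classes.
import java.util.Comparator;
import java.util.concurrent.PriorityBlockingQueue;

final class WorkItem {
  final String queryId;
  final int withinQueryPriority;              // lower value = more urgent
  final int fragmentsAlreadyRunningForQuery;  // snapshot used for fair-share ordering

  WorkItem(String queryId, int withinQueryPriority, int running) {
    this.queryId = queryId;
    this.withinQueryPriority = withinQueryPriority;
    this.fragmentsAlreadyRunningForQuery = running;
  }
}

final class WorkQueue {
  // Prefer queries currently using fewer slots (fair share), then urgency within the query.
  private final PriorityBlockingQueue<WorkItem> queue = new PriorityBlockingQueue<>(
      64,
      Comparator.comparingInt((WorkItem w) -> w.fragmentsAlreadyRunningForQuery)
                .thenComparingInt(w -> w.withinQueryPriority));

  void offer(WorkItem w) { queue.offer(w); }

  WorkItem take() throws InterruptedException { return queue.take(); }
}
{code}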
[jira] [Updated] (HIVE-10335) LLAP: IndexOutOfBound in MapJoinOperator
[ https://issues.apache.org/jira/browse/HIVE-10335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated HIVE-10335: -- Fix Version/s: llap LLAP: IndexOutOfBound in MapJoinOperator Key: HIVE-10335 URL: https://issues.apache.org/jira/browse/HIVE-10335 Project: Hive Issue Type: Sub-task Reporter: Siddharth Seth Fix For: llap {code} 2015-04-14 13:57:55,889 [TezTaskRunner_attempt_1428572510173_0173_2_03_14_0(container_1_0173_01_66_sseth_20150414135750_7a7c2f4f-5f2d-4645-b833-677621f087bd:2_Map 1_14_0)] ERROR org.apache.hadoop.hive.ql.exec.MapJoinOperator: Unexpected exception: Index: 0, Size: 0 java.lang.IndexOutOfBoundsException: Index: 0, Size: 0 at java.util.ArrayList.rangeCheck(ArrayList.java:653) at java.util.ArrayList.get(ArrayList.java:429) at org.apache.hadoop.hive.ql.exec.persistence.UnwrapRowContainer.unwrap(UnwrapRowContainer.java:79) at org.apache.hadoop.hive.ql.exec.persistence.UnwrapRowContainer.first(UnwrapRowContainer.java:62) at org.apache.hadoop.hive.ql.exec.persistence.UnwrapRowContainer.first(UnwrapRowContainer.java:33) at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genAllOneUniqueJoinObject(CommonJoinOperator.java:670) at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:754) at org.apache.hadoop.hive.ql.exec.MapJoinOperator.process(MapJoinOperator.java:386) at org.apache.hadoop.hive.ql.exec.vector.VectorMapJoinOperator.process(VectorMapJoinOperator.java:283) at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:837) at org.apache.hadoop.hive.ql.exec.vector.VectorMapJoinOperator.flushOutput(VectorMapJoinOperator.java:232) at org.apache.hadoop.hive.ql.exec.vector.VectorMapJoinOperator.closeOp(VectorMapJoinOperator.java:240) at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:616) at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:630) at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.close(MapRecordProcessor.java:348) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:162) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:137) at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:332) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:180) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:172) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:168) at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-10335) LLAP: IndexOutOfBound in MapJoinOperator
[ https://issues.apache.org/jira/browse/HIVE-10335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14494916#comment-14494916 ] Siddharth Seth commented on HIVE-10335: --- Also {code} org.apache.hadoop.hive.ql.metadata.HiveException: Unexpected exception: 1024 at org.apache.hadoop.hive.ql.exec.MapJoinOperator.process(MapJoinOperator.java:398) at org.apache.hadoop.hive.ql.exec.vector.VectorMapJoinOperator.process(VectorMapJoinOperator.java:283) at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:837) at org.apache.hadoop.hive.ql.exec.vector.VectorMapJoinOperator.flushOutput(VectorMapJoinOperator.java:232) at org.apache.hadoop.hive.ql.exec.vector.VectorMapJoinOperator.closeOp(VectorMapJoinOperator.java:240) at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:616) at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:630) at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.close(MapRecordProcessor.java:348) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:162) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:137) at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:332) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:180) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:172) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:168) at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.ArrayIndexOutOfBoundsException: 1024 at org.apache.hadoop.hive.ql.exec.vector.VectorColumnAssignFactory$VectorLongColumnAssign.assignLong(VectorColumnAssignFactory.java:116) at org.apache.hadoop.hive.ql.exec.vector.VectorColumnAssignFactory$9.assignObjectValue(VectorColumnAssignFactory.java:296) at org.apache.hadoop.hive.ql.exec.vector.VectorMapJoinOperator.internalForward(VectorMapJoinOperator.java:223) at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genAllOneUniqueJoinObject(CommonJoinOperator.java:676) at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:754) at org.apache.hadoop.hive.ql.exec.MapJoinOperator.process(MapJoinOperator.java:386) ... 
22 more org.apache.hadoop.hive.ql.metadata.HiveException: Unexpected exception: null at org.apache.hadoop.hive.ql.exec.MapJoinOperator.process(MapJoinOperator.java:398) at org.apache.hadoop.hive.ql.exec.vector.VectorMapJoinOperator.process(VectorMapJoinOperator.java:283) at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:837) at org.apache.hadoop.hive.ql.exec.vector.VectorMapJoinOperator.flushOutput(VectorMapJoinOperator.java:232) at org.apache.hadoop.hive.ql.exec.vector.VectorMapJoinOperator.closeOp(VectorMapJoinOperator.java:240) at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:616) at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:630) at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.close(MapRecordProcessor.java:348) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:162) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:137) at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:332) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:180) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:172) at
[jira] [Updated] (HIVE-10229) Set conf and processor context in the constructor instead of init
[ https://issues.apache.org/jira/browse/HIVE-10229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated HIVE-10229: -- Issue Type: Bug (was: Sub-task) Parent: (was: HIVE-7926) Set conf and processor context in the constructor instead of init - Key: HIVE-10229 URL: https://issues.apache.org/jira/browse/HIVE-10229 Project: Hive Issue Type: Bug Environment: Reporter: Sergey Shelukhin Assignee: Siddharth Seth Hit this on ctas13 query. {noformat} Error: Failure while running task:java.lang.NullPointerException at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.init(ReduceRecordProcessor.java:98) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:134) at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:330) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:180) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:172) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:168) at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {noformat} The line is cacheKey = queryId + processorContext.getTaskVertexName() + REDUCE_PLAN_KEY; -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-10229) Set conf and processor context in the constructor instead of init
[ https://issues.apache.org/jira/browse/HIVE-10229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated HIVE-10229: -- Attachment: HIVE-10229.1.patch Fairly simple patch to set jconf and context during construction. Set conf and processor context in the constructor instead of init - Key: HIVE-10229 URL: https://issues.apache.org/jira/browse/HIVE-10229 Project: Hive Issue Type: Bug Environment: Reporter: Sergey Shelukhin Assignee: Siddharth Seth Fix For: 1.2.0 Attachments: HIVE-10229.1.patch Hit this on ctas13 query. {noformat} Error: Failure while running task:java.lang.NullPointerException at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.init(ReduceRecordProcessor.java:98) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:134) at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:330) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:180) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:172) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:168) at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {noformat} The line is cacheKey = queryId + processorContext.getTaskVertexName() + REDUCE_PLAN_KEY; -- This message was sent by Atlassian JIRA (v6.3.4#6332)
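The shape of the change described above (hand the conf and the processor context over in the constructor so init() never sees them as null) can be sketched as follows; the class and field names are simplified and are not the actual Hive sources.
{code}
// Hedged sketch of constructor injection instead of init()-time wiring;
// simplified names, not the real ReduceRecordProcessor.
import org.apache.hadoop.conf.Configuration;
import org.apache.tez.runtime.api.ProcessorContext;

class RecordProcessorSketch {
  private final Configuration jconf;
  private final ProcessorContext processorContext;
  private String cacheKey;

  // Both dependencies are required at construction time, so anything built later
  // (e.g. the cacheKey that triggered the NPE) cannot see a null context.
  RecordProcessorSketch(Configuration jconf, ProcessorContext processorContext) {
    this.jconf = jconf;
    this.processorContext = processorContext;
  }

  void init(String queryId, String reducePlanKey) {
    cacheKey = queryId + processorContext.getTaskVertexName() + reducePlanKey;
  }
}
{code}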
[jira] [Commented] (HIVE-10229) LLAP: NPE in ReduceRecordProcessor
[ https://issues.apache.org/jira/browse/HIVE-10229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482118#comment-14482118 ] Siddharth Seth commented on HIVE-10229: --- Yep. Same issue I saw. ProcessorContext is null. I'm going to upload a patch for trunk which sets the conf and context in the constructor instead of the init method. LLAP: NPE in ReduceRecordProcessor -- Key: HIVE-10229 URL: https://issues.apache.org/jira/browse/HIVE-10229 Project: Hive Issue Type: Sub-task Environment: Reporter: Sergey Shelukhin Assignee: Gunther Hagleitner Hit this on ctas13 query. {noformat} Error: Failure while running task:java.lang.NullPointerException at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.init(ReduceRecordProcessor.java:98) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:134) at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:330) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:180) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:172) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:168) at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {noformat} The line is cacheKey = queryId + processorContext.getTaskVertexName() + REDUCE_PLAN_KEY; -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (HIVE-10229) Set conf and processor context in the constructor instead of init
[ https://issues.apache.org/jira/browse/HIVE-10229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth reassigned HIVE-10229: - Assignee: Siddharth Seth (was: Gunther Hagleitner) Set conf and processor context in the constructor instead of init - Key: HIVE-10229 URL: https://issues.apache.org/jira/browse/HIVE-10229 Project: Hive Issue Type: Sub-task Environment: Reporter: Sergey Shelukhin Assignee: Siddharth Seth Hit this on ctas13 query. {noformat} Error: Failure while running task:java.lang.NullPointerException at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.init(ReduceRecordProcessor.java:98) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:134) at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:330) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:180) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:172) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:168) at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {noformat} The line is cacheKey = queryId + processorContext.getTaskVertexName() + REDUCE_PLAN_KEY; -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-10229) Set conf and processor context in the constructor instead of init
[ https://issues.apache.org/jira/browse/HIVE-10229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated HIVE-10229: -- Summary: Set conf and processor context in the constructor instead of init (was: LLAP: NPE in ReduceRecordProcessor) Set conf and processor context in the constructor instead of init - Key: HIVE-10229 URL: https://issues.apache.org/jira/browse/HIVE-10229 Project: Hive Issue Type: Sub-task Environment: Reporter: Sergey Shelukhin Assignee: Gunther Hagleitner Hit this on ctas13 query. {noformat} Error: Failure while running task:java.lang.NullPointerException at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.init(ReduceRecordProcessor.java:98) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:134) at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:330) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:180) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:172) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:168) at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {noformat} The line is cacheKey = queryId + processorContext.getTaskVertexName() + REDUCE_PLAN_KEY; -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (HIVE-10025) LLAP: Queued work times out
[ https://issues.apache.org/jira/browse/HIVE-10025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth reassigned HIVE-10025: - Assignee: Siddharth Seth LLAP: Queued work times out --- Key: HIVE-10025 URL: https://issues.apache.org/jira/browse/HIVE-10025 Project: Hive Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Siddharth Seth Fix For: llap If a daemon holds a task in queue for a long time, it'll eventually time out - but isn't removed from the queue. Ideally, it shouldn't be allowed to time out. Otherwise, handle the timeout so that the task doesn't run - or starts and fails - likely a change in the TaskCommunicator. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (HIVE-10013) NPE in LLAP logs in heartbeat
[ https://issues.apache.org/jira/browse/HIVE-10013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth resolved HIVE-10013. --- Resolution: Done This should be fixed as part of TEZ-2257. Please re-open if seen again. NPE in LLAP logs in heartbeat - Key: HIVE-10013 URL: https://issues.apache.org/jira/browse/HIVE-10013 Project: Hive Issue Type: Sub-task Reporter: Sergey Shelukhin {noformat} 2015-03-18 17:28:37,559 [TezTaskRunner_attempt_1424502260528_1294_1_00_25_0(container_1_1294_01_26_sershe_20150318172752_5ce4647e-177c-4b1e-8dfa-462230735854:1_Map 1_25_0)] INFO org.apache.tez.runtime.task.TezTaskRunner: Encounted an error while executing task: attempt_1424502260528_1294_1_00_25_0 java.lang.NullPointerException at org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.access$400(TaskReporter.java:120) at org.apache.tez.runtime.task.TaskReporter.addEvents(TaskReporter.java:386) at org.apache.tez.runtime.task.TezTaskRunner.addEvents(TezTaskRunner.java:278) at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.sendTaskGeneratedEvents(LogicalIOProcessorRuntimeTask.java:596) at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.close(LogicalIOProcessorRuntimeTask.java:355) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:181) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:171) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:167) at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 2015-03-18 17:28:37,559 [TezTaskRunner_attempt_1424502260528_1294_1_00_25_0(container_1_1294_01_26_sershe_20150318172752_5ce4647e-177c-4b1e-8dfa-462230735854:1_Map 1_25_0)] INFO org.apache.tez.runtime.task.TezTaskRunner: Ignoring the following exception since a previous exception is already registered java.lang.NullPointerException at org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.access$300(TaskReporter.java:120) at org.apache.tez.runtime.task.TaskReporter.taskFailed(TaskReporter.java:382) at org.apache.tez.runtime.task.TezTaskRunner.sendFailure(TezTaskRunner.java:260) at org.apache.tez.runtime.task.TezTaskRunner.access$600(TezTaskRunner.java:52) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:227) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:171) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:167) at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (HIVE-10025) LLAP: Queued work times out
[ https://issues.apache.org/jira/browse/HIVE-10025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth resolved HIVE-10025. --- Resolution: Fixed LLAP: Queued work times out --- Key: HIVE-10025 URL: https://issues.apache.org/jira/browse/HIVE-10025 Project: Hive Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Siddharth Seth Fix For: llap Attachments: HIVE-10025.1.txt If a daemon holds a task in queue for a long time, it'll eventually time out - but isn't removed from the queue. Ideally, it shouldn't be allowed to time out. Otherwise, handle the timeout so that the task doesn't run - or starts and fails - likely a change in the TaskCommunicator. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-10157) Make use of the timed version of getDagStatus in TezJobMonitor
[ https://issues.apache.org/jira/browse/HIVE-10157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated HIVE-10157: -- Fix Version/s: 1.2.0 Make use of the timed version of getDagStatus in TezJobMonitor -- Key: HIVE-10157 URL: https://issues.apache.org/jira/browse/HIVE-10157 Project: Hive Issue Type: Improvement Reporter: Siddharth Seth Assignee: Siddharth Seth Fix For: 1.2.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-10185) LLAP: LLAP IO doesn't get invoked inside MiniTezCluster q tests
[ https://issues.apache.org/jira/browse/HIVE-10185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14393097#comment-14393097 ] Siddharth Seth commented on HIVE-10185: --- MiniLlapCluster is not used yet. There's a jira open to wire it in. The cache should be usable in containers though, with the correct configuration ? LLAP: LLAP IO doesn't get invoked inside MiniTezCluster q tests --- Key: HIVE-10185 URL: https://issues.apache.org/jira/browse/HIVE-10185 Project: Hive Issue Type: Sub-task Reporter: Sergey Shelukhin Assignee: Siddharth Seth Took me a while to understand that it's not working. It might not be getting initialized inside the container processes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-10012) LLAP: Hive sessions run before Slider registers to YARN registry fail to launch
[ https://issues.apache.org/jira/browse/HIVE-10012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14490497#comment-14490497 ] Siddharth Seth commented on HIVE-10012: --- Glanced over. Mostly looks good to me. This removes some of the log messages when a host is selected for locality, which may be useful for debugging. Also there's a check for local addresses which needs to be added back to the FixedRegistryImpl. {code} inetAddress = InetAddress.getByName(host); if (NetUtils.isLocalAddress(inetAddress)) { {code} Required to match the hostname reported by a daemon and the one used by the scheduler. LLAP: Hive sessions run before Slider registers to YARN registry fail to launch --- Key: HIVE-10012 URL: https://issues.apache.org/jira/browse/HIVE-10012 Project: Hive Issue Type: Sub-task Affects Versions: llap Reporter: Gopal V Assignee: Gopal V Fix For: llap Attachments: HIVE-10012.1.patch, HIVE-10012.wip1.patch The LLAP YARN registry only registers entries after at least one daemon is up. Any Tez session starting before that will end up with an error listing zookeeper directories. {code} 2015-03-18 16:54:21,392 FATAL [main] app.DAGAppMaster: Error starting DAGAppMaster org.apache.hadoop.service.ServiceStateException: org.apache.hadoop.fs.PathNotFoundException: `/users/sershe/services/org-apache-hive/llap0/components/workers': {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
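The local-address check quoted above would roughly take the following shape when restored; the surrounding HostNormalizer class is purely illustrative (the real location is the FixedRegistryImpl referenced in the comment), and NetUtils is org.apache.hadoop.net.NetUtils.
{code}
import java.net.InetAddress;
import java.net.UnknownHostException;
import org.apache.hadoop.net.NetUtils;

/** Sketch: normalize a daemon-reported host so it matches the name the scheduler uses. */
final class HostNormalizer {
  static String canonicalize(String host) {
    try {
      InetAddress inetAddress = InetAddress.getByName(host);
      if (NetUtils.isLocalAddress(inetAddress)) {
        // Daemon and scheduler are on the same machine; use one canonical name
        // so locality matching does not miss because of localhost vs FQDN.
        return InetAddress.getLocalHost().getCanonicalHostName();
      }
    } catch (UnknownHostException e) {
      // Fall through and keep the name the daemon reported.
    }
    return host;
  }
}
{code}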
[jira] [Updated] (HIVE-10279) LLAP: Allow the runtime to check whether a task can run to completion
[ https://issues.apache.org/jira/browse/HIVE-10279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated HIVE-10279: -- Fix Version/s: llap LLAP: Allow the runtime to check whether a task can run to completion - Key: HIVE-10279 URL: https://issues.apache.org/jira/browse/HIVE-10279 Project: Hive Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Siddharth Seth Fix For: llap As part of pre-empting running tasks and deciding which tasks can run, allow the runtime to check whether a queued or running task has all of its sources complete and can run through to completion, without waiting for sources to finish. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
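The "can run to completion" check described above reduces to asking whether every upstream source has finished. A minimal sketch with invented names (SourceState, sourceStates); the branch code is more involved.
{code}
import java.util.Map;

/** Sketch of the finishable check: a fragment can run to completion only if no source can still block it. */
class FinishableCheck {
  enum SourceState { RUNNING, COMPLETED }

  static boolean canFinish(Map<String, SourceState> sourceStates) {
    // A queued or running fragment is finishable only if every upstream source
    // has already completed, i.e. it will never wait for new input.
    for (SourceState state : sourceStates.values()) {
      if (state != SourceState.COMPLETED) {
        return false;
      }
    }
    return true;
  }
}
{code}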
[jira] [Resolved] (HIVE-10767) LLAP: Improve the way task finishable information is processed
[ https://issues.apache.org/jira/browse/HIVE-10767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth resolved HIVE-10767. --- Resolution: Fixed Fix Version/s: llap LLAP: Improve the way task finishable information is processed -- Key: HIVE-10767 URL: https://issues.apache.org/jira/browse/HIVE-10767 Project: Hive Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Siddharth Seth Fix For: llap Attachments: HIVE-10767.1.txt -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Reopened] (HIVE-10764) LLAP: Wait queue scheduler goes into tight loop
[ https://issues.apache.org/jira/browse/HIVE-10764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth reopened HIVE-10764: --- LLAP: Wait queue scheduler goes into tight loop --- Key: HIVE-10764 URL: https://issues.apache.org/jira/browse/HIVE-10764 Project: Hive Issue Type: Sub-task Affects Versions: llap Reporter: Prasanth Jayachandran Assignee: Prasanth Jayachandran Fix For: llap Attachments: HIVE-10764.patch {code} if (!task.canFinish() || numSlotsAvailable.get() == 0) { {code} this condition makes the scheduler run in a tight loop when no slots are available and the task is finishable. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (HIVE-10764) LLAP: Wait queue scheduler goes into tight loop
[ https://issues.apache.org/jira/browse/HIVE-10764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth resolved HIVE-10764. --- Resolution: Implemented Done as part of HIVE-10767. The patch here was reverted. LLAP: Wait queue scheduler goes into tight loop --- Key: HIVE-10764 URL: https://issues.apache.org/jira/browse/HIVE-10764 Project: Hive Issue Type: Sub-task Affects Versions: llap Reporter: Prasanth Jayachandran Assignee: Prasanth Jayachandran Fix For: llap Attachments: HIVE-10764.patch {code} if (!task.canFinish() || numSlotsAvailable.get() == 0) { {code} this condition makes the scheduler run in a tight loop when no slots are available and the task is finishable. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
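The quoted condition spins because it is re-evaluated immediately whenever either check fails. One common way to avoid the busy loop (not necessarily how HIVE-10767 resolved it) is to block on a monitor until a slot frees up or the finishable state changes, as in the following sketch with invented names.
{code}
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch: block instead of spinning when the head of the wait queue cannot be scheduled yet. */
class WaitQueueSchedulerSketch {
  private final Object lock = new Object();
  private final AtomicInteger numSlotsAvailable = new AtomicInteger(0);

  void schedule(Task task) throws InterruptedException {
    synchronized (lock) {
      // Re-checked only after a wake-up, not in a tight loop.
      while (!task.canFinish() || numSlotsAvailable.get() == 0) {
        lock.wait();
      }
      numSlotsAvailable.decrementAndGet();
    }
    run(task);
  }

  void slotReleased() {                 // an executor slot became free
    numSlotsAvailable.incrementAndGet();
    wakeUp();
  }

  void finishableStateChanged() {       // e.g. an upstream source completed
    wakeUp();
  }

  private void wakeUp() {
    synchronized (lock) { lock.notifyAll(); }
  }

  void run(Task task) { /* hand off to the executor pool */ }

  interface Task { boolean canFinish(); }
}
{code}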
[jira] [Updated] (HIVE-10767) LLAP: Improve the way task finishable information is processed
[ https://issues.apache.org/jira/browse/HIVE-10767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated HIVE-10767: -- Attachment: HIVE-10767.1.txt LLAP: Improve the way task finishable information is processed -- Key: HIVE-10767 URL: https://issues.apache.org/jira/browse/HIVE-10767 Project: Hive Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Siddharth Seth Attachments: HIVE-10767.1.txt -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-10756) LLAP: Misc changes to daemon scheduling
[ https://issues.apache.org/jira/browse/HIVE-10756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated HIVE-10756: -- Attachment: HIVE-10756.1.txt [~prasanth_j] - could you take a quick look please. LLAP: Misc changes to daemon scheduling --- Key: HIVE-10756 URL: https://issues.apache.org/jira/browse/HIVE-10756 Project: Hive Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Siddharth Seth Fix For: llap Attachments: HIVE-10756.1.txt Running the completion callback in a separate thread to avoid potentially unnecessary preemptions. Sending out a kill to the AM only if the task was actually killed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (HIVE-10756) LLAP: Misc changes to daemon scheduling
[ https://issues.apache.org/jira/browse/HIVE-10756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth resolved HIVE-10756. --- Resolution: Fixed Thanks. Committed. LLAP: Misc changes to daemon scheduling --- Key: HIVE-10756 URL: https://issues.apache.org/jira/browse/HIVE-10756 Project: Hive Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Siddharth Seth Fix For: llap Attachments: HIVE-10756.1.txt Running the completion callback in a separate thread to avoid potentially unnecessary preemptions. Sending out a kill to the AM only if the task was actually killed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
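The two HIVE-10756 changes summarized above (completion callback off the scheduler thread, kill notification only when the task was actually killed) look roughly like the following; class and method names here are invented, the real change is in HIVE-10756.1.txt.
{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Sketch: run completion callbacks on a separate thread and only report real kills to the AM. */
class CompletionHandlingSketch {
  private final ExecutorService callbackExecutor =
      Executors.newSingleThreadExecutor();                 // keeps the scheduler thread free

  void onFragmentComplete(FragmentResult result) {
    callbackExecutor.execute(() -> {
      if (result.wasKilled()) {
        notifyAmOfKill(result);                            // only if the task was actually killed
      } else {
        notifyAmOfCompletion(result);
      }
    });
  }

  void notifyAmOfKill(FragmentResult r) { /* AM RPC */ }
  void notifyAmOfCompletion(FragmentResult r) { /* AM RPC */ }

  interface FragmentResult { boolean wasKilled(); }
}
{code}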
[jira] [Resolved] (HIVE-10765) LLAP: NPE when calling abort on the TezProcessor
[ https://issues.apache.org/jira/browse/HIVE-10765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth resolved HIVE-10765. --- Resolution: Fixed Fix Version/s: llap Assignee: Siddharth Seth LLAP: NPE when calling abort on the TezProcessor Key: HIVE-10765 URL: https://issues.apache.org/jira/browse/HIVE-10765 Project: Hive Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Siddharth Seth Priority: Critical Fix For: llap Attachments: HIVE-10765.1.txt, HIVE-10765.2.txt {code} 2015-05-19 19:48:42,827 [Wait-Queue-Scheduler-0(null)] ERROR org.apache.hadoop.hive.llap.daemon.impl.TaskExecutorService: Wait queue scheduler worker exited with failure! java.lang.NullPointerException at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.abort(TezProcessor.java:177) at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.abortTask(LogicalIOProcessorRuntimeTask.java:698) at org.apache.tez.runtime.task.TaskRunner2Callable.interruptTask(TaskRunner2Callable.java:118) at org.apache.tez.runtime.task.TezTaskRunner2.killTask(TezTaskRunner2.java:261) at org.apache.hadoop.hive.llap.daemon.impl.TaskRunnerCallable.killTask(TaskRunnerCallable.java:240) at org.apache.hadoop.hive.llap.daemon.impl.TaskExecutorService.trySchedule(TaskExecutorService.java:262) at org.apache.hadoop.hive.llap.daemon.impl.TaskExecutorService.access$700(TaskExecutorService.java:64) at org.apache.hadoop.hive.llap.daemon.impl.TaskExecutorService$WaitQueueWorker.run(TaskExecutorService.java:162) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} rrProc should be volatile. There likely need to be some checks around it to ensure it's setup. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-10765) LLAP: NPE when calling abort on the TezProcessor
[ https://issues.apache.org/jira/browse/HIVE-10765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated HIVE-10765: -- Attachment: HIVE-10765.2.txt Removed the volatile modifier which was part of the initial test patch. Committing. LLAP: NPE when calling abort on the TezProcessor Key: HIVE-10765 URL: https://issues.apache.org/jira/browse/HIVE-10765 Project: Hive Issue Type: Sub-task Reporter: Siddharth Seth Priority: Critical Attachments: HIVE-10765.1.txt, HIVE-10765.2.txt {code} 2015-05-19 19:48:42,827 [Wait-Queue-Scheduler-0(null)] ERROR org.apache.hadoop.hive.llap.daemon.impl.TaskExecutorService: Wait queue scheduler worker exited with failure! java.lang.NullPointerException at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.abort(TezProcessor.java:177) at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.abortTask(LogicalIOProcessorRuntimeTask.java:698) at org.apache.tez.runtime.task.TaskRunner2Callable.interruptTask(TaskRunner2Callable.java:118) at org.apache.tez.runtime.task.TezTaskRunner2.killTask(TezTaskRunner2.java:261) at org.apache.hadoop.hive.llap.daemon.impl.TaskRunnerCallable.killTask(TaskRunnerCallable.java:240) at org.apache.hadoop.hive.llap.daemon.impl.TaskExecutorService.trySchedule(TaskExecutorService.java:262) at org.apache.hadoop.hive.llap.daemon.impl.TaskExecutorService.access$700(TaskExecutorService.java:64) at org.apache.hadoop.hive.llap.daemon.impl.TaskExecutorService$WaitQueueWorker.run(TaskExecutorService.java:162) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} rrProc should be volatile. There likely need to be some checks around it to ensure it's setup. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
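The NPE above comes from abort() racing with initialization of the processor reference. A minimal sketch of the kind of guard the description points at (a volatile reference plus a null check, remembering an early abort); this is not the committed HIVE-10765 patch.
{code}
/** Sketch: guard abort() against running before the record processor has been set up. */
class TezProcessorSketch {
  // volatile so an abort() from the wait-queue scheduler thread sees the value
  // written by the task thread during initialization.
  private volatile RecordProcessor rrProc;
  private volatile boolean aborted = false;

  void abort() {
    aborted = true;                 // remember the request even if init has not happened yet
    RecordProcessor proc = rrProc;  // read once; may legitimately still be null
    if (proc != null) {
      proc.abort();
    }
  }

  void run(RecordProcessor proc) {
    rrProc = proc;
    if (aborted) {                  // abort arrived before rrProc was published
      proc.abort();
      return;
    }
    proc.run();
  }

  interface RecordProcessor { void run(); void abort(); }
}
{code}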
[jira] [Assigned] (HIVE-10779) LLAP: Daemons should shutdown in case of fatal errors
[ https://issues.apache.org/jira/browse/HIVE-10779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth reassigned HIVE-10779: - Assignee: Siddharth Seth LLAP: Daemons should shutdown in case of fatal errors - Key: HIVE-10779 URL: https://issues.apache.org/jira/browse/HIVE-10779 Project: Hive Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Siddharth Seth Attachments: HIVE-10779.1.txt For example, the scheduler loop exiting. Currently they end up getting stuck - while still accepting new work. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-10779) LLAP: Daemons should shutdown in case of fatal errors
[ https://issues.apache.org/jira/browse/HIVE-10779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated HIVE-10779: -- Attachment: HIVE-10779.1.txt Patch adds an UncaughtExceptionHandler and a shutdown hook to stop services. LLAP: Daemons should shutdown in case of fatal errors - Key: HIVE-10779 URL: https://issues.apache.org/jira/browse/HIVE-10779 Project: Hive Issue Type: Sub-task Reporter: Siddharth Seth Attachments: HIVE-10779.1.txt For example, the scheduler loop exiting. Currently they end up getting stuck - while still accepting new work. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (HIVE-10779) LLAP: Daemons should shutdown in case of fatal errors
[ https://issues.apache.org/jira/browse/HIVE-10779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth resolved HIVE-10779. --- Resolution: Fixed Fix Version/s: llap Committed to the llap branch. LLAP: Daemons should shutdown in case of fatal errors - Key: HIVE-10779 URL: https://issues.apache.org/jira/browse/HIVE-10779 Project: Hive Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Siddharth Seth Fix For: llap Attachments: HIVE-10779.1.txt For example, the scheduler loop exiting. Currently they end up getting stuck - while still accepting new work. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
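In outline, the UncaughtExceptionHandler plus shutdown hook combination described for HIVE-10779 looks like the sketch below; LlapDaemonSketch, startServices() and stopServices() are placeholders for the real daemon wiring rather than the committed code.
{code}
/** Sketch: bring the daemon down on fatal errors instead of leaving it half-alive. */
public class LlapDaemonSketch {
  public static void main(String[] args) {
    // Any thread dying with an uncaught exception is treated as fatal.
    Thread.setDefaultUncaughtExceptionHandler((t, e) -> {
      System.err.println("Thread " + t.getName() + " failed, shutting down: " + e);
      System.exit(1);                       // triggers the shutdown hook below
    });

    // Stop services cleanly whether the exit is fatal or a normal stop.
    Runtime.getRuntime().addShutdownHook(new Thread(LlapDaemonSketch::stopServices));

    startServices();
  }

  static void startServices() { /* start RPC server, executors, shuffle handler, ... */ }
  static void stopServices()  { /* stop them in reverse order */ }
}
{code}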
[jira] [Updated] (HIVE-10961) LLAP: ShuffleHandler + Submit work init race condition
[ https://issues.apache.org/jira/browse/HIVE-10961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated HIVE-10961: -- Attachment: HIVE-10961.1.txt LLAP: ShuffleHandler + Submit work init race condition -- Key: HIVE-10961 URL: https://issues.apache.org/jira/browse/HIVE-10961 Project: Hive Issue Type: Sub-task Affects Versions: llap Reporter: Gopal V Assignee: Siddharth Seth Fix For: llap Attachments: HIVE-10961.1.txt When flexing in a new node, it accepts DAG requests before the shuffle handler is setup, causing fatals {code} DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:2 FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Map 1, vertexId=vertex_1433459966952_0729_1_00, diagnostics=[Task failed, taskId=task_1t at com.google.common.base.Preconditions.checkState(Preconditions.java:145) at org.apache.hadoop.hive.llap.shufflehandler.ShuffleHandler.get(ShuffleHandler.java:353) at org.apache.hadoop.hive.llap.daemon.impl.ContainerRunnerImpl.submitWork(ContainerRunnerImpl.java:192) at org.apache.hadoop.hive.llap.daemon.impl.LlapDaemon.submitWork(LlapDaemon.java:301) at org.apache.hadoop.hive.llap.daemon.impl.LlapDaemonProtocolServerImpl.submitWork(LlapDaemonProtocolServerImpl.java:75) at org.apache.hadoop.hive.llap.daemon.rpc.LlapDaemonProtocolProtos$LlapDaemonProtocol$2.callBlockingMethod(LlapDaemonProtocolProtos.java:12094) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:972) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2085) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2081) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1654) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2081) ], TaskAttempt 1 failed, info=[org.apache.hadoop.ipc.RemoteException(java.lang.IllegalStateException): ShuffleHandler must be started before invoking get at com.google.common.base.Preconditions.checkState(Preconditions.java:145) at org.apache.hadoop.hive.llap.shufflehandler.ShuffleHandler.get(ShuffleHandler.java:353) at org.apache.hadoop.hive.llap.daemon.impl.ContainerRunnerImpl.submitWork(ContainerRunnerImpl.java:192) at org.apache.hadoop.hive.llap.daemon.impl.LlapDaemon.submitWork(LlapDaemon.java:301) at org.apache.hadoop.hive.llap.daemon.impl.LlapDaemonProtocolServerImpl.submitWork(LlapDaemonProtocolServerImpl.java:75) at org.apache.hadoop.hive.llap.daemon.rpc.LlapDaemonProtocolProtos$LlapDaemonProtocol$2.callBlockingMethod(LlapDaemonProtocolProtos.java:12094) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:972) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2085) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (HIVE-10961) LLAP: ShuffleHandler + Submit work init race condition
[ https://issues.apache.org/jira/browse/HIVE-10961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth resolved HIVE-10961. --- Resolution: Fixed LLAP: ShuffleHandler + Submit work init race condition -- Key: HIVE-10961 URL: https://issues.apache.org/jira/browse/HIVE-10961 Project: Hive Issue Type: Sub-task Affects Versions: llap Reporter: Gopal V Assignee: Siddharth Seth Fix For: llap Attachments: HIVE-10961.1.txt When flexing in a new node, it accepts DAG requests before the shuffle handler is setup, causing fatals {code} DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:2 FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Map 1, vertexId=vertex_1433459966952_0729_1_00, diagnostics=[Task failed, taskId=task_1t at com.google.common.base.Preconditions.checkState(Preconditions.java:145) at org.apache.hadoop.hive.llap.shufflehandler.ShuffleHandler.get(ShuffleHandler.java:353) at org.apache.hadoop.hive.llap.daemon.impl.ContainerRunnerImpl.submitWork(ContainerRunnerImpl.java:192) at org.apache.hadoop.hive.llap.daemon.impl.LlapDaemon.submitWork(LlapDaemon.java:301) at org.apache.hadoop.hive.llap.daemon.impl.LlapDaemonProtocolServerImpl.submitWork(LlapDaemonProtocolServerImpl.java:75) at org.apache.hadoop.hive.llap.daemon.rpc.LlapDaemonProtocolProtos$LlapDaemonProtocol$2.callBlockingMethod(LlapDaemonProtocolProtos.java:12094) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:972) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2085) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2081) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1654) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2081) ], TaskAttempt 1 failed, info=[org.apache.hadoop.ipc.RemoteException(java.lang.IllegalStateException): ShuffleHandler must be started before invoking get at com.google.common.base.Preconditions.checkState(Preconditions.java:145) at org.apache.hadoop.hive.llap.shufflehandler.ShuffleHandler.get(ShuffleHandler.java:353) at org.apache.hadoop.hive.llap.daemon.impl.ContainerRunnerImpl.submitWork(ContainerRunnerImpl.java:192) at org.apache.hadoop.hive.llap.daemon.impl.LlapDaemon.submitWork(LlapDaemon.java:301) at org.apache.hadoop.hive.llap.daemon.impl.LlapDaemonProtocolServerImpl.submitWork(LlapDaemonProtocolServerImpl.java:75) at org.apache.hadoop.hive.llap.daemon.rpc.LlapDaemonProtocolProtos$LlapDaemonProtocol$2.callBlockingMethod(LlapDaemonProtocolProtos.java:12094) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:972) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2085) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
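The failure above is an ordering problem: the daemon's RPC endpoint accepts submitWork() before the shuffle handler has started, so ShuffleHandler.get() trips its precondition. The obvious guard is to start the shuffle handler before anything that can submit work, sketched here with invented service names rather than the actual HIVE-10961 change.
{code}
/** Sketch: ensure the shuffle handler is running before the daemon accepts work. */
class DaemonStartupSketch {
  // Placeholders for the real services.
  final Service shuffleHandler = new Service();
  final Service containerRunner = new Service();
  final Service protocolServer = new Service();

  void serviceStart() {
    shuffleHandler.start();        // must be up before any fragment can be submitted
    containerRunner.start();
    protocolServer.start();        // only now can the AM reach submitWork()
  }

  void serviceStop() {
    protocolServer.stop();         // stop accepting work first
    containerRunner.stop();
    shuffleHandler.stop();
  }

  static class Service {
    void start() { }
    void stop() { }
  }
}
{code}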
[jira] [Commented] (HIVE-10947) LLAP: preemption appears to count against failure count for the task
[ https://issues.apache.org/jira/browse/HIVE-10947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14579604#comment-14579604 ] Siddharth Seth commented on HIVE-10947: --- If this happens again, please capture the logs. I'm not sure these tasks were actually preempted. They may have failed for other reasons. THere's 20 additional attempts, most of which were KILLED (likely due to preemption) before the 2 FAILED aatempts - which caused the task to fail. LLAP: preemption appears to count against failure count for the task Key: HIVE-10947 URL: https://issues.apache.org/jira/browse/HIVE-10947 Project: Hive Issue Type: Sub-task Reporter: Sergey Shelukhin Assignee: Siddharth Seth Looks like the following stack in very parallel workload counts as task error and DAG fails: {noformat} : Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Map 1, vertexId=vertex_1433459966952_0482_4_03, diagnostics=[Task failed, taskId=task_1433459966952_0482_4_03_22, diagnostics=[TaskAttempt 0 killed, TaskAttempt 1 killed, TaskAttempt 2 killed, TaskAttempt 3 killed, TaskAttempt 4 killed, TaskAttempt 5 killed, TaskAttempt 6 killed, TaskAttempt 7 killed, TaskAttempt 8 killed, TaskAttempt 9 killed, TaskAttempt 10 killed, TaskAttempt 11 killed, TaskAttempt 12 killed, TaskAttempt 13 killed, TaskAttempt 14 killed, TaskAttempt 15 killed, TaskAttempt 16 killed, TaskAttempt 17 killed, TaskAttempt 18 killed, TaskAttempt 19 failed, info=[Error: Failure while running task: attempt_1433459966952_0482_4_03_22_19:java.lang.RuntimeException: java.lang.RuntimeException: Map operator initialization failed at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:181) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:146) at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:349) at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:71) at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:60) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1654) at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:60) at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:35) at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.RuntimeException: Map operator initialization failed at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.init(MapRecordProcessor.java:256) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:157) ... 
14 more Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Async initialization failed at org.apache.hadoop.hive.ql.exec.Operator.completeInitialization(Operator.java:416) at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:388) at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:511) at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:464) at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:378) at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.init(MapRecordProcessor.java:241) ... 15 more Caused by: java.util.concurrent.CancellationException at java.util.concurrent.FutureTask.report(FutureTask.java:121) at java.util.concurrent.FutureTask.get(FutureTask.java:192) at org.apache.hadoop.hive.ql.exec.Operator.completeInitialization(Operator.java:408) ... 20 more ], TaskAttempt 20 failed, info=[Error: Failure while running task: attempt_1433459966952_0482_4_03_22_20:java.lang.RuntimeException: java.lang.RuntimeException: Map operator initialization failed at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:181) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:146) at
[jira] [Commented] (HIVE-11046) Filesystem Closed Exception
[ https://issues.apache.org/jira/browse/HIVE-11046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14592478#comment-14592478 ] Siddharth Seth commented on HIVE-11046: --- [~raj_velu] - bunch of questions. Do you have additional logs from the container where this error was seen ? Also any steps to reproduce and how often are you able to reproduce this ? Is this using the Tez 0.7.0 release or a snapshot ? Filesystem Closed Exception --- Key: HIVE-11046 URL: https://issues.apache.org/jira/browse/HIVE-11046 Project: Hive Issue Type: Bug Components: Hive, Tez Affects Versions: 0.7.0, 1.2.0 Environment: Hive 1.2.0, Tez0.7.0, HDP2.2, Hadoop 2.6 Reporter: Soundararajan Velu TaskAttempt 2 failed, info=[Error: Failure while running task:java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: Filesystem closed at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:171) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:137) at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:345) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:171) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:167) at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: Filesystem closed at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:71) at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:290) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:148) ... 
14 more Caused by: java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:795) at org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:629) at java.io.FilterInputStream.close(FilterInputStream.java:181) at org.apache.hadoop.io.compress.DecompressorStream.close(DecompressorStream.java:205) at org.apache.hadoop.util.LineReader.close(LineReader.java:150) at org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:282) at org.apache.hadoop.hive.ql.io.HiveRecordReader.doClose(HiveRecordReader.java:50) at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.close(HiveContextAwareRecordReader.java:104) at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:170) at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.next(TezGroupedSplitsInputFormat.java:138) at org.apache.tez.mapreduce.lib.MRReaderMapred.next(MRReaderMapred.java:113) at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:61) ... 16 more -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (HIVE-10762) LLAP: Kill any fragments running in a daemon when a query completes
[ https://issues.apache.org/jira/browse/HIVE-10762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth resolved HIVE-10762. --- Resolution: Fixed Committed to the llap branch. LLAP: Kill any fragments running in a daemon when a query completes --- Key: HIVE-10762 URL: https://issues.apache.org/jira/browse/HIVE-10762 Project: Hive Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Siddharth Seth Fix For: llap Attachments: HIVE-10762.1.txt A query may complete due to failure or being KILLED. Fragments running in daemons should be killed in these scenarios. -- This message was sent by Atlassian JIRA (v6.3.4#6332)