[jira] [Commented] (HIVE-9976) LLAP: Possible race condition in DynamicPartitionPruner for 200ms tasks

2015-03-16 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14363909#comment-14363909
 ] 

Siddharth Seth commented on HIVE-9976:
--

I'll take a look. Assuming this was run with a Tez 0.7 snapshot?

 LLAP: Possible race condition in DynamicPartitionPruner for 200ms tasks
 

 Key: HIVE-9976
 URL: https://issues.apache.org/jira/browse/HIVE-9976
 Project: Hive
  Issue Type: Sub-task
  Components: Tez
Affects Versions: llap
Reporter: Gopal V
Assignee: Gunther Hagleitner
 Attachments: llap_vertex_200ms.png


 Race condition in the DynamicPartitionPruner between 
 DynamicPartitionPruner::processVertex() and 
 DynamicPartitionPruner::addEvent() for tasks which respond with both the 
 result and success in a single heartbeat sequence.
 {code}
 2015-03-16 07:05:01,589 ERROR [InputInitializer [Map 1] #0] 
 tez.DynamicPartitionPruner: Expecting: 1, received: 0
 2015-03-16 07:05:01,590 ERROR [Dispatcher thread: Central] impl.VertexImpl: 
 Vertex Input: store_sales initializer failed, 
 vertex=vertex_1424502260528_1113_4_04 [Map 1]
 org.apache.tez.dag.app.dag.impl.AMUserCodeException: 
 org.apache.hadoop.hive.ql.metadata.HiveException: Incorrect event count in 
 dynamic parition pruning
 {code}
 !llap_vertex_200ms.png!
 All 4 upstream vertices of Map 1 need to finish within ~200ms to trigger 
 this, which seems to be consistently happening with LLAP.
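 For context, a minimal sketch of the kind of ordering-tolerant bookkeeping that avoids this failure mode - the class and method names below are illustrative only, not the actual Hive code:
 {code}
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: a pruner that tolerates the vertex-success notification
// and its pruning events arriving in either order. registerSource() must be
// called for a vertex before addEvent()/processVertex() are invoked for it.
public class SimplePruner {
  private final Map<String, Integer> expectedEvents = new HashMap<>();
  private final Map<String, Integer> receivedEvents = new HashMap<>();
  private final Map<String, Boolean> vertexFinished = new HashMap<>();

  public synchronized void registerSource(String vertex, int expected) {
    expectedEvents.put(vertex, expected);
    receivedEvents.put(vertex, 0);
    vertexFinished.put(vertex, false);
  }

  // Called for each pruning event from a source task.
  public synchronized void addEvent(String vertex) {
    receivedEvents.merge(vertex, 1, Integer::sum);
    maybeComplete(vertex);
  }

  // Called when the source vertex reports success; may race with addEvent().
  public synchronized void processVertex(String vertex) {
    vertexFinished.put(vertex, true);
    maybeComplete(vertex);
  }

  // Only validate the counts once both conditions hold, instead of failing
  // the moment the success notification arrives ahead of the events.
  private void maybeComplete(String vertex) {
    if (vertexFinished.get(vertex)
        && receivedEvents.get(vertex).intValue() >= expectedEvents.get(vertex)) {
      // prune partitions for this source here
    }
  }
}
 {code}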



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9756) LLAP: use log4j 2 for llap

2015-03-16 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14363919#comment-14363919
 ] 

Siddharth Seth commented on HIVE-9756:
--

[~gopalv] - Tez is moving to slf4j in the 0.7 release (TEZ-2176). 
Unfortunately, Hadoop provides log4j as well - so this may be problematic 
anyway. We'll find out once the Tez patch goes in. 

 LLAP: use log4j 2 for llap
 --

 Key: HIVE-9756
 URL: https://issues.apache.org/jira/browse/HIVE-9756
 Project: Hive
  Issue Type: Sub-task
Reporter: Gunther Hagleitner
Assignee: Gopal V

 For the INFO logging, we'll need to use the log4j-jcl 2.x upgrade-path to get 
 throughput friendly logging.
 http://logging.apache.org/log4j/2.0/manual/async.html#Performance
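 For reference, the all-async mode described on that page is switched on with a JVM system property (and needs the LMAX Disruptor jar on the classpath); the mixed alternative uses AsyncLogger elements in log4j2.xml. This is generic Log4j 2 configuration, not an LLAP-specific setting:
 {noformat}
 -DLog4jContextSelector=org.apache.logging.log4j.core.async.AsyncLoggerContextSelector
 {noformat}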



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9999) LLAP: Handle task rejection from daemons in the AM

2015-03-17 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated HIVE-9999:
-
Attachment: HIVE-9999.1.patch

 LLAP: Handle task rejection from daemons in the AM
 --

 Key: HIVE-9999
 URL: https://issues.apache.org/jira/browse/HIVE-9999
 Project: Hive
  Issue Type: Sub-task
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Fix For: llap

 Attachments: HIVE-9999.1.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HIVE-9999) LLAP: Handle task rejection from daemons in the AM

2015-03-17 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth resolved HIVE-9999.
--
Resolution: Fixed

 LLAP: Handle task rejection from daemons in the AM
 --

 Key: HIVE-9999
 URL: https://issues.apache.org/jira/browse/HIVE-9999
 Project: Hive
  Issue Type: Sub-task
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Fix For: llap

 Attachments: HIVE-9999.1.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9912) LLAP: Improvements to the Shuffle handler to avoid unnecessary disk scans

2015-03-19 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated HIVE-9912:
-
Attachment: HIVE-9912.1.txt

Patch to cache files which have been previously scanned, and add a watcher for 
files being created. Also reduces the logging on new work submissions.
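Roughly the idea, as a sketch (not the actual ShuffleHandler code) - a concurrent set of already-seen files plus a java.nio WatchService for newly created ones:
{code}
import java.io.IOException;
import java.nio.file.*;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch only: remember which output files have already been
// found so repeated fetches don't rescan the directory, and use a
// WatchService to pick up newly created files.
public class ScannedFileCache {
  private final Set<Path> knownFiles = ConcurrentHashMap.newKeySet();
  private final WatchService watcher;

  public ScannedFileCache(Path outputDir) throws IOException {
    this.watcher = outputDir.getFileSystem().newWatchService();
    outputDir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);
  }

  public boolean isKnown(Path file) {
    return knownFiles.contains(file);
  }

  public void recordScan(Path file) {
    knownFiles.add(file);
  }

  // Drain creation events and add the new files to the cache.
  public void poll(Path outputDir) {
    WatchKey key;
    while ((key = watcher.poll()) != null) {
      for (WatchEvent<?> event : key.pollEvents()) {
        if (event.kind() == StandardWatchEventKinds.ENTRY_CREATE) {
          knownFiles.add(outputDir.resolve((Path) event.context()));
        }
      }
      key.reset();
    }
  }
}
{code}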

 LLAP: Improvements to the Shuffle handler to avoid unnecessary disk scans
 -

 Key: HIVE-9912
 URL: https://issues.apache.org/jira/browse/HIVE-9912
 Project: Hive
  Issue Type: Sub-task
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Fix For: llap






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HIVE-9912) LLAP: Improvements to the Shuffle handler to avoid unnecessary disk scans

2015-03-19 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth resolved HIVE-9912.
--
Resolution: Fixed

 LLAP: Improvements to the Shuffle handler to avoid unnecessary disk scans
 -

 Key: HIVE-9912
 URL: https://issues.apache.org/jira/browse/HIVE-9912
 Project: Hive
  Issue Type: Sub-task
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Fix For: llap

 Attachments: HIVE-9912.1.txt






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-10025) LLAP: Queued work times out

2015-03-19 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated HIVE-10025:
--
Issue Type: Sub-task  (was: Improvement)
Parent: HIVE-7926

 LLAP: Queued work times out
 ---

 Key: HIVE-10025
 URL: https://issues.apache.org/jira/browse/HIVE-10025
 Project: Hive
  Issue Type: Sub-task
Reporter: Siddharth Seth

 If a daemon holds a task in its queue for a long time, the task will eventually 
 time out - but it isn't removed from the queue. Ideally, it shouldn't be allowed 
 to time out at all. Otherwise, handle the timeout so that the task doesn't run - 
 or starts and fails immediately - which likely means a change in the TaskCommunicator.
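 A minimal sketch of the second option (dropping expired work at dequeue time); QueuedWork and the queue structure here are hypothetical stand-ins, not the LLAP daemon's actual classes:
 {code}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Illustrative sketch: work that has already timed out while queued is
// skipped instead of being started.
public class ExpiringWorkQueue {
  public static final class QueuedWork {
    final Runnable task;
    final long deadlineMillis;
    QueuedWork(Runnable task, long timeoutMillis) {
      this.task = task;
      this.deadlineMillis = System.currentTimeMillis() + timeoutMillis;
    }
  }

  private final BlockingQueue<QueuedWork> queue = new LinkedBlockingQueue<>();

  public void submit(Runnable task, long timeoutMillis) {
    queue.add(new QueuedWork(task, timeoutMillis));
  }

  // Returns the next piece of work that has not expired, or null if nothing
  // arrived within the poll interval.
  public Runnable takeNextRunnable() throws InterruptedException {
    while (true) {
      QueuedWork work = queue.poll(100, TimeUnit.MILLISECONDS);
      if (work == null) {
        return null;
      }
      if (System.currentTimeMillis() < work.deadlineMillis) {
        return work.task;
      }
      // Expired while queued: skip it so it neither runs nor starts-and-fails.
    }
  }
}
 {code}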



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9912) LLAP: Improvements to the Shuffle handler to avoid unnecessary disk scans

2015-03-19 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated HIVE-9912:
-
Attachment: (was: HIVE-9912.1.txt)

 LLAP: Improvements to the Shuffle handler to avoid unnecessary disk scans
 -

 Key: HIVE-9912
 URL: https://issues.apache.org/jira/browse/HIVE-9912
 Project: Hive
  Issue Type: Sub-task
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Fix For: llap






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9912) LLAP: Improvements to the Shuffle handler to avoid unnecessary disk scans

2015-03-19 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated HIVE-9912:
-
Attachment: HIVE-9912.1.txt

 LLAP: Improvements to the Shuffle handler to avoid unnecessary disk scans
 -

 Key: HIVE-9912
 URL: https://issues.apache.org/jira/browse/HIVE-9912
 Project: Hive
  Issue Type: Sub-task
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Fix For: llap

 Attachments: HIVE-9912.1.txt






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-10026) LLAP: AM should get notifications on daemons going down or restarting

2015-03-19 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated HIVE-10026:
--
Fix Version/s: llap

 LLAP: AM should get notifications on daemons going down or restarting
 -

 Key: HIVE-10026
 URL: https://issues.apache.org/jira/browse/HIVE-10026
 Project: Hive
  Issue Type: Sub-task
Reporter: Siddharth Seth
 Fix For: llap


 There's lost state otherwise, which can cause queries to hang.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-10025) LLAP: Queued work times out

2015-03-19 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated HIVE-10025:
--
Fix Version/s: llap

 LLAP: Queued work times out
 ---

 Key: HIVE-10025
 URL: https://issues.apache.org/jira/browse/HIVE-10025
 Project: Hive
  Issue Type: Sub-task
Reporter: Siddharth Seth
 Fix For: llap


 If a daemon holds a task in its queue for a long time, the task will eventually 
 time out - but it isn't removed from the queue. Ideally, it shouldn't be allowed 
 to time out at all. Otherwise, handle the timeout so that the task doesn't run - 
 or starts and fails immediately - which likely means a change in the TaskCommunicator.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9807) LLAP: Add event logging for execution elements

2015-03-19 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14370599#comment-14370599
 ] 

Siddharth Seth commented on HIVE-9807:
--

This doesn't need any additional user documentation. It's meant for consumption 
by tools.

 LLAP: Add event logging for execution elements
 --

 Key: HIVE-9807
 URL: https://issues.apache.org/jira/browse/HIVE-9807
 Project: Hive
  Issue Type: Sub-task
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Fix For: llap

 Attachments: HIVE-9807.1.patch, HIVE-9807.2.patch, llap-executors.png


 For analysis of runtimes, submit/start delays, interleaving etc.
 !llap-executors.png!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9808) LLAP: Push work into daemons instead of the current pull

2015-03-09 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated HIVE-9808:
-
Attachment: HIVE-9808.2.txt

Rebased patch. Will commit shortly; this one was painful to rebase.
There are some UGI / closeAllForFileSystem changes which will need to be worked 
on in a follow-up.

 LLAP: Push work into daemons instead of the current pull
 

 Key: HIVE-9808
 URL: https://issues.apache.org/jira/browse/HIVE-9808
 Project: Hive
  Issue Type: Sub-task
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Fix For: llap

 Attachments: HIVE-9808.1.txt, HIVE-9808.2.txt






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HIVE-9808) LLAP: Push work into daemons instead of the current pull

2015-03-09 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth resolved HIVE-9808.
--
Resolution: Fixed

 LLAP: Push work into daemons instead of the current pull
 

 Key: HIVE-9808
 URL: https://issues.apache.org/jira/browse/HIVE-9808
 Project: Hive
  Issue Type: Sub-task
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Fix For: llap

 Attachments: HIVE-9808.1.txt, HIVE-9808.2.txt






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9775) LLAP: Add a MiniLLAPCluster for tests

2015-03-09 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated HIVE-9775:
-
Attachment: HIVE-9775.2.patch

Rebased patch.

 LLAP: Add a MiniLLAPCluster for tests
 -

 Key: HIVE-9775
 URL: https://issues.apache.org/jira/browse/HIVE-9775
 Project: Hive
  Issue Type: Sub-task
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Fix For: llap

 Attachments: HIVE-9775.1.patch, HIVE-9775.2.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HIVE-9910) LLAP: Update usage of APIs changed by TEZ-2175 and TEZ-2187

2015-03-10 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth resolved HIVE-9910.
--
Resolution: Fixed

Already committed.

 LLAP: Update usage of APIs changed by TEZ-2175 and TEZ-2187
 ---

 Key: HIVE-9910
 URL: https://issues.apache.org/jira/browse/HIVE-9910
 Project: Hive
  Issue Type: Sub-task
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Fix For: llap

 Attachments: HIVE-9910.1.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9910) LLAP: Update usage of APIs changed by TEZ-2175 and TEZ-2187

2015-03-10 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated HIVE-9910:
-
Attachment: HIVE-9910.1.patch

Trivial patch.

 LLAP: Update usage of APIs changed by TEZ-2175 and TEZ-2187
 ---

 Key: HIVE-9910
 URL: https://issues.apache.org/jira/browse/HIVE-9910
 Project: Hive
  Issue Type: Sub-task
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Fix For: llap

 Attachments: HIVE-9910.1.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9891) LLAP: disable plan caching

2015-03-08 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352337#comment-14352337
 ] 

Siddharth Seth commented on HIVE-9891:
--

It would be nice if the plan were immutable - I'm guessing that's a big change and 
is an item for later.
Caching the plan and cloning it for each execution, rather than deserializing 
it each time, may be another option.
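A minimal sketch of the cache-and-clone idea, with the per-execution copy produced by an in-memory serialization round-trip purely as a stand-in for whatever cloning mechanism is used - this is not how Hive actually serializes plans:
{code}
import java.io.*;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Illustrative sketch: keep one cached copy of the plan and hand each
// concurrent execution its own deep copy, so executions never share
// mutable state.
public class PlanCache<T extends Serializable> {
  private final ConcurrentMap<String, byte[]> serializedPlans = new ConcurrentHashMap<>();

  public void put(String queryId, T plan) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
      oos.writeObject(plan);
    }
    serializedPlans.put(queryId, bos.toByteArray());
  }

  // Each caller gets its own copy from the in-memory bytes, avoiding a
  // re-read and re-parse of the plan from storage for every execution.
  @SuppressWarnings("unchecked")
  public T cloneFor(String queryId) throws IOException, ClassNotFoundException {
    byte[] bytes = serializedPlans.get(queryId);
    try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
      return (T) ois.readObject();
    }
  }
}
{code}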

 LLAP: disable plan caching
 --

 Key: HIVE-9891
 URL: https://issues.apache.org/jira/browse/HIVE-9891
 Project: Hive
  Issue Type: Sub-task
Reporter: Gunther Hagleitner
Assignee: Gunther Hagleitner
 Attachments: HIVE-9891.1.patch


 Can't share the same plan objects in LLAP as they are used concurrently.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9976) Possible race condition in DynamicPartitionPruner for 200ms tasks

2015-03-24 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated HIVE-9976:
-
Affects Version/s: (was: 1.1.0)
   1.0.0
Fix Version/s: (was: 1.1.1)
   (was: 1.2.0)
   1.0.1
 Assignee: Siddharth Seth  (was: Gunther Hagleitner)

This is not limited to LLAP. Assigning to myself - to change the handling of 
vertex success / init events.

 Possible race condition in DynamicPartitionPruner for 200ms tasks
 --

 Key: HIVE-9976
 URL: https://issues.apache.org/jira/browse/HIVE-9976
 Project: Hive
  Issue Type: Sub-task
  Components: Tez
Affects Versions: 1.0.0
Reporter: Gopal V
Assignee: Siddharth Seth
 Fix For: 1.0.1

 Attachments: llap_vertex_200ms.png


 Race condition in the DynamicPartitionPruner between 
 DynamicPartitionPruner::processVertex() and 
 DynamicPartitionPruner::addEvent() for tasks which respond with both the 
 result and success in a single heartbeat sequence.
 {code}
 2015-03-16 07:05:01,589 ERROR [InputInitializer [Map 1] #0] 
 tez.DynamicPartitionPruner: Expecting: 1, received: 0
 2015-03-16 07:05:01,590 ERROR [Dispatcher thread: Central] impl.VertexImpl: 
 Vertex Input: store_sales initializer failed, 
 vertex=vertex_1424502260528_1113_4_04 [Map 1]
 org.apache.tez.dag.app.dag.impl.AMUserCodeException: 
 org.apache.hadoop.hive.ql.metadata.HiveException: Incorrect event count in 
 dynamic parition pruning
 {code}
 !llap_vertex_200ms.png!
 All 4 upstream vertices of Map 1 need to finish within ~200ms to trigger 
 this, which seems to be consistently happening with LLAP.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9976) Possible race condition in DynamicPartitionPruner for 200ms tasks

2015-03-24 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated HIVE-9976:
-
Issue Type: Bug  (was: Sub-task)
Parent: (was: HIVE-7926)

 Possible race condition in DynamicPartitionPruner for 200ms tasks
 --

 Key: HIVE-9976
 URL: https://issues.apache.org/jira/browse/HIVE-9976
 Project: Hive
  Issue Type: Bug
  Components: Tez
Affects Versions: 1.0.0
Reporter: Gopal V
Assignee: Siddharth Seth
 Fix For: 1.0.1

 Attachments: llap_vertex_200ms.png


 Race condition in the DynamicPartitionPruner between 
 DynamicPartitionPruner::processVertex() and 
 DynamicPartitionPruner::addEvent() for tasks which respond with both the 
 result and success in a single heartbeat sequence.
 {code}
 2015-03-16 07:05:01,589 ERROR [InputInitializer [Map 1] #0] 
 tez.DynamicPartitionPruner: Expecting: 1, received: 0
 2015-03-16 07:05:01,590 ERROR [Dispatcher thread: Central] impl.VertexImpl: 
 Vertex Input: store_sales initializer failed, 
 vertex=vertex_1424502260528_1113_4_04 [Map 1]
 org.apache.tez.dag.app.dag.impl.AMUserCodeException: 
 org.apache.hadoop.hive.ql.metadata.HiveException: Incorrect event count in 
 dynamic parition pruning
 {code}
 !llap_vertex_200ms.png!
 All 4 upstream vertices of Map 1 need to finish within ~200ms to trigger 
 this, which seems to be consistently happening with LLAP.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9976) Possible race condition in DynamicPartitionPruner for 200ms tasks

2015-03-24 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated HIVE-9976:
-
Fix Version/s: 1.1.1
   1.2.0

 Possible race condition in DynamicPartitionPruner for 200ms tasks
 --

 Key: HIVE-9976
 URL: https://issues.apache.org/jira/browse/HIVE-9976
 Project: Hive
  Issue Type: Sub-task
  Components: Tez
Affects Versions: 1.1.0
Reporter: Gopal V
Assignee: Gunther Hagleitner
 Fix For: 1.2.0, 1.1.1

 Attachments: llap_vertex_200ms.png


 Race condition in the DynamicPartitionPruner between 
 DynamicPartitionPruner::processVertex() and 
 DynamicPartitionPruner::addEvent() for tasks which respond with both the 
 result and success in a single heartbeat sequence.
 {code}
 2015-03-16 07:05:01,589 ERROR [InputInitializer [Map 1] #0] 
 tez.DynamicPartitionPruner: Expecting: 1, received: 0
 2015-03-16 07:05:01,590 ERROR [Dispatcher thread: Central] impl.VertexImpl: 
 Vertex Input: store_sales initializer failed, 
 vertex=vertex_1424502260528_1113_4_04 [Map 1]
 org.apache.tez.dag.app.dag.impl.AMUserCodeException: 
 org.apache.hadoop.hive.ql.metadata.HiveException: Incorrect event count in 
 dynamic parition pruning
 {code}
 !llap_vertex_200ms.png!
 All 4 upstream vertices of Map 1 need to finish within ~200ms to trigger 
 this, which seems to be consistently happening with LLAP.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9976) Possible race condition in DynamicPartitionPruner for 200ms tasks

2015-03-24 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated HIVE-9976:
-
Summary: Possible race condition in DynamicPartitionPruner for 200ms tasks 
 (was: LLAP: Possible race condition in DynamicPartitionPruner for 200ms tasks)

 Possible race condition in DynamicPartitionPruner for 200ms tasks
 --

 Key: HIVE-9976
 URL: https://issues.apache.org/jira/browse/HIVE-9976
 Project: Hive
  Issue Type: Sub-task
  Components: Tez
Affects Versions: 1.1.0
Reporter: Gopal V
Assignee: Gunther Hagleitner
 Fix For: 1.2.0, 1.1.1

 Attachments: llap_vertex_200ms.png


 Race condition in the DynamicPartitionPruner between 
 DynamicPartitionPruner::processVertex() and 
 DynamicPartitionPruner::addEvent() for tasks which respond with both the 
 result and success in a single heartbeat sequence.
 {code}
 2015-03-16 07:05:01,589 ERROR [InputInitializer [Map 1] #0] 
 tez.DynamicPartitionPruner: Expecting: 1, received: 0
 2015-03-16 07:05:01,590 ERROR [Dispatcher thread: Central] impl.VertexImpl: 
 Vertex Input: store_sales initializer failed, 
 vertex=vertex_1424502260528_1113_4_04 [Map 1]
 org.apache.tez.dag.app.dag.impl.AMUserCodeException: 
 org.apache.hadoop.hive.ql.metadata.HiveException: Incorrect event count in 
 dynamic parition pruning
 {code}
 !llap_vertex_200ms.png!
 All 4 upstream vertices of Map 1 need to finish within ~200ms to trigger 
 this, which seems to be consistently happening with LLAP.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9976) Possible race condition in DynamicPartitionPruner for 200ms tasks

2015-03-24 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated HIVE-9976:
-
Affects Version/s: (was: llap)
   1.1.0

 Possible race condition in DynamicPartitionPruner for 200ms tasks
 --

 Key: HIVE-9976
 URL: https://issues.apache.org/jira/browse/HIVE-9976
 Project: Hive
  Issue Type: Sub-task
  Components: Tez
Affects Versions: 1.1.0
Reporter: Gopal V
Assignee: Gunther Hagleitner
 Fix For: 1.2.0, 1.1.1

 Attachments: llap_vertex_200ms.png


 Race condition in the DynamicPartitionPruner between 
 DynamicPartitionPruner::processVertex() and 
 DynamicPartitionPruner::addEvent() for tasks which respond with both the 
 result and success in a single heartbeat sequence.
 {code}
 2015-03-16 07:05:01,589 ERROR [InputInitializer [Map 1] #0] 
 tez.DynamicPartitionPruner: Expecting: 1, received: 0
 2015-03-16 07:05:01,590 ERROR [Dispatcher thread: Central] impl.VertexImpl: 
 Vertex Input: store_sales initializer failed, 
 vertex=vertex_1424502260528_1113_4_04 [Map 1]
 org.apache.tez.dag.app.dag.impl.AMUserCodeException: 
 org.apache.hadoop.hive.ql.metadata.HiveException: Incorrect event count in 
 dynamic parition pruning
 {code}
 !llap_vertex_200ms.png!
 All 4 upstream vertices of Map 1 need to finish within ~200ms to trigger 
 this, which seems to be consistently happening with LLAP.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9976) Possible race condition in DynamicPartitionPruner for 200ms tasks

2015-03-24 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated HIVE-9976:
-
Attachment: HIVE-9976.1.patch

Patch to handle out-of-order events. Also initializes the pruner during Input 
construction, so that events don't show up before the pruner is initialized. 
Adds a bunch of tests.

[~hagleitn], [~vikram.dixit] - please review.

 Possible race condition in DynamicPartitionPruner for 200ms tasks
 --

 Key: HIVE-9976
 URL: https://issues.apache.org/jira/browse/HIVE-9976
 Project: Hive
  Issue Type: Bug
  Components: Tez
Affects Versions: 1.0.0
Reporter: Gopal V
Assignee: Siddharth Seth
 Fix For: 1.0.1

 Attachments: HIVE-9976.1.patch, llap_vertex_200ms.png


 Race condition in the DynamicPartitionPruner between 
 DynamicPartitionPruner::processVertex() and 
 DynamicPartitionPruner::addEvent() for tasks which respond with both the 
 result and success in a single heartbeat sequence.
 {code}
 2015-03-16 07:05:01,589 ERROR [InputInitializer [Map 1] #0] 
 tez.DynamicPartitionPruner: Expecting: 1, received: 0
 2015-03-16 07:05:01,590 ERROR [Dispatcher thread: Central] impl.VertexImpl: 
 Vertex Input: store_sales initializer failed, 
 vertex=vertex_1424502260528_1113_4_04 [Map 1]
 org.apache.tez.dag.app.dag.impl.AMUserCodeException: 
 org.apache.hadoop.hive.ql.metadata.HiveException: Incorrect event count in 
 dynamic parition pruning
 {code}
 !llap_vertex_200ms.png!
 All 4 upstream vertices of Map 1 need to finish within ~200ms to trigger 
 this, which seems to be consistently happening with LLAP.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9976) Possible race condition in DynamicPartitionPruner for 200ms tasks

2015-03-25 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated HIVE-9976:
-
Attachment: HIVE-9976.2.patch

Thanks for the review. Updated patch with comments addressed, and some more 
changes.

bq. Not your fault - but there are 2 paths through HiveSplitGenerator.
Moved the methods into SplitGrouper. There's a static cache in there which 
seems a little strange. Will create a follow up jira to investigate this. For 
now I've changed that to a ConcurrentMap since split generation can run in 
parallel.

bq. i see you've fixed calling close consistently on the data input stream. 
maybe use try{}finally there?
Fixed. There was a bug with some of the other conditions which I'd changed. 
Fixed that as well.

bq. it seems you're setting numexpectedevents to 0 first and then turn around 
and call decrement. Why not just set to -1? Also - why atomic integers? as far 
as i can tell all access to these maps is synchronized.
numExpectedEvents is decremented for each column for which a source will send 
events. That's used to track the total number of expected events from that source. 
Added a comment for this.
Moved from AtomicIntegers to MutableInt - this was just to avoid re-inserting 
the Integer into the map, and not for thread safety.

bq. does it make sense to make initialize in the pruner private now? (can't be 
used to init anymore - only from the constr). Also, the parameters aren't used 
anymore, right?
Done, along with some other methods.
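To illustrate the MutableInt point above - a minimal sketch of a mutable counter kept as a map value and mutated under the map's lock; the class and method names here are made up, not the patch's code:
{code}
import java.util.HashMap;
import java.util.Map;
import org.apache.commons.lang3.mutable.MutableInt;

// Illustrative sketch: the map value is a mutable counter that is updated in
// place under the same lock that guards the map, so no boxed Integer has to
// be re-inserted and no atomicity beyond the enclosing synchronization is needed.
public class ExpectedEventCounter {
  private final Map<String, MutableInt> numExpectedEvents = new HashMap<>();

  // One source vertex may feed several partitioned columns; each column adds
  // to the number of events expected from that source.
  public synchronized void addExpected(String sourceVertex, int eventsForColumn) {
    numExpectedEvents.computeIfAbsent(sourceVertex, k -> new MutableInt(0))
        .add(eventsForColumn);
  }

  public synchronized void eventReceived(String sourceVertex) {
    numExpectedEvents.get(sourceVertex).decrement();
  }

  public synchronized boolean allEventsReceived(String sourceVertex) {
    return numExpectedEvents.get(sourceVertex).intValue() <= 0;
  }
}
{code}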


 Possible race condition in DynamicPartitionPruner for 200ms tasks
 --

 Key: HIVE-9976
 URL: https://issues.apache.org/jira/browse/HIVE-9976
 Project: Hive
  Issue Type: Bug
  Components: Tez
Affects Versions: 1.0.0
Reporter: Gopal V
Assignee: Siddharth Seth
 Attachments: HIVE-9976.1.patch, HIVE-9976.2.patch, 
 llap_vertex_200ms.png


 Race condition in the DynamicPartitionPruner between 
 DynamicPartitionPruner::processVertex() and 
 DynamicPartitionPruner::addEvent() for tasks which respond with both the 
 result and success in a single heartbeat sequence.
 {code}
 2015-03-16 07:05:01,589 ERROR [InputInitializer [Map 1] #0] 
 tez.DynamicPartitionPruner: Expecting: 1, received: 0
 2015-03-16 07:05:01,590 ERROR [Dispatcher thread: Central] impl.VertexImpl: 
 Vertex Input: store_sales initializer failed, 
 vertex=vertex_1424502260528_1113_4_04 [Map 1]
 org.apache.tez.dag.app.dag.impl.AMUserCodeException: 
 org.apache.hadoop.hive.ql.metadata.HiveException: Incorrect event count in 
 dynamic parition pruning
 {code}
 !llap_vertex_200ms.png!
 All 4 upstream vertices of Map 1 need to finish within ~200ms to trigger 
 this, which seems to be consistently happening with LLAP.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HIVE-10104) LLAP: Generate consistent splits and locations for the same split across jobs

2015-03-30 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth resolved HIVE-10104.
---
Resolution: Fixed

 LLAP: Generate consistent splits and locations for the same split across jobs
 -

 Key: HIVE-10104
 URL: https://issues.apache.org/jira/browse/HIVE-10104
 Project: Hive
  Issue Type: Sub-task
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Fix For: llap

 Attachments: HIVE-10104.1.txt, HIVE-10104.2.txt


 Locations for splits are currently randomized. Also, the order of splits is 
 random - depending on how threads end up generating the splits.
 Add an option to sort the splits, and generate repeatable locations - 
 assuming all other factors are the same.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-10106) Regression : Dynamic partition pruning not working after HIVE-9976

2015-03-27 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated HIVE-10106:
--
Attachment: HIVE-10106.1.patch

I believe this is caused by the way mapWork is set up. It's now created in the 
constructor of HiveSplitGenerator. The constructor and the initialize method 
may not be invoked on the same thread. As a result, the initialize method ends 
up seeing a different copy of mapWork from the one modified in the pruner.
Attaching a patch to fix this - by setting the mapWork in the initialize method.
[~hagleitn] - please review, and validate the theory.
[~mmokhtar] - I wasn't able to reproduce this. Seeing pruning work as it should 
for the simple query that you'd sent me offline. May need help reproducing the 
issue and validating the patch. Thanks.
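A rough sketch of the fix described above - the names are stand-ins, not the actual HiveSplitGenerator code:
{code}
// Illustrative sketch: state that other threads mutate or read is materialized
// in initialize(), which runs on the thread that consumes it, rather than in
// the constructor, which may run on a different thread.
public abstract class SplitGeneratorSketch {
  private Object mapWork;   // stand-in for the MapWork plan fragment

  public SplitGeneratorSketch() {
    // Before: mapWork = lookupPlan();  -- the constructor thread's copy may not
    // be the one the pruner later updates and the initializer later reads.
  }

  public void initialize() {
    // After: look up the plan on the initializing thread, so the pruner and
    // split generation see the same object.
    this.mapWork = lookupPlan();
    generateSplits(mapWork);
  }

  protected abstract Object lookupPlan();
  protected abstract void generateSplits(Object mapWork);
}
{code}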

 Regression : Dynamic partition pruning not working after HIVE-9976
 --

 Key: HIVE-10106
 URL: https://issues.apache.org/jira/browse/HIVE-10106
 Project: Hive
  Issue Type: Bug
  Components: Hive
Affects Versions: 1.2.0
Reporter: Mostafa Mokhtar
Assignee: Siddharth Seth
 Fix For: 1.2.0

 Attachments: HIVE-10106.1.patch


 After HIVE-9976 got checked in dynamic partition pruning doesn't work.
 Partitions are pruned and later show up in splits.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9775) LLAP: Add a MiniLLAPCluster for tests

2015-03-02 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated HIVE-9775:
-
Attachment: HIVE-9775.1.patch

Patch to add a MiniLLAPCluster. This isn't wired into the tests and shims just 
yet - that needs some more work with circular dependencies and such. Will 
figure that out in a separate jira.
Applies on top of HIVE-9808.

 LLAP: Add a MiniLLAPCluster for tests
 -

 Key: HIVE-9775
 URL: https://issues.apache.org/jira/browse/HIVE-9775
 Project: Hive
  Issue Type: Sub-task
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Fix For: llap

 Attachments: HIVE-9775.1.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9807) LLAP: Add event logging for execution elements

2015-02-26 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated HIVE-9807:
-
Attachment: HIVE-9807.1.patch

Sample log lines - additional details will be populated in a later patch, when 
available.
{code}
Event=FRAGMENT_START, HostName=hw10890, 
ApplicationId=application_1425008147866_0006, 
ContainerId=container_1_0006_01_01, DagName=null, VertexName=null, 
TaskId=-1, TaskAttemptId=-1, SubmitTime=1425020533780
Event=FRAGMENT_END, HostName=hw10890, 
ApplicationId=application_1425008147866_0006, 
ContainerId=container_1_0006_01_01, DagName=null, VertexName=null, 
TaskId=-1, TaskAttemptId=-1, Succeeded=true, StartTime=1425020533779, 
EndTime=1425020535678
{code}

cc [~gopalv]

 LLAP: Add event logging for execution elements
 --

 Key: HIVE-9807
 URL: https://issues.apache.org/jira/browse/HIVE-9807
 Project: Hive
  Issue Type: Sub-task
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Attachments: HIVE-9807.1.patch


 For analysis of runtimes, interleaving etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9808) LLAP: Push work into daemons instead of the current pull

2015-02-27 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14340878#comment-14340878
 ] 

Siddharth Seth commented on HIVE-9808:
--

Applies on top of HIVE-9807.

 LLAP: Push work into daemons instead of the current pull
 

 Key: HIVE-9808
 URL: https://issues.apache.org/jira/browse/HIVE-9808
 Project: Hive
  Issue Type: Sub-task
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Fix For: llap

 Attachments: HIVE-9808.1.txt






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-10113) LLAP: reducers running in LLAP starve out map retries

2015-03-27 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-10113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14384247#comment-14384247
 ] 

Siddharth Seth commented on HIVE-10113:
---

Related: https://issues.apache.org/jira/browse/HIVE-10029

This is expected at the moment, until we support pre-empting tasks / removing 
tasks from queues.

 LLAP: reducers running in LLAP starve out map retries
 -

 Key: HIVE-10113
 URL: https://issues.apache.org/jira/browse/HIVE-10113
 Project: Hive
  Issue Type: Sub-task
Reporter: Sergey Shelukhin
Assignee: Gunther Hagleitner

 When query 17 is run, some mappers from Map 1 currently fail (due to an unwrap 
 issue, and also due to HIVE-10112).
 This query has 1000+ reducers; if they are run in LLAP, they all queue up, 
 and the query locks up.
 If only the mappers run in LLAP, the query completes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-10117) LLAP: Use task number, attempt number to cache plans

2015-03-27 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated HIVE-10117:
--
Fix Version/s: llap

 LLAP: Use task number, attempt number to cache plans
 

 Key: HIVE-10117
 URL: https://issues.apache.org/jira/browse/HIVE-10117
 Project: Hive
  Issue Type: Sub-task
Reporter: Siddharth Seth
 Fix For: llap


 Cache plans keyed on the task number and attempt number, instead of relying on 
 thread locals only. This can be used to share the work between Inputs / 
 Processor / Outputs in Tez.
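 As a sketch of the idea - the key fields and class name are assumptions, not LLAP classes:
 {code}
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Illustrative sketch: key a shared-work cache on the fragment identity rather
// than on the current thread.
public class PlanCacheByAttempt<V> {
  public static final class Key {
    final String vertexName;
    final int taskIndex;
    final int attemptNumber;
    public Key(String vertexName, int taskIndex, int attemptNumber) {
      this.vertexName = vertexName;
      this.taskIndex = taskIndex;
      this.attemptNumber = attemptNumber;
    }
    @Override public boolean equals(Object o) {
      if (!(o instanceof Key)) return false;
      Key k = (Key) o;
      return taskIndex == k.taskIndex && attemptNumber == k.attemptNumber
          && vertexName.equals(k.vertexName);
    }
    @Override public int hashCode() {
      return Objects.hash(vertexName, taskIndex, attemptNumber);
    }
  }

  private final ConcurrentMap<Key, V> cache = new ConcurrentHashMap<>();

  // Inputs, the Processor and Outputs of the same attempt all resolve to the
  // same entry, regardless of which pool thread happens to call this.
  public V computeIfAbsent(Key key, java.util.function.Function<Key, V> loader) {
    return cache.computeIfAbsent(key, loader);
  }
}
 {code}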



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-10104) LLAP: Generate consistent splits and locations for the same split across jobs

2015-03-26 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated HIVE-10104:
--
Attachment: HIVE-10104.1.txt

Patch to order the original splits by size and name.
Location is based on a hash of the filename and start position.

[~hagleitn] - could you please take a quick look as a sanity check?

Will commit after I'm able to test it a bit on a cluster larger than 1 node.
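Roughly the approach, as a sketch - SplitInfo and the host list are stand-ins, not the actual Hive split types:
{code}
import java.util.Comparator;
import java.util.List;

// Illustrative sketch: a deterministic split order (by size, then path/start
// for ties) and a repeatable location pick based on a hash of the file name
// and start offset.
public class ConsistentSplits {
  public static final class SplitInfo {
    final String path;
    final long start;
    final long length;
    public SplitInfo(String path, long start, long length) {
      this.path = path; this.start = start; this.length = length;
    }
  }

  public static void sortDeterministically(List<SplitInfo> splits) {
    splits.sort(Comparator
        .comparingLong((SplitInfo s) -> s.length).reversed()
        .thenComparing(s -> s.path)
        .thenComparingLong(s -> s.start));
  }

  // Same split + same host list => same location, across job submissions.
  public static String pickLocation(SplitInfo split, List<String> hosts) {
    int hash = (split.path + ":" + split.start).hashCode();
    int index = Math.floorMod(hash, hosts.size());
    return hosts.get(index);
  }
}
{code}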

 LLAP: Generate consistent splits and locations for the same split across jobs
 -

 Key: HIVE-10104
 URL: https://issues.apache.org/jira/browse/HIVE-10104
 Project: Hive
  Issue Type: Sub-task
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Fix For: llap

 Attachments: HIVE-10104.1.txt


 Locations for splits are currently randomized. Also, the order of splits is 
 random - depending on how threads end up generating the splits.
 Add an option to sort the splits, and generate repeatable locations - 
 assuming all other factors are the same.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-10104) LLAP: Generate consistent splits and locations for the same split across jobs

2015-03-26 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated HIVE-10104:
--
Attachment: HIVE-10104.2.txt

Updated patch with the sort removed from the scheduler. Tested on a multi-node 
cluster. Will commit after the next rebase of the LLAP branch.

 LLAP: Generate consistent splits and locations for the same split across jobs
 -

 Key: HIVE-10104
 URL: https://issues.apache.org/jira/browse/HIVE-10104
 Project: Hive
  Issue Type: Sub-task
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Fix For: llap

 Attachments: HIVE-10104.1.txt, HIVE-10104.2.txt


 Locations for splits are currently randomized. Also, the order of splits is 
 random - depending on how threads end up generating the splits.
 Add an option to sort the splits, and generate repeatable locations - 
 assuming all other factors are the same.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HIVE-9913) LLAP: Avoid fetching data multiple times in case of broadcast

2015-04-20 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth resolved HIVE-9913.
--
Resolution: Fixed

 LLAP: Avoid fetching data multiple times in case of broadcast
 -

 Key: HIVE-9913
 URL: https://issues.apache.org/jira/browse/HIVE-9913
 Project: Hive
  Issue Type: Sub-task
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Fix For: llap

 Attachments: HIVE-9913.1.txt






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9913) LLAP: Avoid fetching data multiple times in case of broadcast

2015-04-20 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated HIVE-9913:
-
Attachment: HIVE-9913.1.txt

Patch delays the start until the Input is actually used, for Unordered cases 
(broadcast and non-broadcast for now) - which is soon after the Processor starts 
running.

 LLAP: Avoid fetching data multiple times in case of broadcast
 -

 Key: HIVE-9913
 URL: https://issues.apache.org/jira/browse/HIVE-9913
 Project: Hive
  Issue Type: Sub-task
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Fix For: llap

 Attachments: HIVE-9913.1.txt






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-10408) LLAP: NPE in scheduler in case of rejected tasks

2015-04-21 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated HIVE-10408:
--
Summary: LLAP: NPE in scheduler in case of rejected tasks  (was: LLAP: 
query fails - NPE (old exception I posted was bogus))

 LLAP: NPE in scheduler in case of rejected tasks
 

 Key: HIVE-10408
 URL: https://issues.apache.org/jira/browse/HIVE-10408
 Project: Hive
  Issue Type: Sub-task
Reporter: Sergey Shelukhin
Assignee: Siddharth Seth
 Fix For: llap

 Attachments: HIVE-10408.1.txt


 {noformat}
 java.lang.NullPointerException
 at 
 org.apache.tez.dag.app.rm.LlapTaskSchedulerService.deallocateTask(LlapTaskSchedulerService.java:388)
 at 
 org.apache.tez.dag.app.rm.TaskSchedulerEventHandler.handleTASucceeded(TaskSchedulerEventHandler.java:339)
 at 
 org.apache.tez.dag.app.rm.TaskSchedulerEventHandler.handleEvent(TaskSchedulerEventHandler.java:224)
 at 
 org.apache.tez.dag.app.rm.TaskSchedulerEventHandler$1.run(TaskSchedulerEventHandler.java:493)
 {noformat}
 The query, running alone on a 10-node cluster, dumped 1000 mappers into the 
 running state; with 3 completed, it failed with that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-10475) LLAP: Minor fixes after tez api enhancements for dag completion

2015-04-23 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated HIVE-10475:
--
Attachment: HIVE-10475.1.txt

 LLAP: Minor fixes after tez api enhancements for dag completion
 ---

 Key: HIVE-10475
 URL: https://issues.apache.org/jira/browse/HIVE-10475
 Project: Hive
  Issue Type: Sub-task
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Fix For: llap

 Attachments: HIVE-10475.1.txt


 TEZ-2212 and TEZ-2361  add APIs to propagate dag completion information to 
 the TaskCommunicator plugin. This jira is for minor fixes to get the llap 
 branch to compile against these changes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-10405) LLAP: Provide runtime information to daemons to decide on preemption order

2015-04-20 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated HIVE-10405:
--
Attachment: HIVE-10405.1.txt

The following information is sent into daemons at fragment submission time (see the sketch after the list):
- start time of the dag
- start time of the first attempt of a specific fragment
- The priority of a fragment within an executing dag - determined by the 
topological order in the DAG (this is irrelevant across DAGs)
- number of tasks in the current vertex + upstream to the current vertex
- number of completed tasks in the current vertex + upstream to the current 
vertex.

 LLAP: Provide runtime information to daemons to decide on preemption order
 --

 Key: HIVE-10405
 URL: https://issues.apache.org/jira/browse/HIVE-10405
 Project: Hive
  Issue Type: Sub-task
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Fix For: llap

 Attachments: HIVE-10405.1.txt






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-10424) LLAP: Factor known capacity into scheduling decisions

2015-04-22 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated HIVE-10424:
--
Attachment: HIVE-10424.1.txt

Patch to factor in the running queue + wait queue capacity per node.
Also moves all scheduling onto a single thread - requests go onto a queue and 
are taken off whenever a node becomes available or has capacity.

Can run with the old 'unlimited' capacity by setting 
llap.task.scheduler.num.schedulable.tasks.per.node to -1
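As a rough sketch of the single-threaded, capacity-aware loop - NodeInfo and TaskRequest are stand-ins, not the actual scheduler classes:
{code}
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative sketch: pending requests wait on a queue and are handed out only
// when some node reports spare capacity (running + wait-queue slots).
public class CapacityAwareScheduler implements Runnable {
  public interface NodeInfo {
    boolean hasCapacity();            // running + queued < configured limit
    void assign(Runnable task);
  }

  public static final class TaskRequest {
    final Runnable task;
    public TaskRequest(Runnable task) { this.task = task; }
  }

  private final LinkedBlockingQueue<TaskRequest> pending = new LinkedBlockingQueue<>();
  private final List<NodeInfo> nodes;

  public CapacityAwareScheduler(List<NodeInfo> nodes) { this.nodes = nodes; }

  public void submit(TaskRequest request) { pending.add(request); }

  @Override public void run() {
    try {
      while (!Thread.currentThread().isInterrupted()) {
        TaskRequest request = pending.take();      // single scheduling thread
        NodeInfo target = null;
        while (target == null) {                   // wait for capacity to free up
          for (NodeInfo node : nodes) {
            if (node.hasCapacity()) { target = node; break; }
          }
          if (target == null) Thread.sleep(10);    // real code would use notifications
        }
        target.assign(request.task);
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}
{code}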

 LLAP: Factor known capacity into scheduling decisions
 -

 Key: HIVE-10424
 URL: https://issues.apache.org/jira/browse/HIVE-10424
 Project: Hive
  Issue Type: Sub-task
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Fix For: llap

 Attachments: HIVE-10424.1.txt






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HIVE-10424) LLAP: Factor known capacity into scheduling decisions

2015-04-22 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth resolved HIVE-10424.
---
Resolution: Fixed

 LLAP: Factor known capacity into scheduling decisions
 -

 Key: HIVE-10424
 URL: https://issues.apache.org/jira/browse/HIVE-10424
 Project: Hive
  Issue Type: Sub-task
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Fix For: llap

 Attachments: HIVE-10424.1.txt






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-10394) LLAP: Notify AM of pre-emption

2015-04-20 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-10394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14503373#comment-14503373
 ] 

Siddharth Seth commented on HIVE-10394:
---

The information isn't actually being sent across to the AM. What's handled 
right now is a response to the submitWork request. However, once a request 
moves onto the scheduler queue for execution at a later point, an RPC 
invocation will be required to inform the AM about the status of the task. This 
would be an addition to LlapTaskUmbilicalProtocol.

 LLAP: Notify AM of pre-emption
 --

 Key: HIVE-10394
 URL: https://issues.apache.org/jira/browse/HIVE-10394
 Project: Hive
  Issue Type: Sub-task
Affects Versions: llap
Reporter: Prasanth Jayachandran
Assignee: Prasanth Jayachandran
 Attachments: HIVE-10394.1.patch


 Pre-empted tasks should be notified to AM as killed/interrupted by system.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-10480) LLAP: Tez task is interrupted for unknown reason after an IPC exception and then fails to report completion

2015-04-24 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14511738#comment-14511738
 ] 

Siddharth Seth commented on HIVE-10480:
---

This is what is happening here.

The heartbeat being sent out for a task (from the LLAP daemon) to the AM is 
corrupt - TEZ-2367.
This causes an error to be reported and an Interrupt on the task.
The NPE and the 'Ignoring exception' message can be ignored - they're caused by 
the task being unregistered, as Prasanth pointed out. They're not the root cause 
of the failure, and logging them always causes confusion. The log line has 
already been pruned in Tez (yesterday).

Since the daemon considers the task to be dead - it won't send another 
heartbeat to the AM.
The AM has no idea that the task is dead - since the last heartbeat was 
corrupt. The regular timeout mechanism kicks in, and the task is considered 
dead after 5 minutes (the default timeout). 
A new attempt of the same task is set up and runs to completion.

 LLAP: Tez task is interrupted for unknown reason after an IPC exception and 
 then fails to report completion
 ---

 Key: HIVE-10480
 URL: https://issues.apache.org/jira/browse/HIVE-10480
 Project: Hive
  Issue Type: Sub-task
Reporter: Sergey Shelukhin

 No idea if this is LLAP bug, Tez bug, Hadoop IPC bug (due to patch on the 
 cluster), or all 3.
 So for now I will just dump all I have here.
 TPCH Q1 started running for a long time for me on large number of runs today 
 (didn't happen yesterday). It would always be one Map task timing out.
  Example attempt (logs from am):
 {noformat}
 2015-04-24 11:11:01,073 INFO [TaskCommunicator # 0] 
 tezplugins.LlapTaskCommunicator: Successfully launched task: 
 attempt_1429683757595_0321_9_00_000928_0
 2015-04-24 11:16:25,498 INFO [Dispatcher thread: Central] 
 history.HistoryEventHandler: 
 [HISTORY][DAG:dag_1429683757595_0321_9][Event:TASK_ATTEMPT_FINISHED]: 
 vertexName=Map 1, taskAttemptId=attempt_1429683757595_0321_9_00_000928_0, 
 startTime=1429899061071, finishTime=1429899385498, timeTaken=324427, 
 status=FAILED, errorEnum=TASK_HEARTBEAT_ERROR, 
 diagnostics=AttemptID:attempt_1429683757595_0321_9_00_000928_0 Timed out 
 after 300 secs, counters=Counters: 1, 
 org.apache.tez.common.counters.DAGCounter, RACK_LOCAL_TASKS=1
 {noformat}
 No other lines for this attempt in between.
 However there's this:
 {noformat}
 2015-04-24 11:11:01,074 WARN [Socket Reader #1 for port 59446] ipc.Server: 
 Unable to read call parameters for client 172.19.128.56on connection protocol 
 org.apache.hadoop.hive.llap.protocol.LlapTaskUmbilicalProtocol for rpcKind 
 RPC_WRITABLE
 java.lang.ArrayIndexOutOfBoundsException
 2015-04-24 11:11:01,075 INFO [Socket Reader #1 for port 59446] ipc.Server: 
 Socket Reader #1 for port 59446: readAndProcess from client 172.19.128.56 
 threw exception [org.apache.hadoop.ipc.RpcServerException: IPC server unable 
 to read call parameters: null]
 {noformat}
 On LLAP, the following is logged 
 {noformat}
 2015-04-24 11:11:01,142 [TaskHeartbeatThread()] ERROR 
 org.apache.tez.runtime.task.TezTaskRunner: TaskReporter reported error
 org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RpcServerException):
  IPC server unable to read call parameters: null
 at org.apache.hadoop.ipc.Client.call(Client.java:1492)
 at org.apache.hadoop.ipc.Client.call(Client.java:1423)
 at 
 org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:242)
 at com.sun.proxy.$Proxy19.heartbeat(Unknown Source)
 at 
 org.apache.hadoop.hive.llap.daemon.impl.LlapTaskReporter$HeartbeatCallable.heartbeat(LlapTaskReporter.java:258)
 at 
 org.apache.hadoop.hive.llap.daemon.impl.LlapTaskReporter$HeartbeatCallable.call(LlapTaskReporter.java:186)
 at 
 org.apache.hadoop.hive.llap.daemon.impl.LlapTaskReporter$HeartbeatCallable.call(LlapTaskReporter.java:128)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
 {noformat}
 The attempt starts but is then interrupted (not clear by whom)
 {noformat}
 2015-04-24 11:11:01,144 [Initializer 
 0(container_1_0321_01_008943_sershe_20150424110948_86ce1f6f-7cd2-4a40-b9a6-4a6854f010f6:9_Map
  1_928_0)] INFO org.apache.tez.runtime.LogicalIOProcessorRuntimeTask: 
 Initialized Input with src edge: lineitem
 2015-04-24 11:11:01,145 
 [TezTaskRunner_attempt_1429683757595_0321_9_00_000928_0(container_1_0321_01_008943_sershe_20150424110948_86ce1f6f-7cd2-4a40-b9a6-4a6854f010f6:9_Map
  1_928_0)] INFO 

[jira] [Resolved] (HIVE-10475) LLAP: Minor fixes after tez api enhancements for dag completion

2015-04-23 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth resolved HIVE-10475.
---
Resolution: Fixed

 LLAP: Minor fixes after tez api enhancements for dag completion
 ---

 Key: HIVE-10475
 URL: https://issues.apache.org/jira/browse/HIVE-10475
 Project: Hive
  Issue Type: Sub-task
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Fix For: llap

 Attachments: HIVE-10475.1.txt


 TEZ-2212 and TEZ-2361  add APIs to propagate dag completion information to 
 the TaskCommunicator plugin. This jira is for minor fixes to get the llap 
 branch to compile against these changes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (HIVE-10480) LLAP: Tez task is interrupted for unknown reason after an IPC exception and then fails to report completion

2015-04-29 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth reassigned HIVE-10480:
-

Assignee: Siddharth Seth

 LLAP: Tez task is interrupted for unknown reason after an IPC exception and 
 then fails to report completion
 ---

 Key: HIVE-10480
 URL: https://issues.apache.org/jira/browse/HIVE-10480
 Project: Hive
  Issue Type: Sub-task
Reporter: Sergey Shelukhin
Assignee: Siddharth Seth
 Attachments: HIVE-10480.1.txt


 No idea if this is LLAP bug, Tez bug, Hadoop IPC bug (due to patch on the 
 cluster), or all 3.
 So for now I will just dump all I have here.
 TPCH Q1 started running for a long time for me on large number of runs today 
 (didn't happen yesterday). It would always be one Map task timing out.
  Example attempt (logs from am):
 {noformat}
 2015-04-24 11:11:01,073 INFO [TaskCommunicator # 0] 
 tezplugins.LlapTaskCommunicator: Successfully launched task: 
 attempt_1429683757595_0321_9_00_000928_0
 2015-04-24 11:16:25,498 INFO [Dispatcher thread: Central] 
 history.HistoryEventHandler: 
 [HISTORY][DAG:dag_1429683757595_0321_9][Event:TASK_ATTEMPT_FINISHED]: 
 vertexName=Map 1, taskAttemptId=attempt_1429683757595_0321_9_00_000928_0, 
 startTime=1429899061071, finishTime=1429899385498, timeTaken=324427, 
 status=FAILED, errorEnum=TASK_HEARTBEAT_ERROR, 
 diagnostics=AttemptID:attempt_1429683757595_0321_9_00_000928_0 Timed out 
 after 300 secs, counters=Counters: 1, 
 org.apache.tez.common.counters.DAGCounter, RACK_LOCAL_TASKS=1
 {noformat}
 No other lines for this attempt in between.
 However there's this:
 {noformat}
 2015-04-24 11:11:01,074 WARN [Socket Reader #1 for port 59446] ipc.Server: 
 Unable to read call parameters for client 172.19.128.56on connection protocol 
 org.apache.hadoop.hive.llap.protocol.LlapTaskUmbilicalProtocol for rpcKind 
 RPC_WRITABLE
 java.lang.ArrayIndexOutOfBoundsException
 2015-04-24 11:11:01,075 INFO [Socket Reader #1 for port 59446] ipc.Server: 
 Socket Reader #1 for port 59446: readAndProcess from client 172.19.128.56 
 threw exception [org.apache.hadoop.ipc.RpcServerException: IPC server unable 
 to read call parameters: null]
 {noformat}
 On LLAP, the following is logged 
 {noformat}
 2015-04-24 11:11:01,142 [TaskHeartbeatThread()] ERROR 
 org.apache.tez.runtime.task.TezTaskRunner: TaskReporter reported error
 org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RpcServerException):
  IPC server unable to read call parameters: null
 at org.apache.hadoop.ipc.Client.call(Client.java:1492)
 at org.apache.hadoop.ipc.Client.call(Client.java:1423)
 at 
 org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:242)
 at com.sun.proxy.$Proxy19.heartbeat(Unknown Source)
 at 
 org.apache.hadoop.hive.llap.daemon.impl.LlapTaskReporter$HeartbeatCallable.heartbeat(LlapTaskReporter.java:258)
 at 
 org.apache.hadoop.hive.llap.daemon.impl.LlapTaskReporter$HeartbeatCallable.call(LlapTaskReporter.java:186)
 at 
 org.apache.hadoop.hive.llap.daemon.impl.LlapTaskReporter$HeartbeatCallable.call(LlapTaskReporter.java:128)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
 {noformat}
 The attempt starts but is then interrupted (not clear by whom)
 {noformat}
 2015-04-24 11:11:01,144 [Initializer 
 0(container_1_0321_01_008943_sershe_20150424110948_86ce1f6f-7cd2-4a40-b9a6-4a6854f010f6:9_Map
  1_928_0)] INFO org.apache.tez.runtime.LogicalIOProcessorRuntimeTask: 
 Initialized Input with src edge: lineitem
 2015-04-24 11:11:01,145 
 [TezTaskRunner_attempt_1429683757595_0321_9_00_000928_0(container_1_0321_01_008943_sershe_20150424110948_86ce1f6f-7cd2-4a40-b9a6-4a6854f010f6:9_Map
  1_928_0)] INFO org.apache.tez.runtime.task.TezTaskRunner: Encounted an error 
 while executing task: attempt_1429683757595_0321_9_00_000928_0
 java.lang.InterruptedException
 at 
 java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1220)
 at 
 java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335)
 at 
 java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:439)
 at 
 java.util.concurrent.ExecutorCompletionService.take(ExecutorCompletionService.java:193)
 at 
 org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.initialize(LogicalIOProcessorRuntimeTask.java:218)
 at 
 

[jira] [Commented] (HIVE-10482) LLAP: AssertionError cannot allocate when reading from orc

2015-04-30 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-10482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14522282#comment-14522282
 ] 

Siddharth Seth commented on HIVE-10482:
---

The default. 1 GB I believe.

 LLAP: AssertionError cannot allocate when reading from orc
 -

 Key: HIVE-10482
 URL: https://issues.apache.org/jira/browse/HIVE-10482
 Project: Hive
  Issue Type: Sub-task
Reporter: Siddharth Seth
Assignee: Sergey Shelukhin
 Fix For: llap


 This was from a run of tpch query 1. [~sershe] - not sure if you've already 
 seen this. Creating a jira so that it doesn't get lost.
 {code}
 2015-04-24 13:11:54,180 
 [TezTaskRunner_attempt_1429683757595_0326_4_00_000199_0(container_1_0326_01_003216_sseth_20150424131137_8ec6200c-77c8-43ea-a6a3-a0ab1da6e1ac:4_Map
  1_199_0)] ERROR org.apache.hadoop.hive.ql.exec.tez.TezProcessor: 
 org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: 
 java.io.IOException: java.lang.AssertionError: Cannot allocate
 at 
 org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:74)
 at 
 org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:314)
 at 
 org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:148)
 at 
 org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:137)
 at 
 org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:329)
 at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:180)
 at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
 at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:172)
 at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:168)
 at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
 Caused by: java.io.IOException: java.io.IOException: 
 java.lang.AssertionError: Cannot allocate
 at 
 org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
 at 
 org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
 at 
 org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:355)
 at 
 org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:79)
 at 
 org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:33)
 at 
 org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:116)
 at 
 org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.next(TezGroupedSplitsInputFormat.java:137)
 at 
 org.apache.tez.mapreduce.lib.MRReaderMapred.next(MRReaderMapred.java:113)
 at 
 org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:62)
 ... 16 more
 Caused by: java.io.IOException: java.lang.AssertionError: Cannot allocate
 at 
 org.apache.hadoop.hive.llap.io.api.impl.LlapInputFormat$LlapRecordReader.rethrowErrorIfAny(LlapInputFormat.java:257)
 at 
 org.apache.hadoop.hive.llap.io.api.impl.LlapInputFormat$LlapRecordReader.nextCvb(LlapInputFormat.java:209)
 at 
 org.apache.hadoop.hive.llap.io.api.impl.LlapInputFormat$LlapRecordReader.next(LlapInputFormat.java:147)
 at 
 org.apache.hadoop.hive.llap.io.api.impl.LlapInputFormat$LlapRecordReader.next(LlapInputFormat.java:97)
 at 
 org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:350)
 ... 22 more
 Caused by: java.lang.AssertionError: Cannot allocate
 at 
 org.apache.hadoop.hive.ql.io.orc.InStream.readEncodedStream(InStream.java:761)
 at 
 org.apache.hadoop.hive.ql.io.orc.EncodedReaderImpl.readEncodedColumns(EncodedReaderImpl.java:441)
 at 
 org.apache.hadoop.hive.llap.io.encoded.OrcEncodedDataReader.callInternal(OrcEncodedDataReader.java:294)
   

[jira] [Resolved] (HIVE-10560) LLAP: a different NPE in shuffle

2015-04-30 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth resolved HIVE-10560.
---
   Resolution: Fixed
Fix Version/s: llap

 LLAP: a different NPE in shuffle
 

 Key: HIVE-10560
 URL: https://issues.apache.org/jira/browse/HIVE-10560
 Project: Hive
  Issue Type: Sub-task
Reporter: Sergey Shelukhin
Assignee: Siddharth Seth
 Fix For: llap

 Attachments: HIVE-10560.1.txt


 Lots of those in the Query 1 logs; this was run just now on 8 daemons on a recent version.
 {noformat}
 java.lang.NullPointerException
 at 
 org.apache.hadoop.hive.llap.shufflehandler.ShuffleHandler.unregisterDag(ShuffleHandler.java:437)
 at 
 org.apache.hadoop.hive.llap.daemon.impl.QueryTracker.queryComplete(QueryTracker.java:81)
 at 
 org.apache.hadoop.hive.llap.daemon.impl.ContainerRunnerImpl.queryComplete(ContainerRunnerImpl.java:214)
 at 
 org.apache.hadoop.hive.llap.daemon.impl.LlapDaemon.queryComplete(LlapDaemon.java:271)
 at 
 org.apache.hadoop.hive.llap.daemon.impl.LlapDaemonProtocolServerImpl.queryComplete(LlapDaemonProtocolServerImpl.java:94)
 at 
 org.apache.hadoop.hive.llap.daemon.rpc.LlapDaemonProtocolProtos$LlapDaemonProtocol$2.callBlockingMethod(LlapDaemonProtocolProtos.java:12278)
 at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:972)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2088)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2084)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2082)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-10560) LLAP: a different NPE in shuffle

2015-04-30 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated HIVE-10560:
--
Attachment: HIVE-10560.1.txt

Forgot to add a check for the dirWatcher in case it's disabled. This should fix 
it.
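
For reference, a minimal sketch of the kind of guard involved - the names here are illustrative placeholders, not the actual ShuffleHandler code:
{code}
// Illustrative sketch only: guard the optional directory watcher so a disabled
// watcher no longer causes an NPE when a DAG is unregistered.
public class DirWatcherGuardSketch {
  interface DirWatcher {
    void unregisterDag(String appId, int dagId);
  }

  private final DirWatcher dirWatcher; // null when the watcher is disabled

  DirWatcherGuardSketch(DirWatcher dirWatcher) {
    this.dirWatcher = dirWatcher;
  }

  void unregisterDag(String appId, int dagId) {
    // Only forward to the watcher if it was actually created.
    if (dirWatcher != null) {
      dirWatcher.unregisterDag(appId, dagId);
    }
  }
}
{code}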

 LLAP: a different NPE in shuffle
 

 Key: HIVE-10560
 URL: https://issues.apache.org/jira/browse/HIVE-10560
 Project: Hive
  Issue Type: Sub-task
Reporter: Sergey Shelukhin
Assignee: Siddharth Seth
 Attachments: HIVE-10560.1.txt


 Lots of those in the Query 1 logs; this was run just now on 8 daemons on a recent version.
 {noformat}
 java.lang.NullPointerException
 at 
 org.apache.hadoop.hive.llap.shufflehandler.ShuffleHandler.unregisterDag(ShuffleHandler.java:437)
 at 
 org.apache.hadoop.hive.llap.daemon.impl.QueryTracker.queryComplete(QueryTracker.java:81)
 at 
 org.apache.hadoop.hive.llap.daemon.impl.ContainerRunnerImpl.queryComplete(ContainerRunnerImpl.java:214)
 at 
 org.apache.hadoop.hive.llap.daemon.impl.LlapDaemon.queryComplete(LlapDaemon.java:271)
 at 
 org.apache.hadoop.hive.llap.daemon.impl.LlapDaemonProtocolServerImpl.queryComplete(LlapDaemonProtocolServerImpl.java:94)
 at 
 org.apache.hadoop.hive.llap.daemon.rpc.LlapDaemonProtocolProtos$LlapDaemonProtocol$2.callBlockingMethod(LlapDaemonProtocolProtos.java:12278)
 at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:972)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2088)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2084)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2082)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HIVE-9911) LLAP: Clean up structures and intermediate data when a query completes

2015-04-29 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth resolved HIVE-9911.
--
Resolution: Fixed

Committed to branch.

 LLAP: Clean up structures and intermediate data when a query completes
 --

 Key: HIVE-9911
 URL: https://issues.apache.org/jira/browse/HIVE-9911
 Project: Hive
  Issue Type: Sub-task
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Fix For: llap

 Attachments: HIVE-9911.1.txt






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-10480) LLAP: Tez task is interrupted for unknown reason after an IPC exception and then fails to report completion

2015-04-29 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated HIVE-10480:
--
Attachment: HIVE-10480.1.txt

TEZ-2367 fixes this in a way which requires a minor update to the llap 
reporter. Uploading a patch for this.

 LLAP: Tez task is interrupted for unknown reason after an IPC exception and 
 then fails to report completion
 ---

 Key: HIVE-10480
 URL: https://issues.apache.org/jira/browse/HIVE-10480
 Project: Hive
  Issue Type: Sub-task
Reporter: Sergey Shelukhin
 Attachments: HIVE-10480.1.txt


 No idea if this is an LLAP bug, a Tez bug, a Hadoop IPC bug (due to a patch on 
 the cluster), or all three.
 So for now I will just dump all I have here.
 TPCH Q1 started taking a long time for me on a large number of runs today 
 (it didn't happen yesterday). It would always be one Map task timing out.
  Example attempt (logs from am):
 {noformat}
 2015-04-24 11:11:01,073 INFO [TaskCommunicator # 0] 
 tezplugins.LlapTaskCommunicator: Successfully launched task: 
 attempt_1429683757595_0321_9_00_000928_0
 2015-04-24 11:16:25,498 INFO [Dispatcher thread: Central] 
 history.HistoryEventHandler: 
 [HISTORY][DAG:dag_1429683757595_0321_9][Event:TASK_ATTEMPT_FINISHED]: 
 vertexName=Map 1, taskAttemptId=attempt_1429683757595_0321_9_00_000928_0, 
 startTime=1429899061071, finishTime=1429899385498, timeTaken=324427, 
 status=FAILED, errorEnum=TASK_HEARTBEAT_ERROR, 
 diagnostics=AttemptID:attempt_1429683757595_0321_9_00_000928_0 Timed out 
 after 300 secs, counters=Counters: 1, 
 org.apache.tez.common.counters.DAGCounter, RACK_LOCAL_TASKS=1
 {noformat}
 No other lines for this attempt in between.
 However there's this:
 {noformat}
 2015-04-24 11:11:01,074 WARN [Socket Reader #1 for port 59446] ipc.Server: 
 Unable to read call parameters for client 172.19.128.56on connection protocol 
 org.apache.hadoop.hive.llap.protocol.LlapTaskUmbilicalProtocol for rpcKind 
 RPC_WRITABLE
 java.lang.ArrayIndexOutOfBoundsException
 2015-04-24 11:11:01,075 INFO [Socket Reader #1 for port 59446] ipc.Server: 
 Socket Reader #1 for port 59446: readAndProcess from client 172.19.128.56 
 threw exception [org.apache.hadoop.ipc.RpcServerException: IPC server unable 
 to read call parameters: null]
 {noformat}
 On LLAP, the following is logged 
 {noformat}
 2015-04-24 11:11:01,142 [TaskHeartbeatThread()] ERROR 
 org.apache.tez.runtime.task.TezTaskRunner: TaskReporter reported error
 org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RpcServerException):
  IPC server unable to read call parameters: null
 at org.apache.hadoop.ipc.Client.call(Client.java:1492)
 at org.apache.hadoop.ipc.Client.call(Client.java:1423)
 at 
 org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:242)
 at com.sun.proxy.$Proxy19.heartbeat(Unknown Source)
 at 
 org.apache.hadoop.hive.llap.daemon.impl.LlapTaskReporter$HeartbeatCallable.heartbeat(LlapTaskReporter.java:258)
 at 
 org.apache.hadoop.hive.llap.daemon.impl.LlapTaskReporter$HeartbeatCallable.call(LlapTaskReporter.java:186)
 at 
 org.apache.hadoop.hive.llap.daemon.impl.LlapTaskReporter$HeartbeatCallable.call(LlapTaskReporter.java:128)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
 {noformat}
 The attempt starts but is then interrupted (not clear by whom)
 {noformat}
 2015-04-24 11:11:01,144 [Initializer 
 0(container_1_0321_01_008943_sershe_20150424110948_86ce1f6f-7cd2-4a40-b9a6-4a6854f010f6:9_Map
  1_928_0)] INFO org.apache.tez.runtime.LogicalIOProcessorRuntimeTask: 
 Initialized Input with src edge: lineitem
 2015-04-24 11:11:01,145 
 [TezTaskRunner_attempt_1429683757595_0321_9_00_000928_0(container_1_0321_01_008943_sershe_20150424110948_86ce1f6f-7cd2-4a40-b9a6-4a6854f010f6:9_Map
  1_928_0)] INFO org.apache.tez.runtime.task.TezTaskRunner: Encounted an error 
 while executing task: attempt_1429683757595_0321_9_00_000928_0
 java.lang.InterruptedException
 at 
 java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1220)
 at 
 java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335)
 at 
 java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:439)
 at 
 java.util.concurrent.ExecutorCompletionService.take(ExecutorCompletionService.java:193)
 at 
 

[jira] [Updated] (HIVE-9911) LLAP: Clean up structures and intermediate data when a query completes

2015-04-29 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated HIVE-9911:
-
Attachment: HIVE-9911.1.txt

The patch changes intermediate data to be written to a DAG-specific directory, 
which gets cleaned up when the DAG completes.
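
Roughly, the layout is the following sketch (illustrative names and paths only, assuming a per-application base directory; the attached patch is the real implementation):
{code}
// Sketch only: scope intermediate output under a per-DAG directory and remove
// it when the DAG completes.
import java.io.File;
import java.io.IOException;
import org.apache.commons.io.FileUtils;

public class DagScratchDirSketch {
  // e.g. <base>/<appId>/dag_<dagId>; every fragment of the DAG writes under this.
  static File dagDir(File baseDir, String appId, int dagId) {
    return new File(new File(baseDir, appId), "dag_" + dagId);
  }

  // Called on DAG completion: deletes all of the DAG's intermediate data at once.
  static void cleanupDag(File baseDir, String appId, int dagId) throws IOException {
    FileUtils.deleteDirectory(dagDir(baseDir, appId, dagId));
  }
}
{code}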

 LLAP: Clean up structures and intermediate data when a query completes
 --

 Key: HIVE-9911
 URL: https://issues.apache.org/jira/browse/HIVE-9911
 Project: Hive
  Issue Type: Sub-task
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Fix For: llap

 Attachments: HIVE-9911.1.txt






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (HIVE-9911) LLAP: Clean up structures and intermediate data when a query completes

2015-04-29 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth reassigned HIVE-9911:


Assignee: Siddharth Seth

 LLAP: Clean up structures and intermediate data when a query completes
 --

 Key: HIVE-9911
 URL: https://issues.apache.org/jira/browse/HIVE-9911
 Project: Hive
  Issue Type: Sub-task
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Fix For: llap






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-10682) LLAP: Make use of the task runner which allows killing tasks

2015-05-12 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated HIVE-10682:
--
Attachment: HIVE-10682.1.txt

 LLAP: Make use of the task runner which allows killing tasks
 

 Key: HIVE-10682
 URL: https://issues.apache.org/jira/browse/HIVE-10682
 Project: Hive
  Issue Type: Sub-task
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Fix For: llap

 Attachments: HIVE-10682.1.txt


 TEZ-2434 adds a runner which allows tasks to be killed. Jira to integrate 
 with that without the actual kill functionality. That will follow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Reopened] (HIVE-10682) LLAP: Make use of the task runner which allows killing tasks

2015-05-12 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth reopened HIVE-10682:
---

 LLAP: Make use of the task runner which allows killing tasks
 

 Key: HIVE-10682
 URL: https://issues.apache.org/jira/browse/HIVE-10682
 Project: Hive
  Issue Type: Sub-task
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Fix For: llap

 Attachments: HIVE-10682.1.txt


 TEZ-2434 adds a runner which allows tasks to be killed. Jira to integrate 
 with that without the actual kill functionality. That will follow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HIVE-10682) LLAP: Make use of the task runner which allows killing tasks

2015-05-12 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth resolved HIVE-10682.
---
Resolution: Fixed

 LLAP: Make use of the task runner which allows killing tasks
 

 Key: HIVE-10682
 URL: https://issues.apache.org/jira/browse/HIVE-10682
 Project: Hive
  Issue Type: Sub-task
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Fix For: llap

 Attachments: HIVE-10682.1.txt


 TEZ-2434 adds a runner which allows tasks to be killed. Jira to integrate 
 with that without the actual kill functionality. That will follow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HIVE-10682) LLAP: Make use of the task runner which allows killing tasks

2015-05-12 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth resolved HIVE-10682.
---
Resolution: Pending Closed

 LLAP: Make use of the task runner which allows killing tasks
 

 Key: HIVE-10682
 URL: https://issues.apache.org/jira/browse/HIVE-10682
 Project: Hive
  Issue Type: Sub-task
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Fix For: llap

 Attachments: HIVE-10682.1.txt


 TEZ-2434 adds a runner which allows tasks to be killed. Jira to integrate 
 with that without the actual kill functionality. That will follow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-10700) LLAP: Log additional debug information in the scheduler

2015-05-13 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated HIVE-10700:
--
Attachment: HIVE-10700.1.txt

 LLAP: Log additional debug information in the scheduler
 ---

 Key: HIVE-10700
 URL: https://issues.apache.org/jira/browse/HIVE-10700
 Project: Hive
  Issue Type: Sub-task
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Fix For: llap

 Attachments: HIVE-10700.1.txt


 Temporarily, while we're debugging issues. Changing to the DEBUG log level is 
 too verbose.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HIVE-10700) LLAP: Log additional debug information in the scheduler

2015-05-13 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth resolved HIVE-10700.
---
Resolution: Fixed

 LLAP: Log additional debug information in the scheduler
 ---

 Key: HIVE-10700
 URL: https://issues.apache.org/jira/browse/HIVE-10700
 Project: Hive
  Issue Type: Sub-task
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Fix For: llap

 Attachments: HIVE-10700.1.txt


 Temporarily, while we're debugging issues. Changing to the DEBUG log level is 
 too verbose.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HIVE-10652) LLAP: AM task communication retry is too long

2015-05-12 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth resolved HIVE-10652.
---
   Resolution: Fixed
Fix Version/s: llap

 LLAP: AM task communication retry is too long
 -

 Key: HIVE-10652
 URL: https://issues.apache.org/jira/browse/HIVE-10652
 Project: Hive
  Issue Type: Sub-task
Reporter: Sergey Shelukhin
Assignee: Siddharth Seth
 Fix For: llap

 Attachments: HIVE-10652.1.txt


 Mentioned by [~sseth] while discussing HIVE-10648. 45sec (or whatever) is a 
 bit too long.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HIVE-10649) LLAP: AM gets stuck completely if one node is dead

2015-05-12 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth resolved HIVE-10649.
---
Resolution: Duplicate

 LLAP: AM gets stuck completely if one node is dead
 --

 Key: HIVE-10649
 URL: https://issues.apache.org/jira/browse/HIVE-10649
 Project: Hive
  Issue Type: Sub-task
Reporter: Sergey Shelukhin
Assignee: Siddharth Seth

 See HIVE-10648.
 When the AM cannot connect to a node, that appears to cause it to stall; in the 
 example log there are no other interleaving logs even though this is happening 
 in the middle of Map 1 on TPCH q1, i.e. there are plenty of tasks scheduled.
 From the Assigning messages I can also see that tasks are scheduled to all the 
 nodes before and after the pause, not just to the problematic node. 
 The LLAP daemons have corresponding gaps where, between two fragments, nothing 
 is run for a long time on any daemon.
 {noformat}
 2015-05-07 12:13:46,679 INFO [Dispatcher thread: Central] impl.TaskImpl: 
 task_1429683757595_0784_1_00_000276 Task Transitioned from SCHEDULED to 
 RUNNING due to event T_ATTEMPT_LAUNCHED
 2015-05-07 12:13:46,811 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
 connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
 Already tried 10 time(s); retry policy is 
 RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
 MILLISECONDS)
 2015-05-07 12:13:46,955 INFO [LlapSchedulerNodeEnabler] 
 impl.LlapYarnRegistryImpl: Starting to refresh ServiceInstanceSet 1611673583
 2015-05-07 12:13:47,811 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
 connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
 Already tried 11 time(s); retry policy is 
 RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
 MILLISECONDS)
 2015-05-07 12:13:48,812 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
 connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
 Already tried 12 time(s); retry policy is 
 RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
 MILLISECONDS)
 2015-05-07 12:13:49,813 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
 connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
 Already tried 13 time(s); retry policy is 
 RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
 MILLISECONDS)
 2015-05-07 12:13:50,813 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
 connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
 Already tried 14 time(s); retry policy is 
 RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
 MILLISECONDS)
 2015-05-07 12:13:51,814 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
 connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
 Already tried 15 time(s); retry policy is 
 RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
 MILLISECONDS)
 2015-05-07 12:13:52,814 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
 connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
 Already tried 16 time(s); retry policy is 
 RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
 MILLISECONDS)
 2015-05-07 12:13:53,815 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
 connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
 Already tried 17 time(s); retry policy is 
 RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
 MILLISECONDS)
 2015-05-07 12:13:54,816 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
 connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
 Already tried 18 time(s); retry policy is 
 RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
 MILLISECONDS)
 2015-05-07 12:13:55,816 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
 connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
 Already tried 19 time(s); retry policy is 
 RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
 MILLISECONDS)
 2015-05-07 12:13:56,817 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
 connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
 Already tried 20 time(s); retry policy is 
 RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
 MILLISECONDS)
 2015-05-07 12:13:56,971 INFO [LlapSchedulerNodeEnabler] 
 impl.LlapYarnRegistryImpl: Starting to refresh ServiceInstanceSet 1611673583
 2015-05-07 12:13:57,817 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
 connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
 Already tried 21 time(s); retry policy is 
 RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
 MILLISECONDS)
 2015-05-07 12:13:58,818 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
 connect to server: 

[jira] [Updated] (HIVE-10683) LLAP: Add a mechanism for daemons to inform the AM about killed tasks

2015-05-12 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated HIVE-10683:
--
Attachment: HIVE-10683.1.txt

 LLAP: Add a mechanism for daemons to inform the AM about killed tasks
 -

 Key: HIVE-10683
 URL: https://issues.apache.org/jira/browse/HIVE-10683
 Project: Hive
  Issue Type: Sub-task
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Fix For: llap

 Attachments: HIVE-10683.1.txt






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-10652) LLAP: AM task communication retry is too long

2015-05-12 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated HIVE-10652:
--
Attachment: HIVE-10652.1.addendum.txt

Addendum patch to remove some unused imports.

 LLAP: AM task communication retry is too long
 -

 Key: HIVE-10652
 URL: https://issues.apache.org/jira/browse/HIVE-10652
 Project: Hive
  Issue Type: Sub-task
Reporter: Sergey Shelukhin
Assignee: Siddharth Seth
 Fix For: llap

 Attachments: HIVE-10652.1.addendum.txt, HIVE-10652.1.txt


 Mentioned by [~sseth] while discussing HIVE-10648. 45sec (or whatever) is a 
 bit too long.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HIVE-10683) LLAP: Add a mechanism for daemons to inform the AM about killed tasks

2015-05-12 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth resolved HIVE-10683.
---
Resolution: Fixed

 LLAP: Add a mechanism for daemons to inform the AM about killed tasks
 -

 Key: HIVE-10683
 URL: https://issues.apache.org/jira/browse/HIVE-10683
 Project: Hive
  Issue Type: Sub-task
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Fix For: llap

 Attachments: HIVE-10683.1.txt






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-10652) LLAP: AM task communication retry is too long

2015-05-12 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated HIVE-10652:
--
Attachment: HIVE-10652.1.txt

Made configurable, and defaults to 16s.
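
As an illustration of the intent only (the config key below is a placeholder, not the property actually added by the patch), a ~16s retry window can be expressed with Hadoop's RetryPolicies, in contrast to the ~50 x 1s policy visible in the stuck-AM logs:
{code}
// Illustrative sketch: cap total retry time at a configurable value (default 16s)
// with a fixed 1s sleep between attempts.
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.retry.RetryPolicies;
import org.apache.hadoop.io.retry.RetryPolicy;

public class RetryPolicySketch {
  static RetryPolicy buildPolicy(Configuration conf) {
    // Placeholder key name, for illustration only.
    long retryTimeoutMs = conf.getLong("llap.am.connection.retry.timeout.ms", 16000L);
    return RetryPolicies.retryUpToMaximumTimeWithFixedSleep(
        retryTimeoutMs, 1000L, TimeUnit.MILLISECONDS);
  }
}
{code}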

 LLAP: AM task communication retry is too long
 -

 Key: HIVE-10652
 URL: https://issues.apache.org/jira/browse/HIVE-10652
 Project: Hive
  Issue Type: Sub-task
Reporter: Sergey Shelukhin
Assignee: Siddharth Seth
 Attachments: HIVE-10652.1.txt


 Mentioned by [~sseth] while discussing HIVE-10648. 45sec (or whatever) is a 
 bit too long.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-10649) LLAP: AM gets stuck completely if one node is dead

2015-05-12 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated HIVE-10649:
--
Assignee: (was: Siddharth Seth)

 LLAP: AM gets stuck completely if one node is dead
 --

 Key: HIVE-10649
 URL: https://issues.apache.org/jira/browse/HIVE-10649
 Project: Hive
  Issue Type: Sub-task
Reporter: Sergey Shelukhin

 See HIVE-10648.
 When the AM cannot connect to a node, that appears to cause it to stall; in the 
 example log there are no other interleaving logs even though this is happening 
 in the middle of Map 1 on TPCH q1, i.e. there are plenty of tasks scheduled.
 From the Assigning messages I can also see that tasks are scheduled to all the 
 nodes before and after the pause, not just to the problematic node. 
 The LLAP daemons have corresponding gaps where, between two fragments, nothing 
 is run for a long time on any daemon.
 {noformat}
 2015-05-07 12:13:46,679 INFO [Dispatcher thread: Central] impl.TaskImpl: 
 task_1429683757595_0784_1_00_000276 Task Transitioned from SCHEDULED to 
 RUNNING due to event T_ATTEMPT_LAUNCHED
 2015-05-07 12:13:46,811 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
 connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
 Already tried 10 time(s); retry policy is 
 RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
 MILLISECONDS)
 2015-05-07 12:13:46,955 INFO [LlapSchedulerNodeEnabler] 
 impl.LlapYarnRegistryImpl: Starting to refresh ServiceInstanceSet 1611673583
 2015-05-07 12:13:47,811 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
 connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
 Already tried 11 time(s); retry policy is 
 RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
 MILLISECONDS)
 2015-05-07 12:13:48,812 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
 connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
 Already tried 12 time(s); retry policy is 
 RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
 MILLISECONDS)
 2015-05-07 12:13:49,813 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
 connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
 Already tried 13 time(s); retry policy is 
 RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
 MILLISECONDS)
 2015-05-07 12:13:50,813 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
 connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
 Already tried 14 time(s); retry policy is 
 RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
 MILLISECONDS)
 2015-05-07 12:13:51,814 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
 connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
 Already tried 15 time(s); retry policy is 
 RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
 MILLISECONDS)
 2015-05-07 12:13:52,814 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
 connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
 Already tried 16 time(s); retry policy is 
 RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
 MILLISECONDS)
 2015-05-07 12:13:53,815 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
 connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
 Already tried 17 time(s); retry policy is 
 RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
 MILLISECONDS)
 2015-05-07 12:13:54,816 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
 connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
 Already tried 18 time(s); retry policy is 
 RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
 MILLISECONDS)
 2015-05-07 12:13:55,816 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
 connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
 Already tried 19 time(s); retry policy is 
 RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
 MILLISECONDS)
 2015-05-07 12:13:56,817 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
 connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
 Already tried 20 time(s); retry policy is 
 RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
 MILLISECONDS)
 2015-05-07 12:13:56,971 INFO [LlapSchedulerNodeEnabler] 
 impl.LlapYarnRegistryImpl: Starting to refresh ServiceInstanceSet 1611673583
 2015-05-07 12:13:57,817 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
 connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
 Already tried 21 time(s); retry policy is 
 RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
 MILLISECONDS)
 2015-05-07 12:13:58,818 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
 connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
 Already 

[jira] [Resolved] (HIVE-10730) LLAP: fix guava stopwatch conflict

2015-05-15 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth resolved HIVE-10730.
---
Resolution: Fixed

 LLAP: fix guava stopwatch conflict
 --

 Key: HIVE-10730
 URL: https://issues.apache.org/jira/browse/HIVE-10730
 Project: Hive
  Issue Type: Sub-task
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Fix For: llap

 Attachments: HIVE-10730.1.txt






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-10737) LLAP: task scheduler thread-count keeps growing

2015-05-18 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-10737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14549200#comment-14549200
 ] 

Siddharth Seth commented on HIVE-10737:
---

There are other OOMs on the TaskCommunicator threads before this specific OOM. 
The AM is obviously under memory pressure. This could be a result of the ORC 
cache, events being stored (for large jobs), or the uberized fetchers - where data 
is fetched into memory - and the AM size may not have accounted for this.

 LLAP: task scheduler thread-count keeps growing
 ---

 Key: HIVE-10737
 URL: https://issues.apache.org/jira/browse/HIVE-10737
 Project: Hive
  Issue Type: Sub-task
Reporter: Gopal V
Assignee: Siddharth Seth

 LLAP AppMasters die with 
 {code}
 2015-05-17 20:22:44,513 FATAL [Thread-97] yarn.YarnUncaughtExceptionHandler: 
 Thread Thread[Thread-97,5,main] threw an Error.  Shutting down now...
 java.lang.OutOfMemoryError: unable to create new native thread
   at java.lang.Thread.start0(Native Method)
   at java.lang.Thread.start(Thread.java:714)
   at 
 java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:950)
   at 
 java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1357)
   at 
 java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:134)
   at 
 java.util.concurrent.Executors$DelegatedExecutorService.submit(Executors.java:681)
   at 
 org.apache.tez.dag.app.rm.TaskSchedulerAppCallbackWrapper.taskAllocated(TaskSchedulerAppCallbackWrapper.java:60)
   at 
 org.apache.tez.dag.app.rm.LocalTaskSchedulerService$AsyncDelegateRequestHandler.allocateTask(LocalTaskSchedulerService.java:410)
   at 
 org.apache.tez.dag.app.rm.LocalTaskSchedulerService$AsyncDelegateRequestHandler.processRequest(LocalTaskSchedulerService.java:394)
   at 
 org.apache.tez.dag.app.rm.LocalTaskSchedulerService$AsyncDelegateRequestHandler.run(LocalTaskSchedulerService.java:386)
   at java.lang.Thread.run(Thread.java:745)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-10737) LLAP: task scheduler thread-count keeps growing

2015-05-18 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-10737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14547624#comment-14547624
 ] 

Siddharth Seth commented on HIVE-10737:
---

[~gopalv] - do you happen to have a stack trace from the AM before it went OOM?

 LLAP: task scheduler thread-count keeps growing
 ---

 Key: HIVE-10737
 URL: https://issues.apache.org/jira/browse/HIVE-10737
 Project: Hive
  Issue Type: Sub-task
Reporter: Gopal V
Assignee: Siddharth Seth

 LLAP AppMasters die with 
 {code}
 2015-05-17 20:22:44,513 FATAL [Thread-97] yarn.YarnUncaughtExceptionHandler: 
 Thread Thread[Thread-97,5,main] threw an Error.  Shutting down now...
 java.lang.OutOfMemoryError: unable to create new native thread
   at java.lang.Thread.start0(Native Method)
   at java.lang.Thread.start(Thread.java:714)
   at 
 java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:950)
   at 
 java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1357)
   at 
 java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:134)
   at 
 java.util.concurrent.Executors$DelegatedExecutorService.submit(Executors.java:681)
   at 
 org.apache.tez.dag.app.rm.TaskSchedulerAppCallbackWrapper.taskAllocated(TaskSchedulerAppCallbackWrapper.java:60)
   at 
 org.apache.tez.dag.app.rm.LocalTaskSchedulerService$AsyncDelegateRequestHandler.allocateTask(LocalTaskSchedulerService.java:410)
   at 
 org.apache.tez.dag.app.rm.LocalTaskSchedulerService$AsyncDelegateRequestHandler.processRequest(LocalTaskSchedulerService.java:394)
   at 
 org.apache.tez.dag.app.rm.LocalTaskSchedulerService$AsyncDelegateRequestHandler.run(LocalTaskSchedulerService.java:386)
   at java.lang.Thread.run(Thread.java:745)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-10233) Hive on LLAP: Memory manager

2015-04-15 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497034#comment-14497034
 ] 

Siddharth Seth commented on HIVE-10233:
---

Looked at just the Tez Configuration changes.
- Since Hive will be setting the memory explicitly, disabling the Tez scaling 
makes sense. That's done by setting
tez.task.scale.memory.enabled = false 
(TezConfiguration.TEZ_TASK_SCALE_MEMORY_ENABLED).
This needs to be set before creating the AM, and applies to all DAGs running in 
the AM.

- TezRuntimeConfiguration.TEZ_RUNTIME_IO_SORT_MB, 
TezRuntimeConfiguration.TEZ_RUNTIME_UNORDERED_OUTPUT_BUFFER_SIZE_MB - need to 
convert the memory from bytes to MB before setting these properties
- edgeProp.getInputMemoryNeededPercent - this needs to be a fraction (0-1) 
(rather than an actual percentage (0-100)). Not sure what the method gives back 
right now.
- I missed mentioning this in the offline discussion about the properties 
involved: one more property needs to be set for the ordered case 
(TEZ_RUNTIME_INPUT_POST_MERGE_BUFFER_PERCENT). This is a measure of how much 
memory will be used after the merge is complete, to avoid spilling to disk. It 
defaults to 0, but is typically a lower value than the merge memory.
Given that this memory is always reserved for the Input, it can just be set to 
the Input merge memory.

There are explicit APIs which can be used to configure these properties:
{code}
.setValueSerializationClass(TezBytesWritableSerialization.class.getName(), null)
.configureOutput().setSortBufferSize([OUT_SIZE]).done()
.configureInput().setShuffleBufferFraction(IN_FRACTION).setPostMergeBufferFraction(IN_FRACTION).done()
{code}

Similarly for the UnorderedCase.
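
Putting the property-based route together, a rough sketch (the byte values and the input fraction are placeholders for whatever the Hive memory manager computes; this is not the HIVE-10233 patch itself):
{code}
// Sketch only: apply the settings discussed above via plain Configuration keys.
import org.apache.hadoop.conf.Configuration;
import org.apache.tez.dag.api.TezConfiguration;
import org.apache.tez.runtime.library.api.TezRuntimeConfiguration;

public class MemoryConfSketch {
  static void apply(Configuration amConf, Configuration edgeConf,
                    long sortBytes, long unorderedOutBytes, double inputFraction) {
    // Disable Tez's own memory scaling; must be set before the AM is created.
    amConf.setBoolean(TezConfiguration.TEZ_TASK_SCALE_MEMORY_ENABLED, false);

    // These two properties are in MB, so convert from bytes first.
    edgeConf.setInt(TezRuntimeConfiguration.TEZ_RUNTIME_IO_SORT_MB,
        (int) (sortBytes / (1024 * 1024)));
    edgeConf.setInt(TezRuntimeConfiguration.TEZ_RUNTIME_UNORDERED_OUTPUT_BUFFER_SIZE_MB,
        (int) (unorderedOutBytes / (1024 * 1024)));

    // Fractions in the 0-1 range, not percentages.
    edgeConf.setFloat(TezRuntimeConfiguration.TEZ_RUNTIME_SHUFFLE_FETCH_BUFFER_PERCENT,
        (float) inputFraction);
    edgeConf.setFloat(TezRuntimeConfiguration.TEZ_RUNTIME_INPUT_POST_MERGE_BUFFER_PERCENT,
        (float) inputFraction);
  }
}
{code}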





 Hive on LLAP: Memory manager
 

 Key: HIVE-10233
 URL: https://issues.apache.org/jira/browse/HIVE-10233
 Project: Hive
  Issue Type: Bug
  Components: Tez
Affects Versions: llap
Reporter: Vikram Dixit K
Assignee: Vikram Dixit K
 Attachments: HIVE-10233-WIP-2.patch, HIVE-10233-WIP.patch


 We need a memory manager in llap/tez to manage the usage of memory across 
 threads. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-10029) LLAP: Scheduling of work from different queries within the daemon

2015-04-16 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-10029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498716#comment-14498716
 ] 

Siddharth Seth commented on HIVE-10029:
---

Yes, for the most part. We'll likely need a follow-up to provide data to the 
pre-emption queue.

 LLAP: Scheduling of work from different queries within the daemon
 -

 Key: HIVE-10029
 URL: https://issues.apache.org/jira/browse/HIVE-10029
 Project: Hive
  Issue Type: Sub-task
Reporter: Siddharth Seth
 Fix For: llap


 The current implementation is a simple queue - whichever query wins the race 
 to submit work to a daemon will execute first.
 A policy around this may be useful - potentially a fair-share approach, or a 
 first-query-in-gets-all-slots approach.
 Also, the priority associated with work within a query should be considered.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-10335) LLAP: IndexOutOfBound in MapJoinOperator

2015-04-14 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated HIVE-10335:
--
Fix Version/s: llap

 LLAP: IndexOutOfBound in MapJoinOperator
 

 Key: HIVE-10335
 URL: https://issues.apache.org/jira/browse/HIVE-10335
 Project: Hive
  Issue Type: Sub-task
Reporter: Siddharth Seth
 Fix For: llap


 {code}
 2015-04-14 13:57:55,889 
 [TezTaskRunner_attempt_1428572510173_0173_2_03_14_0(container_1_0173_01_66_sseth_20150414135750_7a7c2f4f-5f2d-4645-b833-677621f087bd:2_Map
  1_14_0)] ERROR org.apache.hadoop.hive.ql.exec.MapJoinOperator: Unexpected 
 exception: Index: 0, Size: 0
 java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
 at java.util.ArrayList.rangeCheck(ArrayList.java:653)
 at java.util.ArrayList.get(ArrayList.java:429)
 at 
 org.apache.hadoop.hive.ql.exec.persistence.UnwrapRowContainer.unwrap(UnwrapRowContainer.java:79)
 at 
 org.apache.hadoop.hive.ql.exec.persistence.UnwrapRowContainer.first(UnwrapRowContainer.java:62)
 at 
 org.apache.hadoop.hive.ql.exec.persistence.UnwrapRowContainer.first(UnwrapRowContainer.java:33)
 at 
 org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genAllOneUniqueJoinObject(CommonJoinOperator.java:670)
 at 
 org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:754)
 at 
 org.apache.hadoop.hive.ql.exec.MapJoinOperator.process(MapJoinOperator.java:386)
 at 
 org.apache.hadoop.hive.ql.exec.vector.VectorMapJoinOperator.process(VectorMapJoinOperator.java:283)
 at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:837)
 at 
 org.apache.hadoop.hive.ql.exec.vector.VectorMapJoinOperator.flushOutput(VectorMapJoinOperator.java:232)
 at 
 org.apache.hadoop.hive.ql.exec.vector.VectorMapJoinOperator.closeOp(VectorMapJoinOperator.java:240)
 at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:616)
 at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:630)
 at 
 org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.close(MapRecordProcessor.java:348)
 at 
 org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:162)
 at 
 org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:137)
 at 
 org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:332)
 at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:180)
 at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
 at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:172)
 at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:168)
 at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-10335) LLAP: IndexOutOfBound in MapJoinOperator

2015-04-14 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-10335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14494916#comment-14494916
 ] 

Siddharth Seth commented on HIVE-10335:
---

Also
{code}
org.apache.hadoop.hive.ql.metadata.HiveException: Unexpected exception: 1024
at 
org.apache.hadoop.hive.ql.exec.MapJoinOperator.process(MapJoinOperator.java:398)
at 
org.apache.hadoop.hive.ql.exec.vector.VectorMapJoinOperator.process(VectorMapJoinOperator.java:283)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:837)
at 
org.apache.hadoop.hive.ql.exec.vector.VectorMapJoinOperator.flushOutput(VectorMapJoinOperator.java:232)
at 
org.apache.hadoop.hive.ql.exec.vector.VectorMapJoinOperator.closeOp(VectorMapJoinOperator.java:240)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:616)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:630)
at 
org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.close(MapRecordProcessor.java:348)
at 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:162)
at 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:137)
at 
org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:332)
at 
org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:180)
at 
org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at 
org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:172)
at 
org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:168)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 1024
at 
org.apache.hadoop.hive.ql.exec.vector.VectorColumnAssignFactory$VectorLongColumnAssign.assignLong(VectorColumnAssignFactory.java:116)
at 
org.apache.hadoop.hive.ql.exec.vector.VectorColumnAssignFactory$9.assignObjectValue(VectorColumnAssignFactory.java:296)
at 
org.apache.hadoop.hive.ql.exec.vector.VectorMapJoinOperator.internalForward(VectorMapJoinOperator.java:223)
at 
org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genAllOneUniqueJoinObject(CommonJoinOperator.java:676)
at 
org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:754)
at 
org.apache.hadoop.hive.ql.exec.MapJoinOperator.process(MapJoinOperator.java:386)
... 22 more
org.apache.hadoop.hive.ql.metadata.HiveException: Unexpected exception: null
at 
org.apache.hadoop.hive.ql.exec.MapJoinOperator.process(MapJoinOperator.java:398)
at 
org.apache.hadoop.hive.ql.exec.vector.VectorMapJoinOperator.process(VectorMapJoinOperator.java:283)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:837)
at 
org.apache.hadoop.hive.ql.exec.vector.VectorMapJoinOperator.flushOutput(VectorMapJoinOperator.java:232)
at 
org.apache.hadoop.hive.ql.exec.vector.VectorMapJoinOperator.closeOp(VectorMapJoinOperator.java:240)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:616)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:630)
at 
org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.close(MapRecordProcessor.java:348)
at 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:162)
at 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:137)
at 
org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:332)
at 
org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:180)
at 
org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at 
org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:172)
at 

[jira] [Updated] (HIVE-10229) Set conf and processor context in the constructor instead of init

2015-04-06 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated HIVE-10229:
--
Issue Type: Bug  (was: Sub-task)
Parent: (was: HIVE-7926)

 Set conf and processor context in the constructor instead of init
 -

 Key: HIVE-10229
 URL: https://issues.apache.org/jira/browse/HIVE-10229
 Project: Hive
  Issue Type: Bug
 Environment: 
Reporter: Sergey Shelukhin
Assignee: Siddharth Seth

 Hit this on ctas13 query.
 {noformat}
 Error: Failure while running task:java.lang.NullPointerException
at 
 org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.init(ReduceRecordProcessor.java:98)
at 
 org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:134)
at 
 org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:330)
at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:180)
at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:172)
at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:168)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
 {noformat}
 The line is  cacheKey = queryId + processorContext.getTaskVertexName() + 
 REDUCE_PLAN_KEY;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-10229) Set conf and processor context in the constructor instead of init

2015-04-06 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated HIVE-10229:
--
Attachment: HIVE-10229.1.patch

Fairly simple patch to set jconf and context during construction.

 Set conf and processor context in the constructor instead of init
 -

 Key: HIVE-10229
 URL: https://issues.apache.org/jira/browse/HIVE-10229
 Project: Hive
  Issue Type: Bug
 Environment: 
Reporter: Sergey Shelukhin
Assignee: Siddharth Seth
 Fix For: 1.2.0

 Attachments: HIVE-10229.1.patch


 Hit this on ctas13 query.
 {noformat}
 Error: Failure while running task:java.lang.NullPointerException
at 
 org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.init(ReduceRecordProcessor.java:98)
at 
 org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:134)
at 
 org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:330)
at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:180)
at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:172)
at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:168)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
 {noformat}
 The line is  cacheKey = queryId + processorContext.getTaskVertexName() + 
 REDUCE_PLAN_KEY;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-10229) LLAP: NPE in ReduceRecordProcessor

2015-04-06 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-10229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482118#comment-14482118
 ] 

Siddharth Seth commented on HIVE-10229:
---

Yep. Same issue I saw. ProcessorContext is null.

I'm going to upload a patch for trunk which sets the conf and context in the 
constructor instead of the init method.
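
Roughly, the pattern is the following (illustrative names only, not the actual patch):
{code}
// Sketch: accept the conf and processor context in the constructor instead of
// init(), so anything derived from them (e.g. the cacheKey) never sees a null
// context.
import org.apache.hadoop.mapred.JobConf;
import org.apache.tez.runtime.api.ProcessorContext;

public class RecordProcessorSketch {
  private final JobConf jconf;
  private final ProcessorContext processorContext;
  private final String cacheKey;

  public RecordProcessorSketch(JobConf jconf, ProcessorContext processorContext) {
    this.jconf = jconf;
    this.processorContext = processorContext;
    // Safe: the context is always available by construction time.
    this.cacheKey = jconf.get("hive.query.id") + processorContext.getTaskVertexName();
  }
}
{code}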

 LLAP: NPE in ReduceRecordProcessor
 --

 Key: HIVE-10229
 URL: https://issues.apache.org/jira/browse/HIVE-10229
 Project: Hive
  Issue Type: Sub-task
 Environment: 
Reporter: Sergey Shelukhin
Assignee: Gunther Hagleitner

 Hit this on ctas13 query.
 {noformat}
 Error: Failure while running task:java.lang.NullPointerException
at 
 org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.init(ReduceRecordProcessor.java:98)
at 
 org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:134)
at 
 org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:330)
at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:180)
at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:172)
at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:168)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
 {noformat}
 The line is  cacheKey = queryId + processorContext.getTaskVertexName() + 
 REDUCE_PLAN_KEY;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (HIVE-10229) Set conf and processor context in the constructor instead of init

2015-04-06 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth reassigned HIVE-10229:
-

Assignee: Siddharth Seth  (was: Gunther Hagleitner)

 Set conf and processor context in the constructor instead of init
 -

 Key: HIVE-10229
 URL: https://issues.apache.org/jira/browse/HIVE-10229
 Project: Hive
  Issue Type: Sub-task
 Environment: 
Reporter: Sergey Shelukhin
Assignee: Siddharth Seth

 Hit this on ctas13 query.
 {noformat}
 Error: Failure while running task:java.lang.NullPointerException
at 
 org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.init(ReduceRecordProcessor.java:98)
at 
 org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:134)
at 
 org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:330)
at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:180)
at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:172)
at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:168)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
 {noformat}
 The line is  cacheKey = queryId + processorContext.getTaskVertexName() + 
 REDUCE_PLAN_KEY;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-10229) Set conf and processor context in the constructor instead of init

2015-04-06 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated HIVE-10229:
--
Summary: Set conf and processor context in the constructor instead of init  
(was: LLAP: NPE in ReduceRecordProcessor)

 Set conf and processor context in the constructor instead of init
 -

 Key: HIVE-10229
 URL: https://issues.apache.org/jira/browse/HIVE-10229
 Project: Hive
  Issue Type: Sub-task
 Environment: 
Reporter: Sergey Shelukhin
Assignee: Gunther Hagleitner

 Hit this on ctas13 query.
 {noformat}
 Error: Failure while running task:java.lang.NullPointerException
at 
 org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.init(ReduceRecordProcessor.java:98)
at 
 org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:134)
at 
 org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:330)
at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:180)
at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:172)
at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:168)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
 {noformat}
 The line is  cacheKey = queryId + processorContext.getTaskVertexName() + 
 REDUCE_PLAN_KEY;
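 To illustrate the direction the new summary points at - pass the conf and the 
 processor context through the constructor so they cannot be null by the time 
 the cacheKey line runs - here is a minimal standalone sketch; the class and 
 helper names are made up and are not the actual Hive classes:
 {code}
 // Sketch only -- illustrative names, not the real ReduceRecordProcessor.
 class RecordProcessorSketch {
   private final Object jconf;                    // stands in for JobConf
   private final ContextSketch processorContext;  // stands in for ProcessorContext

   RecordProcessorSketch(Object jconf, ContextSketch processorContext) {
     // Fail fast at construction instead of hitting an NPE later in init().
     if (processorContext == null) {
       throw new IllegalArgumentException("processorContext must not be null");
     }
     this.jconf = jconf;
     this.processorContext = processorContext;
   }

   String buildCacheKey(String queryId) {
     // Equivalent of the reported line; processorContext is guaranteed non-null here.
     return queryId + processorContext.getTaskVertexName() + "__REDUCE_PLAN__";
   }

   interface ContextSketch {
     String getTaskVertexName();
   }
 }
 {code}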



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (HIVE-10025) LLAP: Queued work times out

2015-04-07 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth reassigned HIVE-10025:
-

Assignee: Siddharth Seth

 LLAP: Queued work times out
 ---

 Key: HIVE-10025
 URL: https://issues.apache.org/jira/browse/HIVE-10025
 Project: Hive
  Issue Type: Sub-task
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Fix For: llap


 If a daemon holds a task in its queue for a long time, the task will 
 eventually time out, but it isn't removed from the queue. Ideally, it 
 shouldn't be allowed to time out. Otherwise, the timeout should be handled so 
 that the task doesn't run, or starts and fails - likely a change in the 
 TaskCommunicator.
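 One of the options mentioned - handling the timeout so that a timed-out 
 fragment is actually removed from the queue and never runs - could look 
 roughly like the sketch below (made-up names, purely illustrative; the other 
 option, preventing the timeout in the first place, would live in the 
 TaskCommunicator):
 {code}
 import java.util.Queue;
 import java.util.concurrent.ConcurrentLinkedQueue;
 import java.util.concurrent.Executors;
 import java.util.concurrent.ScheduledExecutorService;
 import java.util.concurrent.TimeUnit;

 class QueuedWorkTimeoutSketch {
   private final Queue<Runnable> waitQueue = new ConcurrentLinkedQueue<>();
   private final ScheduledExecutorService timeoutExecutor =
       Executors.newSingleThreadScheduledExecutor();

   /** Enqueue work and evict it if it hasn't been picked up within the timeout. */
   void enqueue(Runnable work, long timeoutMs) {
     waitQueue.add(work);
     timeoutExecutor.schedule(() -> {
       // If the work is still queued after the timeout, drop it so it doesn't
       // start and then fail.
       if (waitQueue.remove(work)) {
         System.err.println("Evicted timed-out queued work: " + work);
       }
     }, timeoutMs, TimeUnit.MILLISECONDS);
   }
 }
 {code}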



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HIVE-10013) NPE in LLAP logs in heartbeat

2015-04-07 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth resolved HIVE-10013.
---
Resolution: Done

This should be fixed as part of TEZ-2257. Please re-open if seen again.

 NPE in LLAP logs in heartbeat
 -

 Key: HIVE-10013
 URL: https://issues.apache.org/jira/browse/HIVE-10013
 Project: Hive
  Issue Type: Sub-task
Reporter: Sergey Shelukhin

 {noformat}
 2015-03-18 17:28:37,559 
 [TezTaskRunner_attempt_1424502260528_1294_1_00_25_0(container_1_1294_01_26_sershe_20150318172752_5ce4647e-177c-4b1e-8dfa-462230735854:1_Map
  1_25_0)] INFO org.apache.tez.runtime.task.TezTaskRunner: Encounted an error 
 while executing task: attempt_1424502260528_1294_1_00_25_0
 java.lang.NullPointerException
   at 
 org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.access$400(TaskReporter.java:120)
   at 
 org.apache.tez.runtime.task.TaskReporter.addEvents(TaskReporter.java:386)
   at 
 org.apache.tez.runtime.task.TezTaskRunner.addEvents(TezTaskRunner.java:278)
   at 
 org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.sendTaskGeneratedEvents(LogicalIOProcessorRuntimeTask.java:596)
   at 
 org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.close(LogicalIOProcessorRuntimeTask.java:355)
   at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:181)
   at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:422)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
   at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:171)
   at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:167)
   at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   at java.lang.Thread.run(Thread.java:745)
 2015-03-18 17:28:37,559 
 [TezTaskRunner_attempt_1424502260528_1294_1_00_25_0(container_1_1294_01_26_sershe_20150318172752_5ce4647e-177c-4b1e-8dfa-462230735854:1_Map
  1_25_0)] INFO org.apache.tez.runtime.task.TezTaskRunner: Ignoring the 
 following exception since a previous exception is already registered
 java.lang.NullPointerException
   at 
 org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.access$300(TaskReporter.java:120)
   at 
 org.apache.tez.runtime.task.TaskReporter.taskFailed(TaskReporter.java:382)
   at 
 org.apache.tez.runtime.task.TezTaskRunner.sendFailure(TezTaskRunner.java:260)
   at 
 org.apache.tez.runtime.task.TezTaskRunner.access$600(TezTaskRunner.java:52)
   at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:227)
   at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:422)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
   at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:171)
   at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:167)
   at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   at java.lang.Thread.run(Thread.java:745)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HIVE-10025) LLAP: Queued work times out

2015-04-07 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth resolved HIVE-10025.
---
Resolution: Fixed

 LLAP: Queued work times out
 ---

 Key: HIVE-10025
 URL: https://issues.apache.org/jira/browse/HIVE-10025
 Project: Hive
  Issue Type: Sub-task
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Fix For: llap

 Attachments: HIVE-10025.1.txt


 If a daemon holds a task in its queue for a long time, the task will 
 eventually time out, but it isn't removed from the queue. Ideally, it 
 shouldn't be allowed to time out. Otherwise, the timeout should be handled so 
 that the task doesn't run, or starts and fails - likely a change in the 
 TaskCommunicator.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-10157) Make use of the timed version of getDagStatus in TezJobMonitor

2015-04-07 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated HIVE-10157:
--
Fix Version/s: 1.2.0

 Make use of the timed version of getDagStatus in TezJobMonitor
 --

 Key: HIVE-10157
 URL: https://issues.apache.org/jira/browse/HIVE-10157
 Project: Hive
  Issue Type: Improvement
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Fix For: 1.2.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-10185) LLAP: LLAP IO doesn't get invoked inside MiniTezCluster q tests

2015-04-02 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-10185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14393097#comment-14393097
 ] 

Siddharth Seth commented on HIVE-10185:
---

MiniLlapCluster is not used yet. There's a jira open to wire it in. The cache 
should be usable in containers though, with the correct configuration?

 LLAP: LLAP IO doesn't get invoked inside MiniTezCluster q tests
 ---

 Key: HIVE-10185
 URL: https://issues.apache.org/jira/browse/HIVE-10185
 Project: Hive
  Issue Type: Sub-task
Reporter: Sergey Shelukhin
Assignee: Siddharth Seth

 Took me a while to understand that it's not working. It might not be getting 
 initialized inside the container processes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-10012) LLAP: Hive sessions run before Slider registers to YARN registry fail to launch

2015-04-10 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-10012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14490497#comment-14490497
 ] 

Siddharth Seth commented on HIVE-10012:
---

Glanced over. Mostly looks good to me. This removes some of the log messages 
when a host is selected for locality, which may be useful for debugging.
Also there's a check for local addresses which needs to be added back to the 
FixedRegistryImpl.
{code}
inetAddress = InetAddress.getByName(host);
  if (NetUtils.isLocalAddress(inetAddress)) {
{code}
Required to match the hostname reported by a daemon and the one used by the 
scheduler.
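As a rough sketch of that check - the placement and the surrounding names are 
assumptions here, only InetAddress and Hadoop's NetUtils.isLocalAddress come 
from the snippet above:
{code}
import java.net.InetAddress;
import java.net.UnknownHostException;

import org.apache.hadoop.net.NetUtils;

public class LocalHostCheckSketch {
  /** True if the daemon's reported host resolves to an address on this machine. */
  static boolean isLocalDaemonHost(String host) {
    try {
      InetAddress inetAddress = InetAddress.getByName(host);
      return NetUtils.isLocalAddress(inetAddress);
    } catch (UnknownHostException e) {
      // Unresolvable hosts are treated as non-local.
      return false;
    }
  }
}
{code}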

 LLAP: Hive sessions run before Slider registers to YARN registry fail to 
 launch
 ---

 Key: HIVE-10012
 URL: https://issues.apache.org/jira/browse/HIVE-10012
 Project: Hive
  Issue Type: Sub-task
Affects Versions: llap
Reporter: Gopal V
Assignee: Gopal V
 Fix For: llap

 Attachments: HIVE-10012.1.patch, HIVE-10012.wip1.patch


 The LLAP YARN registry only registers entries after at least one daemon is up.
 Any Tez session starting before that will end up with an error listing 
 zookeeper directories.
 {code}
 2015-03-18 16:54:21,392 FATAL [main] app.DAGAppMaster: Error starting 
 DAGAppMaster
 org.apache.hadoop.service.ServiceStateException: 
 org.apache.hadoop.fs.PathNotFoundException: 
 `/users/sershe/services/org-apache-hive/llap0/components/workers':
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-10279) LLAP: Allow the runtime to check whether a task can run to completion

2015-04-09 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated HIVE-10279:
--
Fix Version/s: llap

 LLAP: Allow the runtime to check whether a task can run to completion
 -

 Key: HIVE-10279
 URL: https://issues.apache.org/jira/browse/HIVE-10279
 Project: Hive
  Issue Type: Sub-task
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Fix For: llap


 As part of pre-empting running tasks and deciding which tasks can run, allow 
 the runtime to check whether a queued or running task has all its sources 
 complete and can run through to completion, without waiting for sources to 
 finish.
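 A minimal sketch of what such a check might look like - the names are 
 hypothetical, and in the real code the source state would come from the Tez 
 runtime:
 {code}
 import java.util.Collection;

 class FinishableCheckSketch {
   interface SourceState {
     boolean isComplete();
   }

   /** A fragment can run to completion only if every upstream source is done. */
   static boolean canFinish(Collection<SourceState> sources) {
     for (SourceState source : sources) {
       if (!source.isComplete()) {
         return false;
       }
     }
     return true;
   }
 }
 {code}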



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HIVE-10767) LLAP: Improve the way task finishable information is processed

2015-05-19 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth resolved HIVE-10767.
---
   Resolution: Fixed
Fix Version/s: llap

 LLAP: Improve the way task finishable information is processed
 --

 Key: HIVE-10767
 URL: https://issues.apache.org/jira/browse/HIVE-10767
 Project: Hive
  Issue Type: Sub-task
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Fix For: llap

 Attachments: HIVE-10767.1.txt






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Reopened] (HIVE-10764) LLAP: Wait queue scheduler goes into tight loop

2015-05-19 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth reopened HIVE-10764:
---

 LLAP: Wait queue scheduler goes into tight loop
 ---

 Key: HIVE-10764
 URL: https://issues.apache.org/jira/browse/HIVE-10764
 Project: Hive
  Issue Type: Sub-task
Affects Versions: llap
Reporter: Prasanth Jayachandran
Assignee: Prasanth Jayachandran
 Fix For: llap

 Attachments: HIVE-10764.patch


 {code}
 if (!task.canFinish() || numSlotsAvailable.get() == 0) {
 {code}
 this condition makes it run in a tight loop if no slots are available and the 
 task is finishable.
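 One way to avoid the spin is to block until the relevant state changes rather 
 than re-checking in a loop. A simplified sketch of that blocking pattern 
 (made-up names, and it ignores the preemption path entirely - this is not the 
 committed fix):
 {code}
 import java.util.concurrent.locks.Condition;
 import java.util.concurrent.locks.ReentrantLock;

 class WaitQueueSketch {
   interface Task {
     boolean canFinish();
   }

   private final ReentrantLock lock = new ReentrantLock();
   private final Condition stateChanged = lock.newCondition();
   private int slotsAvailable;

   /** Blocks until the task is finishable and a slot is free, instead of spinning. */
   void awaitSchedulable(Task task) throws InterruptedException {
     lock.lock();
     try {
       while (!task.canFinish() || slotsAvailable == 0) {
         stateChanged.await();   // woken by releaseSlot() or finishable-state updates
       }
       slotsAvailable--;
     } finally {
       lock.unlock();
     }
   }

   void releaseSlot() {
     lock.lock();
     try {
       slotsAvailable++;
       stateChanged.signalAll();
     } finally {
       lock.unlock();
     }
   }
 }
 {code}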



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HIVE-10764) LLAP: Wait queue scheduler goes into tight loop

2015-05-19 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth resolved HIVE-10764.
---
Resolution: Implemented

Done as part of HIVE-10767. The patch here was reverted.

 LLAP: Wait queue scheduler goes into tight loop
 ---

 Key: HIVE-10764
 URL: https://issues.apache.org/jira/browse/HIVE-10764
 Project: Hive
  Issue Type: Sub-task
Affects Versions: llap
Reporter: Prasanth Jayachandran
Assignee: Prasanth Jayachandran
 Fix For: llap

 Attachments: HIVE-10764.patch


 {code}
 if (!task.canFinish() || numSlotsAvailable.get() == 0) {
 {code}
 this condition makes it run in a tight loop if no slots are available and the 
 task is finishable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-10767) LLAP: Improve the way task finishable information is processed

2015-05-19 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated HIVE-10767:
--
Attachment: HIVE-10767.1.txt

 LLAP: Improve the way task finishable information is processed
 --

 Key: HIVE-10767
 URL: https://issues.apache.org/jira/browse/HIVE-10767
 Project: Hive
  Issue Type: Sub-task
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Attachments: HIVE-10767.1.txt






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-10756) LLAP: Misc changes to daemon scheduling

2015-05-19 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated HIVE-10756:
--
Attachment: HIVE-10756.1.txt

[~prasanth_j] - could you take a quick look, please?

 LLAP: Misc changes to daemon scheduling
 ---

 Key: HIVE-10756
 URL: https://issues.apache.org/jira/browse/HIVE-10756
 Project: Hive
  Issue Type: Sub-task
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Fix For: llap

 Attachments: HIVE-10756.1.txt


 Running the completion callback in a separate thread to avoid potentially 
 unnecessary preemptions.
 Sending out a kill to the AM only if the task was actually killed.
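 The first change amounts to handing the completion callback to a dedicated 
 executor instead of running it on the scheduler thread; roughly (placeholder 
 names, not the actual TaskExecutorService wiring):
 {code}
 import java.util.concurrent.ExecutorService;
 import java.util.concurrent.Executors;

 class CompletionCallbackSketch {
   // Dedicated thread so completion handling never blocks the scheduler thread,
   // which could otherwise look like a reason to preempt running work.
   private final ExecutorService callbackExecutor =
       Executors.newSingleThreadExecutor(r -> {
         Thread t = new Thread(r, "completion-callback");
         t.setDaemon(true);
         return t;
       });

   void onFragmentComplete(Runnable callback) {
     callbackExecutor.submit(callback);
   }

   void shutdown() {
     callbackExecutor.shutdown();
   }
 }
 {code}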



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HIVE-10756) LLAP: Misc changes to daemon scheduling

2015-05-19 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth resolved HIVE-10756.
---
Resolution: Fixed

Thanks. Committed.

 LLAP: Misc changes to daemon scheduling
 ---

 Key: HIVE-10756
 URL: https://issues.apache.org/jira/browse/HIVE-10756
 Project: Hive
  Issue Type: Sub-task
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Fix For: llap

 Attachments: HIVE-10756.1.txt


 Running the completion callback in a separate thread to avoid potentially 
 unnecessary preemptions.
 Sending out a kill to the AM only if the task was actually killed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HIVE-10765) LLAP: NPE when calling abort on the TezProcessor

2015-05-20 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth resolved HIVE-10765.
---
   Resolution: Fixed
Fix Version/s: llap
 Assignee: Siddharth Seth

 LLAP: NPE when calling abort on the TezProcessor
 

 Key: HIVE-10765
 URL: https://issues.apache.org/jira/browse/HIVE-10765
 Project: Hive
  Issue Type: Sub-task
Reporter: Siddharth Seth
Assignee: Siddharth Seth
Priority: Critical
 Fix For: llap

 Attachments: HIVE-10765.1.txt, HIVE-10765.2.txt


 {code}
 2015-05-19 19:48:42,827 [Wait-Queue-Scheduler-0(null)] ERROR 
 org.apache.hadoop.hive.llap.daemon.impl.TaskExecutorService: Wait queue 
 scheduler worker exited with failure!
 java.lang.NullPointerException
   at 
 org.apache.hadoop.hive.ql.exec.tez.TezProcessor.abort(TezProcessor.java:177)
   at 
 org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.abortTask(LogicalIOProcessorRuntimeTask.java:698)
   at 
 org.apache.tez.runtime.task.TaskRunner2Callable.interruptTask(TaskRunner2Callable.java:118)
   at 
 org.apache.tez.runtime.task.TezTaskRunner2.killTask(TezTaskRunner2.java:261)
   at 
 org.apache.hadoop.hive.llap.daemon.impl.TaskRunnerCallable.killTask(TaskRunnerCallable.java:240)
   at 
 org.apache.hadoop.hive.llap.daemon.impl.TaskExecutorService.trySchedule(TaskExecutorService.java:262)
   at 
 org.apache.hadoop.hive.llap.daemon.impl.TaskExecutorService.access$700(TaskExecutorService.java:64)
   at 
 org.apache.hadoop.hive.llap.daemon.impl.TaskExecutorService$WaitQueueWorker.run(TaskExecutorService.java:162)
   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 {code}
 rrProc should be volatile. There likely need to be some checks around it to 
 ensure it is set up before it is used.
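 One way to realize that - safe publication plus a null check so an abort 
 arriving before the processor is set up becomes a no-op - is sketched below. 
 The surrounding class is illustrative; only the field name rrProc is taken 
 from the report:
 {code}
 class ProcessorAbortSketch {
   interface RecordProcessor {
     void process();
     void abort();
   }

   // volatile so abort(), called from the wait-queue scheduler thread, sees the
   // processor published by the task thread.
   private volatile RecordProcessor rrProc;

   void run(RecordProcessor proc) {
     rrProc = proc;
     proc.process();
   }

   void abort() {
     RecordProcessor proc = rrProc;
     if (proc != null) {   // abort may arrive before run() has set things up
       proc.abort();
     }
   }
 }
 {code}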



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-10765) LLAP: NPE when calling abort on the TezProcessor

2015-05-20 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated HIVE-10765:
--
Attachment: HIVE-10765.2.txt

Removed the volatile modifier which was part of the initial test patch. 
Committing.

 LLAP: NPE when calling abort on the TezProcessor
 

 Key: HIVE-10765
 URL: https://issues.apache.org/jira/browse/HIVE-10765
 Project: Hive
  Issue Type: Sub-task
Reporter: Siddharth Seth
Priority: Critical
 Attachments: HIVE-10765.1.txt, HIVE-10765.2.txt


 {code}
 2015-05-19 19:48:42,827 [Wait-Queue-Scheduler-0(null)] ERROR 
 org.apache.hadoop.hive.llap.daemon.impl.TaskExecutorService: Wait queue 
 scheduler worker exited with failure!
 java.lang.NullPointerException
   at 
 org.apache.hadoop.hive.ql.exec.tez.TezProcessor.abort(TezProcessor.java:177)
   at 
 org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.abortTask(LogicalIOProcessorRuntimeTask.java:698)
   at 
 org.apache.tez.runtime.task.TaskRunner2Callable.interruptTask(TaskRunner2Callable.java:118)
   at 
 org.apache.tez.runtime.task.TezTaskRunner2.killTask(TezTaskRunner2.java:261)
   at 
 org.apache.hadoop.hive.llap.daemon.impl.TaskRunnerCallable.killTask(TaskRunnerCallable.java:240)
   at 
 org.apache.hadoop.hive.llap.daemon.impl.TaskExecutorService.trySchedule(TaskExecutorService.java:262)
   at 
 org.apache.hadoop.hive.llap.daemon.impl.TaskExecutorService.access$700(TaskExecutorService.java:64)
   at 
 org.apache.hadoop.hive.llap.daemon.impl.TaskExecutorService$WaitQueueWorker.run(TaskExecutorService.java:162)
   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 {code}
 rrProc should be volatile. There likely need to be some checks around it to 
 ensure it is set up before it is used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (HIVE-10779) LLAP: Daemons should shutdown in case of fatal errors

2015-06-04 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth reassigned HIVE-10779:
-

Assignee: Siddharth Seth

 LLAP: Daemons should shutdown in case of fatal errors
 -

 Key: HIVE-10779
 URL: https://issues.apache.org/jira/browse/HIVE-10779
 Project: Hive
  Issue Type: Sub-task
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Attachments: HIVE-10779.1.txt


 For example, the scheduler loop exiting. Currently the daemons end up getting 
 stuck while still accepting new work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-10779) LLAP: Daemons should shutdown in case of fatal errors

2015-06-04 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated HIVE-10779:
--
Attachment: HIVE-10779.1.txt

The patch adds an UncaughtExceptionHandler and a shutdown hook to stop services.
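The rough shape of that, as a standalone sketch - the daemon class and its 
start/stop methods here are placeholders, not the real LlapDaemon API:
{code}
public class DaemonShutdownSketch {
  public static void main(String[] args) {
    final DaemonShutdownSketch daemon = new DaemonShutdownSketch();

    // A thread dying with an unhandled exception takes the process down instead
    // of leaving a half-alive daemon that still accepts new work.
    Thread.setDefaultUncaughtExceptionHandler((t, e) -> {
      System.err.println("Thread " + t.getName() + " died with: " + e);
      System.exit(1);
    });

    // A shutdown hook stops the services on JVM exit.
    Runtime.getRuntime().addShutdownHook(new Thread(daemon::stop, "shutdown-hook"));

    daemon.start();
  }

  void start() { /* start daemon services */ }

  void stop() { /* stop daemon services */ }
}
{code}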

 LLAP: Daemons should shutdown in case of fatal errors
 -

 Key: HIVE-10779
 URL: https://issues.apache.org/jira/browse/HIVE-10779
 Project: Hive
  Issue Type: Sub-task
Reporter: Siddharth Seth
 Attachments: HIVE-10779.1.txt


 For example, the scheduler loop exiting. Currently the daemons end up getting 
 stuck while still accepting new work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HIVE-10779) LLAP: Daemons should shutdown in case of fatal errors

2015-06-04 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth resolved HIVE-10779.
---
   Resolution: Fixed
Fix Version/s: llap

Committed to the llap branch.

 LLAP: Daemons should shutdown in case of fatal errors
 -

 Key: HIVE-10779
 URL: https://issues.apache.org/jira/browse/HIVE-10779
 Project: Hive
  Issue Type: Sub-task
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Fix For: llap

 Attachments: HIVE-10779.1.txt


 For example, the scheduler loop exiting. Currently the daemons end up getting 
 stuck while still accepting new work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-10961) LLAP: ShuffleHandler + Submit work init race condition

2015-06-09 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated HIVE-10961:
--
Attachment: HIVE-10961.1.txt

 LLAP: ShuffleHandler + Submit work init race condition
 --

 Key: HIVE-10961
 URL: https://issues.apache.org/jira/browse/HIVE-10961
 Project: Hive
  Issue Type: Sub-task
Affects Versions: llap
Reporter: Gopal V
Assignee: Siddharth Seth
 Fix For: llap

 Attachments: HIVE-10961.1.txt


 When a new node is flexed in, it accepts DAG requests before the shuffle 
 handler is set up, causing fatal errors:
 {code}
 DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:2
 FAILED: Execution Error, return code 2 from 
 org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Map 1, 
 vertexId=vertex_1433459966952_0729_1_00, diagnostics=[Task failed, 
 taskId=task_1t
 at 
 com.google.common.base.Preconditions.checkState(Preconditions.java:145)
 at 
 org.apache.hadoop.hive.llap.shufflehandler.ShuffleHandler.get(ShuffleHandler.java:353)
 at 
 org.apache.hadoop.hive.llap.daemon.impl.ContainerRunnerImpl.submitWork(ContainerRunnerImpl.java:192)
 at 
 org.apache.hadoop.hive.llap.daemon.impl.LlapDaemon.submitWork(LlapDaemon.java:301)
 at 
 org.apache.hadoop.hive.llap.daemon.impl.LlapDaemonProtocolServerImpl.submitWork(LlapDaemonProtocolServerImpl.java:75)
 at 
 org.apache.hadoop.hive.llap.daemon.rpc.LlapDaemonProtocolProtos$LlapDaemonProtocol$2.callBlockingMethod(LlapDaemonProtocolProtos.java:12094)
 at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:972)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2085)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2081)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1654)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2081)
 ], TaskAttempt 1 failed, 
 info=[org.apache.hadoop.ipc.RemoteException(java.lang.IllegalStateException): 
 ShuffleHandler must be started before invoking get
 at 
 com.google.common.base.Preconditions.checkState(Preconditions.java:145)
 at 
 org.apache.hadoop.hive.llap.shufflehandler.ShuffleHandler.get(ShuffleHandler.java:353)
 at 
 org.apache.hadoop.hive.llap.daemon.impl.ContainerRunnerImpl.submitWork(ContainerRunnerImpl.java:192)
 at 
 org.apache.hadoop.hive.llap.daemon.impl.LlapDaemon.submitWork(LlapDaemon.java:301)
 at 
 org.apache.hadoop.hive.llap.daemon.impl.LlapDaemonProtocolServerImpl.submitWork(LlapDaemonProtocolServerImpl.java:75)
 at 
 org.apache.hadoop.hive.llap.daemon.rpc.LlapDaemonProtocolProtos$LlapDaemonProtocol$2.callBlockingMethod(LlapDaemonProtocolProtos.java:12094)
 at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:972)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2085)
 {code}
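 One way to close this kind of race is to order startup so the shuffle handler 
 is running before the protocol server starts taking submissions; a sketch of 
 that intent only (placeholder names, not the actual patch):
 {code}
 class DaemonStartOrderSketch {
   private final Service shuffleHandler = new Service("shuffle-handler");
   private final Service protocolServer = new Service("protocol-server");

   void serviceStart() {
     shuffleHandler.start();   // must be ready before any submitWork() can arrive
     protocolServer.start();   // only now start taking work from AMs
   }

   static class Service {
     private final String name;
     Service(String name) { this.name = name; }
     void start() { System.out.println(name + " started"); }
   }
 }
 {code}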



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HIVE-10961) LLAP: ShuffleHandler + Submit work init race condition

2015-06-09 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth resolved HIVE-10961.
---
Resolution: Fixed

 LLAP: ShuffleHandler + Submit work init race condition
 --

 Key: HIVE-10961
 URL: https://issues.apache.org/jira/browse/HIVE-10961
 Project: Hive
  Issue Type: Sub-task
Affects Versions: llap
Reporter: Gopal V
Assignee: Siddharth Seth
 Fix For: llap

 Attachments: HIVE-10961.1.txt


 When a new node is flexed in, it accepts DAG requests before the shuffle 
 handler is set up, causing fatal errors:
 {code}
 DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:2
 FAILED: Execution Error, return code 2 from 
 org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Map 1, 
 vertexId=vertex_1433459966952_0729_1_00, diagnostics=[Task failed, 
 taskId=task_1t
 at 
 com.google.common.base.Preconditions.checkState(Preconditions.java:145)
 at 
 org.apache.hadoop.hive.llap.shufflehandler.ShuffleHandler.get(ShuffleHandler.java:353)
 at 
 org.apache.hadoop.hive.llap.daemon.impl.ContainerRunnerImpl.submitWork(ContainerRunnerImpl.java:192)
 at 
 org.apache.hadoop.hive.llap.daemon.impl.LlapDaemon.submitWork(LlapDaemon.java:301)
 at 
 org.apache.hadoop.hive.llap.daemon.impl.LlapDaemonProtocolServerImpl.submitWork(LlapDaemonProtocolServerImpl.java:75)
 at 
 org.apache.hadoop.hive.llap.daemon.rpc.LlapDaemonProtocolProtos$LlapDaemonProtocol$2.callBlockingMethod(LlapDaemonProtocolProtos.java:12094)
 at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:972)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2085)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2081)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1654)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2081)
 ], TaskAttempt 1 failed, 
 info=[org.apache.hadoop.ipc.RemoteException(java.lang.IllegalStateException): 
 ShuffleHandler must be started before invoking get
 at 
 com.google.common.base.Preconditions.checkState(Preconditions.java:145)
 at 
 org.apache.hadoop.hive.llap.shufflehandler.ShuffleHandler.get(ShuffleHandler.java:353)
 at 
 org.apache.hadoop.hive.llap.daemon.impl.ContainerRunnerImpl.submitWork(ContainerRunnerImpl.java:192)
 at 
 org.apache.hadoop.hive.llap.daemon.impl.LlapDaemon.submitWork(LlapDaemon.java:301)
 at 
 org.apache.hadoop.hive.llap.daemon.impl.LlapDaemonProtocolServerImpl.submitWork(LlapDaemonProtocolServerImpl.java:75)
 at 
 org.apache.hadoop.hive.llap.daemon.rpc.LlapDaemonProtocolProtos$LlapDaemonProtocol$2.callBlockingMethod(LlapDaemonProtocolProtos.java:12094)
 at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:972)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2085)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-10947) LLAP: preemption appears to count against failure count for the task

2015-06-09 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-10947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14579604#comment-14579604
 ] 

Siddharth Seth commented on HIVE-10947:
---

If this happens again, please capture the logs. I'm not sure these tasks were 
actually preempted. They may have failed for other reasons. There are 20 
additional attempts, most of which were KILLED (likely due to preemption) 
before the 2 FAILED attempts - which caused the task to fail.

 LLAP: preemption appears to count against failure count for the task
 

 Key: HIVE-10947
 URL: https://issues.apache.org/jira/browse/HIVE-10947
 Project: Hive
  Issue Type: Sub-task
Reporter: Sergey Shelukhin
Assignee: Siddharth Seth

 Looks like the following stack, in a very parallel workload, counts as a task 
 error and the DAG fails:
 {noformat}
 : Error while processing statement: FAILED: Execution Error, return code 2 
 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, 
 vertexName=Map 1, vertexId=vertex_1433459966952_0482_4_03, diagnostics=[Task 
 failed, taskId=task_1433459966952_0482_4_03_22, diagnostics=[TaskAttempt 
 0 killed, TaskAttempt 1 killed, TaskAttempt 2 killed, TaskAttempt 3 killed, 
 TaskAttempt 4 killed, TaskAttempt 5 killed, TaskAttempt 6 killed, TaskAttempt 
 7 killed, TaskAttempt 8 killed, TaskAttempt 9 killed, TaskAttempt 10 killed, 
 TaskAttempt 11 killed, TaskAttempt 12 killed, TaskAttempt 13 killed, 
 TaskAttempt 14 killed, TaskAttempt 15 killed, TaskAttempt 16 killed, 
 TaskAttempt 17 killed, TaskAttempt 18 killed, TaskAttempt 19 failed, 
 info=[Error: Failure while running task: 
 attempt_1433459966952_0482_4_03_22_19:java.lang.RuntimeException: 
 java.lang.RuntimeException: Map operator initialization failed
   at 
 org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:181)
   at 
 org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:146)
   at 
 org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:349)
   at 
 org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:71)
   at 
 org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:60)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:422)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1654)
   at 
 org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:60)
   at 
 org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:35)
   at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   at java.lang.Thread.run(Thread.java:745)
 Caused by: java.lang.RuntimeException: Map operator initialization failed
   at 
 org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.init(MapRecordProcessor.java:256)
   at 
 org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:157)
   ... 14 more
 Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Async 
 initialization failed
   at 
 org.apache.hadoop.hive.ql.exec.Operator.completeInitialization(Operator.java:416)
   at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:388)
   at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:511)
   at 
 org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:464)
   at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:378)
   at 
 org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.init(MapRecordProcessor.java:241)
   ... 15 more
 Caused by: java.util.concurrent.CancellationException
   at java.util.concurrent.FutureTask.report(FutureTask.java:121)
   at java.util.concurrent.FutureTask.get(FutureTask.java:192)
   at 
 org.apache.hadoop.hive.ql.exec.Operator.completeInitialization(Operator.java:408)
   ... 20 more
 ], TaskAttempt 20 failed, info=[Error: Failure while running task: 
 attempt_1433459966952_0482_4_03_22_20:java.lang.RuntimeException: 
 java.lang.RuntimeException: Map operator initialization failed
   at 
 org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:181)
   at 
 org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:146)
   at 
 

[jira] [Commented] (HIVE-11046) Filesystem Closed Exception

2015-06-18 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14592478#comment-14592478
 ] 

Siddharth Seth commented on HIVE-11046:
---

[~raj_velu] - a bunch of questions:
Do you have additional logs from the container where this error was seen? Are 
there steps to reproduce, and how often are you able to reproduce this?
Is this using the Tez 0.7.0 release or a snapshot?


 Filesystem Closed Exception
 ---

 Key: HIVE-11046
 URL: https://issues.apache.org/jira/browse/HIVE-11046
 Project: Hive
  Issue Type: Bug
  Components: Hive, Tez
Affects Versions: 0.7.0, 1.2.0
 Environment: Hive 1.2.0, Tez0.7.0, HDP2.2, Hadoop 2.6
Reporter: Soundararajan Velu

  TaskAttempt 2 failed, info=[Error: Failure while running 
 task:java.lang.RuntimeException: 
 org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: 
 Filesystem closed
 at 
 org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:171)
 at 
 org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:137)
 at 
 org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:345)
 at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179)
 at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
 at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:171)
 at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:167)
 at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
 at java.util.concurrent.FutureTask.run(FutureTask.java:262)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
 java.io.IOException: Filesystem closed
 at 
 org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:71)
 at 
 org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:290)
 at 
 org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:148)
 ... 14 more
 Caused by: java.io.IOException: Filesystem closed
 at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:795)
 at 
 org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:629)
 at java.io.FilterInputStream.close(FilterInputStream.java:181)
 at 
 org.apache.hadoop.io.compress.DecompressorStream.close(DecompressorStream.java:205)
 at org.apache.hadoop.util.LineReader.close(LineReader.java:150)
 at 
 org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:282)
 at 
 org.apache.hadoop.hive.ql.io.HiveRecordReader.doClose(HiveRecordReader.java:50)
 at 
 org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.close(HiveContextAwareRecordReader.java:104)
 at 
 org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:170)
 at 
 org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.next(TezGroupedSplitsInputFormat.java:138)
 at 
 org.apache.tez.mapreduce.lib.MRReaderMapred.next(MRReaderMapred.java:113)
 at 
 org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:61)
 ... 16 more



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HIVE-10762) LLAP: Kill any fragments running in a daemon when a query completes

2015-06-17 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth resolved HIVE-10762.
---
Resolution: Fixed

Committed to the llap branch.

 LLAP: Kill any fragments running in a daemon when a query completes
 ---

 Key: HIVE-10762
 URL: https://issues.apache.org/jira/browse/HIVE-10762
 Project: Hive
  Issue Type: Sub-task
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Fix For: llap

 Attachments: HIVE-10762.1.txt


 A query may complete due to failure or being KILLED. Fragments running in 
 daemons should be killed in these scenarios.
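 Very roughly, the idea is to track running fragments per query and kill them 
 when the query is reported complete (hypothetical names; the real plumbing 
 goes through the query tracking and task runner code in the daemon):
 {code}
 import java.util.Map;
 import java.util.concurrent.ConcurrentHashMap;

 class QueryCompleteKillSketch {
   interface Fragment {
     void kill();
   }

   // Running fragments indexed by the query they belong to.
   private final Map<String, Map<String, Fragment>> runningByQuery =
       new ConcurrentHashMap<>();

   /** Called when the AM reports that a query finished, failed, or was killed. */
   void queryComplete(String queryId) {
     Map<String, Fragment> fragments = runningByQuery.remove(queryId);
     if (fragments == null) {
       return;
     }
     for (Fragment fragment : fragments.values()) {
       fragment.kill();   // stop any work still running for the completed query
     }
   }
 }
 {code}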



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

