[jira] [Commented] (HIVE-19169) llap: Timed out after 90 secs

2018-04-18 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-19169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16443029#comment-16443029
 ] 

Sergey Shelukhin commented on HIVE-19169:
-

Can you check GC logs to see if it's hitting heavy GC that causes things to 
time out?
Overall, if you can send me logs via some channel it might be helpful.
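
A rough way to do that check, sketched under assumptions: if the daemons run on a Java 8 style HotSpot JVM with detailed GC logging enabled, each collection entry should end with a `[Times: user=... sys=..., real=... secs]` section, and the `real` value is the wall-clock time of that collection. The function name and the threshold default below are hypothetical, and newer JVMs with unified logging use a different format that this pattern will not match.

```python
import re

# Wall-clock time from a HotSpot "[Times: user=... sys=..., real=1.23 secs]" entry.
# Assumption: Java 8 style GC logs; other formats need a different pattern.
REAL_PAUSE = re.compile(r"real=(\d+\.\d+) secs")


def long_gc_pauses(lines, threshold_secs=5.0):
    """Return (line_number, seconds) for each GC entry at or above the threshold."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        for match in REAL_PAUSE.finditer(line):
            secs = float(match.group(1))
            if secs >= threshold_secs:
                hits.append((lineno, secs))
    return hits
```

It could be called as `long_gc_pauses(open(path))` with `path` pointing at a daemon GC log. Collections that take a noticeable share of the 90 seconds, close to the time of the failed attempts, would support the heavy-GC theory.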

> llap: Timed out after 90 secs
> -
>
> Key: HIVE-19169
> URL: https://issues.apache.org/jira/browse/HIVE-19169
> Project: Hive
> Issue Type: Bug
> Components: llap
> Reporter: Zoltan Haindrich
> Priority: Major
>
> With a more or less recent hive/master and tpcds1000, while running only 1 query 
> at a time on the cluster, a "timeout" sometimes happens. This could even be 
> a misconfiguration problem; I'm not sure I've set it up correctly.
> From what I see, the attempt seems to have entered the queue and been 
> accepted. After almost 90 seconds there are messages that it will be 
> pre-empted, and then the 90 sec timeout happens.
> hive.log; example: attempt_1522319554594_0065_19_05_000119_14 failed
> {code}
> 2018-04-10T13:20:25,178 ERROR [HiveServer2-Background-Pool: Thread-4194]: 
> SessionState (SessionState.java:printError(1214)) - Vertex failed, 
> vertexName=Reducer 3, vertexId=vertex_1522319554594_0065_19_05, 
> diagnostics=[Task failed, taskId=task_1522319554594_0065_19_05_000119, 
> diagnostics=[TaskAttempt 0 killed, TaskAttempt 1 failed, 
> info=[AttemptID:attempt_1522319554594_0065_19_05_000119_1 Timed out after 90 
> secs], TaskAttempt 2 killed, TaskAttempt 3 killed, TaskAttempt 4 killed, 
> TaskAttempt 5 killed, TaskAttempt 6 killed, TaskAttempt 7 killed, TaskAttempt 
> 8 killed, TaskAttempt 9 killed, TaskAttempt 10 killed, TaskAttempt 11 killed, 
> TaskAttempt 12 failed, 
> info=[AttemptID:attempt_1522319554594_0065_19_05_000119_12 Timed out after 90 
> secs], TaskAttempt 13 killed, TaskAttempt 14 failed, 
> info=[AttemptID:attempt_1522319554594_0065_19_05_000119_14 Timed out after 90 
> secs], TaskAttempt 15 failed, 
> info=[org.apache.hadoop.ipc.RemoteException(java.lang.RuntimeException): 
> attempt_1522319554594_0065_19_08_18_6 was not registered and couldn't be 
> removed
> 2018-04-10T13:20:25,260 ERROR [HiveServer2-Background-Pool: Thread-4194]: 
> ql.Driver (SessionState.java:printError(1214)) - FAILED: Execution Error, 
> return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, 
> vertexName=Reducer 3, vertexId=vertex_1522319554594_0065_19_05, 
> diagnostics=[Task failed, taskId=task_1522319554594_0065_19_05_000119, 
> diagnostics=[TaskAttempt 0 killed, TaskAttempt 1 failed, 
> info=[AttemptID:attempt_1522319554594_0065_19_05_000119_1 Timed out after 90 
> secs], TaskAttempt 2 killed, TaskAttempt 3 killed, TaskAttempt 4 killed, 
> TaskAttempt 5 killed, TaskAttempt 6 killed, TaskAttempt 7 killed, TaskAttempt 
> 8 killed, TaskAttempt 9 killed, TaskAttempt 10 killed, TaskAttempt 11 killed, 
> TaskAttempt 12 failed, 
> info=[AttemptID:attempt_1522319554594_0065_19_05_000119_12 Timed out after 90 
> secs], TaskAttempt 13 killed, TaskAttempt 14 failed, 
> info=[AttemptID:attempt_1522319554594_0065_19_05_000119_14 Timed out after 90 
> secs], TaskAttempt 15 failed, 
> info=[org.apache.hadoop.ipc.RemoteException(java.lang.RuntimeException): 
> attempt_1522319554594_0065_19_08_18_6 was not registered and couldn't be 
> removed
> org.apache.hive.service.cli.HiveSQLException: Error while processing 
> statement: FAILED: Execution Error, return code 2 from 
> org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Reducer 
> 3, vertexId=vertex_1522319554594_0065_19_05, diagnostics=[Task failed, 
> taskId=task_1522319554594_0065_19_05_000119, diagnostics=[TaskAttempt 0 
> killed, TaskAttempt 1 failed, 
> info=[AttemptID:attempt_1522319554594_0065_19_05_000119_1 Timed out after 90 
> secs], TaskAttempt 2 killed, TaskAttempt 3 killed, TaskAttempt 4 killed, 
> TaskAttempt 5 killed, TaskAttempt 6 killed, TaskAttempt 7 killed, TaskAttempt 
> 8 killed, TaskAttempt 9 killed, TaskAttempt 10 killed, TaskAttempt 11 killed, 
> TaskAttempt 12 failed, 
> info=[AttemptID:attempt_1522319554594_0065_19_05_000119_12 Timed out after 90 
> secs], TaskAttempt 13 killed, TaskAttempt 14 failed, 
> info=[AttemptID:attempt_1522319554594_0065_19_05_000119_14 Timed out after 90 
> secs], TaskAttempt 15 failed, 
> info=[org.apache.hadoop.ipc.RemoteException(java.lang.RuntimeException): 
> attempt_1522319554594_0065_19_08_18_6 was not registered and couldn't be 
> removed
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Vertex failed, 
> vertexName=Reducer 3, vertexId=vertex_1522319554594_0065_19_05, 
> diagnostics=[Task failed, taskId=task_1522319554594_0065_19_05_000119, 
> diagnostics=[TaskAttempt 0 killed, TaskAttempt 1 failed, 
> info=[Att
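
The diagnostics string in the quoted hive.log excerpt is hard to read by eye, so here is a minimal sketch that tallies the attempt outcomes. The input is the fragment copied from the log above; the function name `summarize` and the two regular expressions are illustrative choices, not anything that Hive or Tez provides.

```python
import re
from collections import Counter

# Diagnostics fragment copied from the hive.log excerpt in the issue description.
DIAGNOSTICS = (
    "TaskAttempt 0 killed, TaskAttempt 1 failed, "
    "info=[AttemptID:attempt_1522319554594_0065_19_05_000119_1 Timed out after 90 secs], "
    "TaskAttempt 2 killed, TaskAttempt 3 killed, TaskAttempt 4 killed, "
    "TaskAttempt 5 killed, TaskAttempt 6 killed, TaskAttempt 7 killed, "
    "TaskAttempt 8 killed, TaskAttempt 9 killed, TaskAttempt 10 killed, "
    "TaskAttempt 11 killed, TaskAttempt 12 failed, "
    "info=[AttemptID:attempt_1522319554594_0065_19_05_000119_12 Timed out after 90 secs], "
    "TaskAttempt 13 killed, TaskAttempt 14 failed, "
    "info=[AttemptID:attempt_1522319554594_0065_19_05_000119_14 Timed out after 90 secs], "
    "TaskAttempt 15 failed, "
    "info=[org.apache.hadoop.ipc.RemoteException(java.lang.RuntimeException): "
    "attempt_1522319554594_0065_19_08_18_6 was not registered and couldn't be removed"
)


def summarize(diagnostics):
    """Count killed/failed attempts and list the attempt IDs that timed out."""
    states = re.findall(r"TaskAttempt \d+ (killed|failed)", diagnostics)
    timed_out = re.findall(
        r"AttemptID:(attempt_\w+) Timed out after \d+ secs", diagnostics
    )
    return Counter(states), timed_out


outcomes, timed_out = summarize(DIAGNOSTICS)
print(dict(outcomes))  # {'killed': 12, 'failed': 4}
print(timed_out)
```

On this excerpt, 12 of the 16 attempts were killed and 4 failed. Three of the failures are 90 second timeouts (attempts 1, 12 and 14); the last failure, attempt 15, is the RemoteException instead.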

[jira] [Commented] (HIVE-19169) llap: Timed out after 90 secs

2018-04-18 Thread Zoltan Haindrich (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-19169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16442444#comment-16442444
 ] 

Zoltan Haindrich commented on HIVE-19169:
-

I'm seeing this again with tpcds#64; I think this might be related to the fact 
that the query has a lot of locks.




[jira] [Commented] (HIVE-19169) llap: Timed out after 90 secs

2018-04-11 Thread Zoltan Haindrich (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-19169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433507#comment-16433507
 ] 

Zoltan Haindrich commented on HIVE-19169:
-

To provide better context: I'm working on reoptimizing the queries, so these 
queries are known to have a bad initial plan (usually a map join is used, but 
it's not going to work and will eventually hit an OOM). The problem is that it 
fails with a timeout, in which case there won't be a retry...


[jira] [Commented] (HIVE-19169) llap: Timed out after 90 secs

2018-04-11 Thread Zoltan Haindrich (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-19169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433495#comment-16433495
 ] 

Zoltan Haindrich commented on HIVE-19169:
-

This could be a known issue, but I wasn't able to locate a relevant jira. I'm 
using a version of master that is at least 2-3 weeks old.
