[jira] [Commented] (HIVE-17091) "Timed out getting readerEvents" error from external LLAP client

2017-07-18 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16091978#comment-16091978
 ] 

Hive QA commented on HIVE-17091:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12877838/HIVE-17091.2.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 13 failed/errored test(s), 11074 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[ppd_windowing2] 
(batchId=10)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[llap_smb] 
(batchId=143)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_dynamic_partition_pruning]
 (batchId=167)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_dynamic_partition_pruning_2]
 (batchId=169)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_explainuser_1]
 (batchId=168)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_use_op_stats]
 (batchId=167)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_use_ts_stats_for_mapjoin]
 (batchId=168)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_vectorized_dynamic_partition_pruning]
 (batchId=167)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query14] 
(batchId=233)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query23] 
(batchId=233)
org.apache.hive.hcatalog.api.TestHCatClient.testPartitionRegistrationWithCustomSchema
 (batchId=178)
org.apache.hive.hcatalog.api.TestHCatClient.testPartitionSpecRegistrationWithCustomSchema
 (batchId=178)
org.apache.hive.hcatalog.api.TestHCatClient.testTableSchemaPropagation 
(batchId=178)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/6076/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/6076/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-6076/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 13 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12877838 - PreCommit-HIVE-Build

> "Timed out getting readerEvents" error from external LLAP client
> 
>
> Key: HIVE-17091
> URL: https://issues.apache.org/jira/browse/HIVE-17091
> Project: Hive
>  Issue Type: Sub-task
>  Components: llap
>Reporter: Jason Dere
>Assignee: Jason Dere
> Attachments: HIVE-17091.1.patch, HIVE-17091.2.patch
>
>
> {noformat}
> Caused by: java.io.IOException: Timed out getting readerEvents
>   at 
> org.apache.hadoop.hive.llap.LlapBaseRecordReader.getReaderEvent(LlapBaseRecordReader.java:261)
>   at 
> org.apache.hadoop.hive.llap.LlapBaseRecordReader.next(LlapBaseRecordReader.java:148)
>   at 
> org.apache.hadoop.hive.llap.LlapBaseRecordReader.next(LlapBaseRecordReader.java:48)
>   at 
> org.apache.hadoop.hive.llap.LlapRowRecordReader.next(LlapRowRecordReader.java:121)
>   at 
> org.apache.hadoop.hive.llap.LlapRowRecordReader.next(LlapRowRecordReader.java:68)
> {noformat}





[jira] [Commented] (HIVE-17091) "Timed out getting readerEvents" error from external LLAP client

2017-07-17 Thread Jason Dere (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16090901#comment-16090901
 ] 

Jason Dere commented on HIVE-17091:
---

That is true, the task heartbeat should either be interrupted, or 
getReaderEvent() will end up receiving the ErrorEvent that is generated by the 
heartbeat timeout.
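
Roughly the kind of pattern being described, as a hypothetical sketch (not the 
actual LlapBaseRecordReader code): the reader blocks on an event queue with no 
fixed timeout, and is released either by a normal DONE event or by an ERROR 
event that the heartbeat-timeout path posts when the task goes silent.
{noformat}
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch -- not the actual LlapBaseRecordReader implementation.
public class ReaderEventSketch {

  enum EventType { DONE, ERROR }

  static final class ReaderEvent {
    final EventType type;
    final String message;
    ReaderEvent(EventType type, String message) {
      this.type = type;
      this.message = message;
    }
  }

  private final LinkedBlockingQueue<ReaderEvent> readerEvents =
      new LinkedBlockingQueue<>();

  // Called from the umbilical/heartbeat handling code, including the
  // heartbeat-timeout path, which posts an ERROR event.
  public void handleEvent(ReaderEvent event) {
    readerEvents.add(event);
  }

  // Called by the reader after it hits end-of-data: wait indefinitely rather
  // than using a fixed timeout; the heartbeat-timeout path is responsible
  // for unblocking us if the task stops heartbeating.
  public ReaderEvent getReaderEvent() throws java.io.IOException {
    try {
      ReaderEvent event = readerEvents.take();  // blocks until an event arrives
      if (event.type == EventType.ERROR) {
        throw new java.io.IOException("Reader failed: " + event.message);
      }
      return event;
    } catch (InterruptedException ie) {
      Thread.currentThread().interrupt();
      throw new java.io.IOException("Interrupted waiting for reader event", ie);
    }
  }
}
{noformat}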

> "Timed out getting readerEvents" error from external LLAP client
> 
>
> Key: HIVE-17091
> URL: https://issues.apache.org/jira/browse/HIVE-17091
> Project: Hive
>  Issue Type: Sub-task
>  Components: llap
>Reporter: Jason Dere
>Assignee: Jason Dere
> Attachments: HIVE-17091.1.patch
>
>
> {noformat}
> Caused by: java.io.IOException: Timed out getting readerEvents
>   at 
> org.apache.hadoop.hive.llap.LlapBaseRecordReader.getReaderEvent(LlapBaseRecordReader.java:261)
>   at 
> org.apache.hadoop.hive.llap.LlapBaseRecordReader.next(LlapBaseRecordReader.java:148)
>   at 
> org.apache.hadoop.hive.llap.LlapBaseRecordReader.next(LlapBaseRecordReader.java:48)
>   at 
> org.apache.hadoop.hive.llap.LlapRowRecordReader.next(LlapRowRecordReader.java:121)
>   at 
> org.apache.hadoop.hive.llap.LlapRowRecordReader.next(LlapRowRecordReader.java:68)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17091) "Timed out getting readerEvents" error from external LLAP client

2017-07-17 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16090889#comment-16090889
 ] 

Siddharth Seth commented on HIVE-17091:
---

[~jdere] - after the change, the read itself no longer does any timeout 
handling. Instead, if task heartbeats are not received within a certain amount 
of time, that mechanism will cause the task to time out?

+1 for the patch, if my understanding is correct.

> "Timed out getting readerEvents" error from external LLAP client
> 
>
> Key: HIVE-17091
> URL: https://issues.apache.org/jira/browse/HIVE-17091
> Project: Hive
>  Issue Type: Sub-task
>  Components: llap
>Reporter: Jason Dere
>Assignee: Jason Dere
> Attachments: HIVE-17091.1.patch
>
>
> {noformat}
> Caused by: java.io.IOException: Timed out getting readerEvents
>   at 
> org.apache.hadoop.hive.llap.LlapBaseRecordReader.getReaderEvent(LlapBaseRecordReader.java:261)
>   at 
> org.apache.hadoop.hive.llap.LlapBaseRecordReader.next(LlapBaseRecordReader.java:148)
>   at 
> org.apache.hadoop.hive.llap.LlapBaseRecordReader.next(LlapBaseRecordReader.java:48)
>   at 
> org.apache.hadoop.hive.llap.LlapRowRecordReader.next(LlapRowRecordReader.java:121)
>   at 
> org.apache.hadoop.hive.llap.LlapRowRecordReader.next(LlapRowRecordReader.java:68)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17091) "Timed out getting readerEvents" error from external LLAP client

2017-07-16 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089063#comment-16089063
 ] 

Hive QA commented on HIVE-17091:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12877428/HIVE-17091.1.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 14 failed/errored test(s), 11066 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestBeeLineDriver.testCliDriver[insert_overwrite_local_directory_1]
 (batchId=238)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[llap_smb] 
(batchId=143)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_dynamic_partition_pruning]
 (batchId=167)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_dynamic_partition_pruning_2]
 (batchId=169)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_explainuser_1]
 (batchId=168)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_use_op_stats]
 (batchId=167)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_use_ts_stats_for_mapjoin]
 (batchId=168)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_vectorized_dynamic_partition_pruning]
 (batchId=167)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query14] 
(batchId=233)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query23] 
(batchId=233)
org.apache.hadoop.hive.llap.TestLlapOutputFormat.testValues (batchId=249)
org.apache.hive.hcatalog.api.TestHCatClient.testPartitionRegistrationWithCustomSchema
 (batchId=178)
org.apache.hive.hcatalog.api.TestHCatClient.testPartitionSpecRegistrationWithCustomSchema
 (batchId=178)
org.apache.hive.hcatalog.api.TestHCatClient.testTableSchemaPropagation 
(batchId=178)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/6051/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/6051/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-6051/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 14 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12877428 - PreCommit-HIVE-Build

> "Timed out getting readerEvents" error from external LLAP client
> 
>
> Key: HIVE-17091
> URL: https://issues.apache.org/jira/browse/HIVE-17091
> Project: Hive
>  Issue Type: Sub-task
>  Components: llap
>Reporter: Jason Dere
>Assignee: Jason Dere
> Attachments: HIVE-17091.1.patch
>
>
> {noformat}
> Caused by: java.io.IOException: Timed out getting readerEvents
>   at 
> org.apache.hadoop.hive.llap.LlapBaseRecordReader.getReaderEvent(LlapBaseRecordReader.java:261)
>   at 
> org.apache.hadoop.hive.llap.LlapBaseRecordReader.next(LlapBaseRecordReader.java:148)
>   at 
> org.apache.hadoop.hive.llap.LlapBaseRecordReader.next(LlapBaseRecordReader.java:48)
>   at 
> org.apache.hadoop.hive.llap.LlapRowRecordReader.next(LlapRowRecordReader.java:121)
>   at 
> org.apache.hadoop.hive.llap.LlapRowRecordReader.next(LlapRowRecordReader.java:68)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17091) "Timed out getting readerEvents" error from external LLAP client

2017-07-14 Thread Jason Dere (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16088356#comment-16088356
 ] 

Jason Dere commented on HIVE-17091:
---

Patch for the issues mentioned above; it also adds the fragment ID to the 
logging to make things easier to track.
[~sseth] [~prasanth_j] can you review?

> "Timed out getting readerEvents" error from external LLAP client
> 
>
> Key: HIVE-17091
> URL: https://issues.apache.org/jira/browse/HIVE-17091
> Project: Hive
>  Issue Type: Sub-task
>  Components: llap
>Reporter: Jason Dere
>Assignee: Jason Dere
> Attachments: HIVE-17091.1.patch
>
>
> {noformat}
> Caused by: java.io.IOException: Timed out getting readerEvents
>   at 
> org.apache.hadoop.hive.llap.LlapBaseRecordReader.getReaderEvent(LlapBaseRecordReader.java:261)
>   at 
> org.apache.hadoop.hive.llap.LlapBaseRecordReader.next(LlapBaseRecordReader.java:148)
>   at 
> org.apache.hadoop.hive.llap.LlapBaseRecordReader.next(LlapBaseRecordReader.java:48)
>   at 
> org.apache.hadoop.hive.llap.LlapRowRecordReader.next(LlapRowRecordReader.java:121)
>   at 
> org.apache.hadoop.hive.llap.LlapRowRecordReader.next(LlapRowRecordReader.java:68)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17091) "Timed out getting readerEvents" error from external LLAP client

2017-07-13 Thread Jason Dere (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16086699#comment-16086699
 ] 

Jason Dere commented on HIVE-17091:
---

Looks like a couple of different issues are at play here:
1) On the LLAP daemon, the executor finished but is somehow still stuck 
waiting for the ChannelOutputStream to finish all writes (even though all of 
the data was already received by the client). This might be related to the 
pendingWrites/writeMonitor logic used by the ChannelOutputStream to manage the 
number of outstanding writes for an external fragment request. I've tried 
replacing this mechanism with a Semaphore, and so far I haven't seen this 
issue recur (a rough sketch of that approach follows the thread dump below).
{noformat}
Thread 1802 (TezTR-683826_93_0_0_29_0):
  State: WAITING
  Blocked count: 456
  Waited count: 458
  Waiting on java.lang.Object@7e3b8b1
  Stack:
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:502)

org.apache.hadoop.hive.llap.ChannelOutputStream.waitForWritesToFinish(ChannelOutputStream.java:153)

org.apache.hadoop.hive.llap.ChannelOutputStream.close(ChannelOutputStream.java:136)
java.io.FilterOutputStream.close(FilterOutputStream.java:159)

org.apache.hadoop.hive.llap.io.ChunkedOutputStream.close(ChunkedOutputStream.java:81)
java.io.FilterOutputStream.close(FilterOutputStream.java:159)
org.apache.hadoop.hive.llap.LlapRecordWriter.close(LlapRecordWriter.java:47)

org.apache.hadoop.hive.ql.io.HivePassThroughRecordWriter.close(HivePassThroughRecordWriter.java:46)

org.apache.hadoop.hive.ql.exec.FileSinkOperator$FSPaths.closeWriters(FileSinkOperator.java:190)

org.apache.hadoop.hive.ql.exec.FileSinkOperator.closeOp(FileSinkOperator.java:1039)
org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:697)
org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:711)
org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:711)
org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:711)
org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:711)

org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.close(MapRecordProcessor.java:464)

org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:206)
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:172)

org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:370)
{noformat}
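
A minimal sketch of the Semaphore-based approach, assuming the writer only 
needs to bound the number of in-flight async writes and wait for them all on 
close; this is not the actual ChannelOutputStream patch, and the async write 
is modeled with CompletableFuture rather than the real Netty channel:
{noformat}
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;

// Hypothetical sketch -- not the actual ChannelOutputStream code.
public class BoundedAsyncWriter {

  private final int maxPendingWrites;
  private final Semaphore pendingWrites;

  public BoundedAsyncWriter(int maxPendingWrites) {
    this.maxPendingWrites = maxPendingWrites;
    this.pendingWrites = new Semaphore(maxPendingWrites);
  }

  // Blocks when maxPendingWrites writes are already in flight; each write's
  // completion callback releases a permit.
  public void write(byte[] buf) throws InterruptedException {
    pendingWrites.acquire();
    asyncWrite(buf).whenComplete((ok, err) -> pendingWrites.release());
  }

  // Reacquiring every permit means all in-flight writes have completed.
  public void close() throws InterruptedException {
    pendingWrites.acquire(maxPendingWrites);
    pendingWrites.release(maxPendingWrites);
  }

  // Stand-in for the real channel write; completes on another thread.
  private CompletableFuture<Void> asyncWrite(byte[] buf) {
    return CompletableFuture.runAsync(() -> { /* send buf over the wire */ });
  }
}
{noformat}
One plausible reason wait/notify-style counting can hang like the dump above 
is a notification fired before the waiter is actually waiting; a Semaphore 
permit cannot be lost that way, which may be why the hang stopped reproducing 
with this change.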

2) The LLAP client received the end of the data stream and is expecting a 
heartbeat with a task complete notification:
{noformat}
05:47:44,060 DEBUG 
org.apache.hadoop.hive.llap.ext.LlapTaskUmbilicalExternalClient: Heartbeat from 
attempt_7085310350540683826_0089_0_00_33_0 events: 1
2017-06-29 05:47:44,060 DEBUG 
org.apache.hadoop.hive.llap.ext.LlapTaskUmbilicalExternalClient: Task update 
event for attempt_7085310350540683826_0089_0_00_33_0
2017-06-29 05:47:44,065 DEBUG 
org.apache.hadoop.hive.llap.io.ChunkedInputStream: 
LlapTaskUmbilicalExternalClient(attempt_7085310350540683826_0089_0_00_33_0):
 Chunk size 131072
2017-06-29 05:47:44,081 DEBUG 
org.apache.hadoop.hive.llap.io.ChunkedInputStream: 
LlapTaskUmbilicalExternalClient(attempt_7085310350540683826_0089_0_00_33_0):
 Chunk size 131072
2017-06-29 05:47:44,097 DEBUG 
org.apache.hadoop.hive.llap.io.ChunkedInputStream: 
LlapTaskUmbilicalExternalClient(attempt_7085310350540683826_0089_0_00_33_0):
 Chunk size 131072
2017-06-29 05:47:44,119 DEBUG 
org.apache.hadoop.hive.llap.io.ChunkedInputStream: 
LlapTaskUmbilicalExternalClient(attempt_7085310350540683826_0089_0_00_33_0):
 Chunk size 30244
2017-06-29 05:47:44,123 DEBUG 
org.apache.hadoop.hive.llap.io.ChunkedInputStream: 
LlapTaskUmbilicalExternalClient(attempt_7085310350540683826_0089_0_00_33_0):
 Chunk size 0
2017-06-29 05:47:44,123 DEBUG 
org.apache.hadoop.hive.llap.io.ChunkedInputStream: 
LlapTaskUmbilicalExternalClient(attempt_7085310350540683826_0089_0_00_33_0):
 Hit end of data
2017-06-29 05:47:44,123 INFO org.apache.hadoop.hive.llap.LlapBaseRecordReader: 
1498729664123 Waiting for reader event for 
LlapTaskUmbilicalExternalClient(attempt_7085310350540683826_0089_0_00_33_0)
{noformat}

Due to issue 1, the task-completed event never arrives at the client, though 
the client continues to receive heartbeats from LLAP. Eventually (after 30 
seconds), the external client times out waiting for the task-completed event. 
My guess is that on the client side we shouldn't be timing out while waiting 
for the task-completed event as long as we are still receiving heartbeats. 
I'll try setting it to an indefinite wait; a rough sketch of the idea is 
below, followed by the heartbeat log.
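
A hypothetical sketch of that client-side idea (not the actual 
LlapTaskUmbilicalExternalClient code): instead of a hard 30-second timeout on 
the task-completed event, keep waiting as long as heartbeats keep arriving, 
and only give up once the heartbeats themselves have gone silent.
{noformat}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch -- not the actual external client code.
public class TaskCompletionWaiter {

  private final BlockingQueue<String> completionEvents =
      new LinkedBlockingQueue<>();
  private volatile long lastHeartbeatMillis = System.currentTimeMillis();

  // Invoked by the heartbeat handler.
  public void onHeartbeat() {
    lastHeartbeatMillis = System.currentTimeMillis();
  }

  // Invoked when the task-completed notification arrives.
  public void onTaskComplete(String attemptId) {
    completionEvents.add(attemptId);
  }

  // Poll in short intervals; time out only if no heartbeat has been seen for
  // heartbeatTimeoutMillis, not simply because the completion event is late.
  public String awaitCompletion(long heartbeatTimeoutMillis)
      throws InterruptedException {
    while (true) {
      String attemptId = completionEvents.poll(1, TimeUnit.SECONDS);
      if (attemptId != null) {
        return attemptId;
      }
      if (System.currentTimeMillis() - lastHeartbeatMillis > heartbeatTimeoutMillis) {
        throw new IllegalStateException(
            "No heartbeat for " + heartbeatTimeoutMillis + " ms; giving up");
      }
    }
  }
}
{noformat}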
{noformat}
2017-06-29 05:48:14,106 DEBUG 
org.apache.hadoop.hive.llap.ext.LlapTaskUmbilicalExternalClient: Received 
heartbeat from container, request={  
containerId=container_7085310350540683826_0089_00_33, requestId=487, 
startIndex=0, pre