[jira] [Commented] (TEZ-3479) DAG AM does not schedule any more containers in corner cases

2016-10-27 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15612706#comment-15612706
 ] 

Hitesh Shah commented on TEZ-3479:
--

cc [~harishjp], as this is related to recovery.

> DAG AM does not schedule any more containers in corner cases
> 
>
> Key: TEZ-3479
> URL: https://issues.apache.org/jira/browse/TEZ-3479
> Project: Apache Tez
> Issue Type: Improvement
> Affects Versions: 0.7.1
>Reporter: Rajesh Balamohan
> Attachments: application_1476667862449_0031_not_complete.1.log.tar.gz
>
>
> Env: 3-node AWS cluster with data residing in S3. Tez version is 0.7.
> Some workloads generate so much data that tasks start throwing 
> "No space available" on local disks (e.g., Q29 in TPC-DS). The DAG should fail 
> after enough retries, which happens most of the time. Once in a while (roughly 
> once in 20-30 runs), the DAG AM gets into a hung state and does not schedule any 
> more containers for the failed task attempts. Will attach the logs shortly. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-3479) DAG AM does not schedule any more containers in corner cases

2016-10-18 Thread Rajesh Balamohan (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15587501#comment-15587501
 ] 

Rajesh Balamohan commented on TEZ-3479:
---

That is correct. Haven't observed this in other cases.



[jira] [Commented] (TEZ-3479) DAG AM does not schedule any more containers in corner cases

2016-10-18 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15587378#comment-15587378
 ] 

Hitesh Shah commented on TEZ-3479:
--

bq. I haven't disabled recovery in my runs.

To clarify, my question was whether this reproduces only in the cases where the 
AM crashes and restarts? 



[jira] [Commented] (TEZ-3479) DAG AM does not schedule any more containers in corner cases

2016-10-18 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15587023#comment-15587023
 ] 

Hitesh Shah commented on TEZ-3479:
--

At least for this scenario, I think we did not properly recover 
task_1476667862449_0031_1_07_04 to a failed state, which leads to a hang 
because the vertex cannot complete.

{code}
2016-10-18 07:06:24,837 [INFO] [Dispatcher thread {Central}] |impl.VertexImpl|: 
Task Completion: vertex_1476667862449_0031_1_07 [Map 3], tasks=29, failed=1, 
killed=24, success=3, completed=28, commits=0, err=OWN_TASK_FAILURE 
{code}
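
A minimal sketch (Python, not part of the original comment) of the arithmetic behind the log line above. It pulls the integer counters out of the line and shows that one of the 29 tasks is not accounted for, which is consistent with a vertex that cannot complete. The helper names parse_counters and unaccounted_tasks are hypothetical, and treating "completed" as the number of tasks in a terminal state is an assumption inferred from this single line, not confirmed Tez semantics.

```python
import re

# Log line quoted in the comment above, joined onto one line.
LINE = (
    "2016-10-18 07:06:24,837 [INFO] [Dispatcher thread {Central}] |impl.VertexImpl|: "
    "Task Completion: vertex_1476667862449_0031_1_07 [Map 3], tasks=29, failed=1, "
    "killed=24, success=3, completed=28, commits=0, err=OWN_TASK_FAILURE"
)


def parse_counters(line):
    """Return every integer key=value pair in the line (hypothetical helper)."""
    return {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", line)}


def unaccounted_tasks(counters):
    """Tasks not counted as completed (assumes 'completed' means terminal)."""
    return counters["tasks"] - counters["completed"]


counters = parse_counters(LINE)
terminal = counters["failed"] + counters["killed"] + counters["success"]

# prints "terminal: 28 unaccounted: 1"
print("terminal:", terminal, "unaccounted:", unaccounted_tasks(counters))
```

In this line failed + killed + success equals completed (28), so exactly one task out of 29 has no terminal state recorded.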




[jira] [Commented] (TEZ-3479) DAG AM does not schedule any more containers in corner cases

2016-10-18 Thread Rajesh Balamohan (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15586998#comment-15586998
 ] 

Rajesh Balamohan commented on TEZ-3479:
---

[~hitesh] - I haven't disabled recovery in my runs. Will check that.



[jira] [Commented] (TEZ-3479) DAG AM does not schedule any more containers in corner cases

2016-10-18 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15586992#comment-15586992
 ] 

Hitesh Shah commented on TEZ-3479:
--

[~rajesh.balamohan] Is this happening only in the cases where the AM crashes 
and tries to recover? 
