[jira] [Commented] (TEZ-4349) DAGClient gets stuck with invalid cached DAGStatus

2022-01-06 Thread Ahmed Hussein (Jira)


[ 
https://issues.apache.org/jira/browse/TEZ-4349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17470140#comment-17470140
 ] 

Ahmed Hussein commented on TEZ-4349:


Thanks [~abstractdog] for your feedback and for committing the changes!

> DAGClient gets stuck with invalid cached DAGStatus
> --
>
> Key: TEZ-4349
> URL: https://issues.apache.org/jira/browse/TEZ-4349
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Major
> Fix For: 0.10.2
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> I found that some Oozie launchers get stuck waiting for the job to complete.
> After investigating, I found that {{dagClient.getDAGStatus(null)}} calls the 
> overload {{dagClient.getDAGStatus(null, 0)}}, which then calls 
> {{getDAGStatusInternal}}, which makes use of the {{cachedDagStatus}} field.
> The {{cachedDagStatus}} is never updated, causing the launcher to wait 
> indefinitely.
>  
> [https://github.com/apache/tez/blob/master/tez-api/src/main/java/org/apache/tez/dag/api/client/DAGClientImpl.java#L212]
> {code:java}
>   if (!dagCompleted) {
>     if (dagStatus != null) {
>       cachedDagStatus = dagStatus;
>       return dagStatus;
>     }
>     if (cachedDagStatus != null) {
>       // could not get from AM (not reachable/ was killed). return cached status.
>       return cachedDagStatus;
>     }
>   }
> {code}
> +To Fix:+
>  The {{cachedDagStatus}} should be valid only for a certain amount of time or 
> a certain number of retries.
> When the {{cachedDagStatus}} expires, the DAGClient tries to pull the status 
> from the AM or the RM.
> If fetching the status fails from both the AM and the RM, null is returned to 
> the caller.

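The sketch below shows one possible form of the proposed expiry rule. It assumes a hypothetical {{ExpiringStatusCache}} class and works as follows:
* The AM is tried first.
* The last known status is served only while it is younger than a time window and has been reused fewer than a fixed number of times.
* Once it expires, the RM is tried before null is returned.

The class, the suppliers standing in for the AM and RM calls, and the limits are illustrative assumptions. They are not the {{DAGClientImpl}} API, and the generic {{T}} stands in for {{DAGStatus}}.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical sketch (not DAGClientImpl): a last-known status that may be served
// only within a time window and for a bounded number of consecutive fallbacks.
final class ExpiringStatusCache<T> {
  private final long ttlMillis;      // assumed validity window
  private final int maxFallbacks;    // assumed retry budget
  private final LongSupplier clock;  // injectable clock, e.g. System::currentTimeMillis
  private T cached;
  private long cachedAtMillis;
  private int fallbacksServed;

  ExpiringStatusCache(long ttlMillis, int maxFallbacks, LongSupplier clock) {
    this.ttlMillis = ttlMillis;
    this.maxFallbacks = maxFallbacks;
    this.clock = clock;
  }

  /** Returns a fresh AM status, a still-valid cached one, an RM status, or null. */
  T getStatus(Supplier<T> fromAm, Supplier<T> fromRm) {
    T amStatus = fromAm.get();  // null models "AM not reachable / was killed"
    if (amStatus != null) {
      return remember(amStatus);
    }
    boolean cacheValid = cached != null
        && clock.getAsLong() - cachedAtMillis <= ttlMillis
        && fallbacksServed < maxFallbacks;
    if (cacheValid) {
      fallbacksServed++;
      return cached;
    }
    cached = null;              // expired: never serve it again
    T rmStatus = fromRm.get();  // second source before giving up
    return rmStatus == null ? null : remember(rmStatus);
  }

  private T remember(T status) {
    cached = status;
    cachedAtMillis = clock.getAsLong();
    fallbacksServed = 0;
    return status;
  }
}
```

Every successful fetch resets the fallback counter. As a result, a briefly unreachable AM is tolerated, but a dead AM can only be masked for a bounded time and number of polls. The caller therefore eventually sees null instead of waiting forever.
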


--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (TEZ-4349) DAGClient gets stuck with invalid cached DAGStatus

2021-11-13 Thread Ahmed Hussein (Jira)
Ahmed Hussein created TEZ-4349:
--

 Summary: DAGClient gets stuck with invalid cached DAGStatus
 Key: TEZ-4349
 URL: https://issues.apache.org/jira/browse/TEZ-4349
 Project: Apache Tez
  Issue Type: Bug
Reporter: Ahmed Hussein
Assignee: Ahmed Hussein


I found that some Oozie launchers get stuck waiting for the job to complete.
After investigating, I found that {{dagClient.getDAGStatus(null)}} calls the 
overload {{dagClient.getDAGStatus(null, 0)}}, which then calls 
{{getDAGStatusInternal}}, which makes use of the {{cachedDagStatus}} field.

The {{cachedDagStatus}} is never updated, causing the launcher to wait indefinitely.
 
[https://github.com/apache/tez/blob/master/tez-api/src/main/java/org/apache/tez/dag/api/client/DAGClientImpl.java#L212]
{code:java}
  if (!dagCompleted) {
    if (dagStatus != null) {
      cachedDagStatus = dagStatus;
      return dagStatus;
    }
    if (cachedDagStatus != null) {
      // could not get from AM (not reachable/ was killed). return cached status.
      return cachedDagStatus;
    }
  }
{code}
+To Fix:+
 The {{cachedDagStatus}} should be valid only for a certain amount of time or 
a certain number of retries.

When the {{cachedDagStatus}} expires, the DAGClient tries to pull the status from the AM or the RM.
If fetching the status fails from both the AM and the RM, null is returned to the 
caller.





[jira] [Updated] (TEZ-4252) Hoping for tuning of data-skew scenarios when Tez is used as the compute engine

2020-11-23 Thread Ahmed Hussein (Jira)


 [ 
https://issues.apache.org/jira/browse/TEZ-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmed Hussein updated TEZ-4252:
---
Release Note: 
Please add a clear description of what the issue is about. It is not 
recommended that Jiras have empty descriptions. Also, can you please change the 
title and use a translation so that the Jira will be searchable?


  was:
[~yang1] Can you please add a clear description of what the issue is about? It 
is not recommended that Jiras have empty descriptions. Also, can you please 
change the title and use a translation so that the Jira will be searchable?



> Hoping for tuning of data-skew scenarios when Tez is used as the compute engine
> -
>
> Key: TEZ-4252
> URL: https://issues.apache.org/jira/browse/TEZ-4252
> Project: Apache Tez
>  Issue Type: Wish
>Reporter: yang
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (TEZ-4252) Hoping for tuning of data-skew scenarios when Tez is used as the compute engine

2020-11-23 Thread Ahmed Hussein (Jira)


 [ 
https://issues.apache.org/jira/browse/TEZ-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmed Hussein updated TEZ-4252:
---
Release Note: 
[~yang1] Can you please add a clear description of what the issue is about? It 
is not recommended that Jiras have empty descriptions. Also, can you please 
change the title and use a translation so that the Jira will be searchable?


  was:
[~yang1] Can you please add a clear description of what the issue is about? It 
is not recommended that Jiras have empty descriptions.
Also, can you please change the title and use a translation so that the Jira 
will be searchable?



> Hoping for tuning of data-skew scenarios when Tez is used as the compute engine
> -
>
> Key: TEZ-4252
> URL: https://issues.apache.org/jira/browse/TEZ-4252
> Project: Apache Tez
>  Issue Type: Wish
>Reporter: yang
>Priority: Minor
>






[jira] [Resolved] (TEZ-4252) Hoping for tuning of data-skew scenarios when Tez is used as the compute engine

2020-11-23 Thread Ahmed Hussein (Jira)


 [ 
https://issues.apache.org/jira/browse/TEZ-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmed Hussein resolved TEZ-4252.

Release Note: 
[~yang1] Can you please add a clear description of what the issue is about? It 
is not recommended that Jiras have empty descriptions.
Also, can you please change the title and use a translation so that the Jira 
will be searchable?

  Resolution: Invalid

> Hoping for tuning of data-skew scenarios when Tez is used as the compute engine
> -
>
> Key: TEZ-4252
> URL: https://issues.apache.org/jira/browse/TEZ-4252
> Project: Apache Tez
>  Issue Type: Wish
>Reporter: yang
>Priority: Minor
>






[jira] [Comment Edited] (TEZ-4119) TestSpeculation is flaky

2020-03-02 Thread Ahmed Hussein (Jira)


[ 
https://issues.apache.org/jira/browse/TEZ-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17049306#comment-17049306
 ] 

Ahmed Hussein edited comment on TEZ-4119 at 3/2/20 3:18 PM:


Thanks [~abstractdog]. This was very helpful.

I had a look at the test case. The main problem with that test case is that it 
was designed without taking into consideration that the speculator can run as a 
background service.
Once I changed the implementation to make the speculator run in parallel, the 
test case became flaky.
It will take me some time to reimplement the JUnit test according to the new 
speculator design.


was (Author: ahussein):
Thanks [~abstractdog]. This was very helpful.

I had a look at the test case. The main problem with that test case is that it 
was designed without taking into consideration that the speculator can run as a 
background service.

> TestSpeculation is flaky
> 
>
> Key: TEZ-4119
> URL: https://issues.apache.org/jira/browse/TEZ-4119
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: László Bodor
>Assignee: Ahmed Hussein
>Priority: Major
> Attachments: jstack.log, jstack4.log, jstack6.log, 
> org.apache.tez.dag.app.TestSpeculation-output.txt
>
>






[jira] [Commented] (TEZ-4119) TestSpeculation is flaky

2020-03-02 Thread Ahmed Hussein (Jira)


[ 
https://issues.apache.org/jira/browse/TEZ-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17049306#comment-17049306
 ] 

Ahmed Hussein commented on TEZ-4119:


Thanks [~abstractdog]. This was very helpful.

I had a look at the test case. The main problem with that test case is that it 
was designed without taking into consideration that the speculator can run as a 
background service.

> TestSpeculation is flaky
> 
>
> Key: TEZ-4119
> URL: https://issues.apache.org/jira/browse/TEZ-4119
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: László Bodor
>Assignee: Ahmed Hussein
>Priority: Major
> Attachments: jstack.log, jstack4.log, jstack6.log, 
> org.apache.tez.dag.app.TestSpeculation-output.txt
>
>






[jira] [Commented] (TEZ-4119) TestSpeculation is flaky

2020-02-26 Thread Ahmed Hussein (Jira)


[ 
https://issues.apache.org/jira/browse/TEZ-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17045614#comment-17045614
 ] 

Ahmed Hussein commented on TEZ-4119:


Hey [~abstractdog], do you still see this error?

> TestSpeculation is flaky
> 
>
> Key: TEZ-4119
> URL: https://issues.apache.org/jira/browse/TEZ-4119
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: László Bodor
>Assignee: Ahmed Hussein
>Priority: Major
> Attachments: jstack.log, jstack4.log, jstack6.log, 
> org.apache.tez.dag.app.TestSpeculation-output.txt
>
>






[jira] [Updated] (TEZ-4106) Add Exponential Smooth RuntimeEstimator to the speculator

2020-02-04 Thread Ahmed Hussein (Jira)


 [ 
https://issues.apache.org/jira/browse/TEZ-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmed Hussein updated TEZ-4106:
---
Attachment: TEZ-4106.006.patch

> Add Exponential Smooth RuntimeEstimator to the speculator
> -
>
> Key: TEZ-4106
> URL: https://issues.apache.org/jira/browse/TEZ-4106
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Major
> Attachments: TEZ-4106.001.patch, TEZ-4106.002.patch, 
> TEZ-4106.003.patch, TEZ-4106.004.patch, TEZ-4106.005.patch, TEZ-4106.006.patch
>
>
> Tez speculator implements a start-end runtime estimator. Similar to 
> [MAPREDUCE-7208|https://issues.apache.org/jira/browse/MAPREDUCE-7208], we 
> need to implement an adaptive estimator based on exponential smoothing.

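The sketch below illustrates the general technique: simple exponential smoothing of a task's observed progress rate, used to estimate its remaining runtime. The class name {{SmoothedRuntimeEstimator}} and the fixed smoothing factor {{alpha}} are illustrative. So is the assumption that progress arrives as a fraction in [0, 1] with a millisecond timestamp. This is not the estimator from MAPREDUCE-7208 or from the attached patches.

```java
// Hypothetical sketch: estimate a task's remaining runtime by exponentially
// smoothing its observed progress rate (progress fraction per millisecond).
final class SmoothedRuntimeEstimator {
  private final double alpha;  // smoothing factor in (0, 1]; higher reacts faster
  private double smoothedRate = Double.NaN;
  private double lastProgress;
  private long lastTimeMillis;
  private boolean started;

  SmoothedRuntimeEstimator(double alpha) {
    if (alpha <= 0 || alpha > 1) {
      throw new IllegalArgumentException("alpha must be in (0, 1]");
    }
    this.alpha = alpha;
  }

  /** Records a progress report (fraction in [0, 1]) observed at timeMillis. */
  void update(double progress, long timeMillis) {
    if (!started) {
      started = true;  // first report only establishes the baseline
    } else if (timeMillis > lastTimeMillis) {
      double rate = (progress - lastProgress) / (timeMillis - lastTimeMillis);
      // S_t = alpha * x_t + (1 - alpha) * S_{t-1}, seeded with the first rate
      smoothedRate = Double.isNaN(smoothedRate)
          ? rate
          : alpha * rate + (1 - alpha) * smoothedRate;
    }
    lastProgress = progress;
    lastTimeMillis = timeMillis;
  }

  /** Estimated milliseconds until progress reaches 1.0, or -1 if unknown. */
  long estimatedRemainingMillis() {
    if (Double.isNaN(smoothedRate) || smoothedRate <= 0) {
      return -1;
    }
    return Math.round((1.0 - lastProgress) / smoothedRate);
  }
}
```

A start-end estimator effectively uses the average rate since the task started. The smoothed rate instead weights recent reports more heavily, so a task that slows down mid-run should get a longer estimate sooner.
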




[jira] [Commented] (TEZ-4119) TestSpeculation is flaky

2020-01-31 Thread Ahmed Hussein (Jira)


[ 
https://issues.apache.org/jira/browse/TEZ-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17027833#comment-17027833
 ] 

Ahmed Hussein commented on TEZ-4119:


Thanks [~abstractdog], I will take a look.

> TestSpeculation is flaky
> 
>
> Key: TEZ-4119
> URL: https://issues.apache.org/jira/browse/TEZ-4119
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: László Bodor
>Assignee: László Bodor
>Priority: Major
> Attachments: jstack.log, jstack4.log, jstack6.log, 
> org.apache.tez.dag.app.TestSpeculation-output.txt
>
>






[jira] [Assigned] (TEZ-4119) TestSpeculation is flaky

2020-01-31 Thread Ahmed Hussein (Jira)


 [ 
https://issues.apache.org/jira/browse/TEZ-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmed Hussein reassigned TEZ-4119:
--

Assignee: Ahmed Hussein  (was: László Bodor)

> TestSpeculation is flaky
> 
>
> Key: TEZ-4119
> URL: https://issues.apache.org/jira/browse/TEZ-4119
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: László Bodor
>Assignee: Ahmed Hussein
>Priority: Major
> Attachments: jstack.log, jstack4.log, jstack6.log, 
> org.apache.tez.dag.app.TestSpeculation-output.txt
>
>






[jira] [Updated] (TEZ-4106) Add Exponential Smooth RuntimeEstimator to the speculator

2020-01-31 Thread Ahmed Hussein (Jira)


 [ 
https://issues.apache.org/jira/browse/TEZ-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmed Hussein updated TEZ-4106:
---
Attachment: TEZ-4106.005.patch

> Add Exponential Smooth RuntimeEstimator to the speculator
> -
>
> Key: TEZ-4106
> URL: https://issues.apache.org/jira/browse/TEZ-4106
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Major
> Attachments: TEZ-4106.001.patch, TEZ-4106.002.patch, 
> TEZ-4106.003.patch, TEZ-4106.004.patch, TEZ-4106.005.patch
>
>
> Tez speculator implements a start-end runtime estimator. Similar to 
> [MAPREDUCE-7208|https://issues.apache.org/jira/browse/MAPREDUCE-7208], we 
> need to implement an adaptive estimator based on exponential smoothing.





[jira] [Commented] (TEZ-4106) Add Exponential Smooth RuntimeEstimator to the speculator

2020-01-31 Thread Ahmed Hussein (Jira)


[ 
https://issues.apache.org/jira/browse/TEZ-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17027822#comment-17027822
 ] 

Ahmed Hussein commented on TEZ-4106:


Thanks [~jeagles] for the feedback.
{quote}Let's clean up the TezConfiguration names if possible. Does it make 
sense to put them under a top-level tez.am.speculation name? Right now there are 
speculator, speculative, and speculation names, so it may not be possible to match 
old configurations perfectly.
{quote}
Sure thing.
{quote}Also, TEZ-4119 has been filed to address the flaky tests in 
TestSpeculation. Do we need to change the patch to account for this?
{quote}
I will address TEZ-4119 separately without changing the current patch.

> Add Exponential Smooth RuntimeEstimator to the speculator
> -
>
> Key: TEZ-4106
> URL: https://issues.apache.org/jira/browse/TEZ-4106
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Major
> Attachments: TEZ-4106.001.patch, TEZ-4106.002.patch, 
> TEZ-4106.003.patch, TEZ-4106.004.patch
>
>
> Tez speculator implements a start-end runtime estimator. Similar to 
> [MAPREDUCE-7208|https://issues.apache.org/jira/browse/MAPREDUCE-7208], we 
> need to implement an adaptive estimator based on exponential smoothing.





[jira] [Commented] (TEZ-4119) TestSpeculation is flaky

2020-01-31 Thread Ahmed Hussein (Jira)


[ 
https://issues.apache.org/jira/browse/TEZ-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17027808#comment-17027808
 ] 

Ahmed Hussein commented on TEZ-4119:


Hi [~abstractdog], thanks for reporting this issue.

I recently worked on a similar flaky test case in MAPREDUCE-7259 
(testSpeculateSuccessfulWithUpdateEvents fails Intermittently). I agree with 
you that this could be caused by timing issues that make the blocking thread 
miss the speculator thread.

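The sketch below shows the kind of race described above and one common remedy. It assumes a hypothetical {{BackgroundSpeculatorStub}} that processes events on its own thread. Rather than sleeping for a fixed time and hoping the background thread has already run, the test registers a {{CountDownLatch}} before sending events. It then waits on that latch with a timeout, and the background thread counts it down after handling each event. The stub is illustrative and is not a Tez class.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a speculator that runs as a background service.
final class BackgroundSpeculatorStub {
  private final BlockingQueue<String> events = new LinkedBlockingQueue<>();
  private final AtomicInteger processed = new AtomicInteger();
  private volatile CountDownLatch processedLatch = new CountDownLatch(0);
  private final Thread worker = new Thread(this::run, "speculator");

  void start() {
    worker.setDaemon(true);
    worker.start();
  }

  /** Test hook: call before sending events; released after n more events are handled. */
  CountDownLatch expectEvents(int n) {
    processedLatch = new CountDownLatch(n);
    return processedLatch;
  }

  void handle(String event) {
    events.add(event);
  }

  int processedCount() {
    return processed.get();
  }

  void stop() {
    worker.interrupt();
  }

  private void run() {
    try {
      while (!Thread.currentThread().isInterrupted()) {
        events.take();               // block until an event arrives
        processed.incrementAndGet();
        processedLatch.countDown();  // signal the waiting test thread
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();  // exit quietly on stop()
    }
  }
}
```

The timeout on {{await}} keeps a genuine bug from hanging the build. The latch removes the dependency on how quickly the background thread happens to be scheduled.
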
Have you been able to make any progress on that? If not, let me know if you 
want me to take over.

 

 

> TestSpeculation is flaky
> 
>
> Key: TEZ-4119
> URL: https://issues.apache.org/jira/browse/TEZ-4119
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: László Bodor
>Assignee: László Bodor
>Priority: Major
> Attachments: jstack.log, jstack4.log, jstack6.log, 
> org.apache.tez.dag.app.TestSpeculation-output.txt
>
>






[jira] [Updated] (TEZ-3391) Optimize single split MR split reader

2020-01-22 Thread Ahmed Hussein (Jira)


 [ 
https://issues.apache.org/jira/browse/TEZ-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmed Hussein updated TEZ-3391:
---
Description: 
During initialization, each task creates an array of objects 
{{TaskSplitMetaInfo[]}}. This represents unnecessary space and time overhead, 
as each task needs only its corresponding split object. Besides having 
{{n^2}} space complexity, the current implementation leaks the input stream.

We need to optimize that implementation by returning only a single object 
instead of an entire array. 

[~rohini] suggested the following:
{quote}
In the vertex, construct TaskSplitMetaInfo only for the split of that task 
instead of constructing it for all splits, i.e. change 
{{public static TaskSplitMetaInfo[] readSplitMetaInfo(Configuration conf, FileSystem fs)}} 
to {{public static TaskSplitMetaInfo getSplitMetaInfo(Configuration conf, FileSystem fs, int index)}} 
and skip reading splits below the index. If there are 1000 splits, the first task will 
read 1 split, the second task will read 2 splits, and so on, instead of each task 
reading all 1000 splits as is happening now. 
{quote}
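
The following quick worked check covers the numbers in the suggestion. It assumes n splits and that task i (1-based) reads split records sequentially up to and including its own index.

```latex
\begin{aligned}
\text{records read, current:}\quad & \sum_{i=1}^{n} n = n^2 = 1000^2 = 1{,}000{,}000 \\
\text{records read, proposed:}\quad & \sum_{i=1}^{n} i = \frac{n(n+1)}{2} = \frac{1000 \cdot 1001}{2} = 500{,}500 \\
\text{objects retained, all tasks:}\quad & n \cdot n = n^2 \;\longrightarrow\; n \cdot 1 = n
\end{aligned}
```

Total reads are roughly halved but remain quadratic. The objects retained across all tasks drop from quadratic to linear.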

  was:
We had a case where the split metadata size exceeded 1000. Instead of the job 
failing validation during initialization in the AM, as in MapReduce, each of the 
tasks failed that validation during its own initialization.

  

Summary: Optimize single split MR split reader  (was: MR split file 
validation should be done in the AM)

> Optimize single split MR split reader
> -
>
> Key: TEZ-3391
> URL: https://issues.apache.org/jira/browse/TEZ-3391
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Rohini Palaniswamy
>Assignee: Ahmed Hussein
>Priority: Major
> Attachments: TEZ-3391.001.patch, TEZ-3391.002.patch
>
>
> During initialization, each task creates an array of objects 
> {{TaskSplitMetaInfo[]}}. This represents unnecessary space and time overhead, 
> as each task needs only its corresponding split object. Besides having 
> {{n^2}} space complexity, the current implementation leaks the input stream.
> We need to optimize that implementation by returning only a single object 
> instead of an entire array. 
> [~rohini] suggested the following:
> {quote}
> In the vertex, construct TaskSplitMetaInfo only for the split of that task 
> instead of constructing it for all splits, i.e. change 
> {{public static TaskSplitMetaInfo[] readSplitMetaInfo(Configuration conf, FileSystem fs)}} 
> to {{public static TaskSplitMetaInfo getSplitMetaInfo(Configuration conf, FileSystem fs, int index)}} 
> and skip reading splits below the index. If there are 1000 splits, the first task 
> will read 1 split, the second task will read 2 splits, and so on, instead of each 
> task reading all 1000 splits as is happening now. 
> {quote}





[jira] [Comment Edited] (TEZ-3391) MR split file validation should be done in the AM

2020-01-22 Thread Ahmed Hussein (Jira)


[ 
https://issues.apache.org/jira/browse/TEZ-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17021119#comment-17021119
 ] 

Ahmed Hussein edited comment on TEZ-3391 at 1/22/20 2:38 PM:
-

I agree with [~rohini] that the implementation is not efficient.
The ideal fix is to read the object array {{TaskSplitMetaInfo[]}} only once and 
do all the validation in the AM, then pass the {{TaskSplitMetaInfo[index]}} to 
the task initializer. This may imply significant code changes.
The existing code also has significant space overhead, because each task 
creates an array of all the split meta info objects. This means the code has 
{{n^2}} space complexity overall. The patch will reduce the space complexity, 
but each task still needs to go through the entire meta file.
Finally, the code was not closing the InputStream properly: an exception would 
leak the file handle.

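The sketch below returns a single entry while guaranteeing the stream is closed. It assumes a hypothetical meta file layout: a 4-byte record count followed by length-prefixed records. The layout, {{SplitMetaReaderSketch.readSplitAt}}, and the {{byte[]}} record type are illustrative assumptions. The real split meta info format and {{TaskSplitMetaInfo}} differ.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch: return only the record at `index`, reading sequentially
// and stopping early instead of materializing an array of all records.
final class SplitMetaReaderSketch {
  private SplitMetaReaderSketch() {}

  /** Assumed layout: int count, then `count` records of (int length, bytes). */
  static byte[] readSplitAt(InputStream rawIn, int index) throws IOException {
    // try-with-resources closes the stream on both normal and exceptional exit.
    try (DataInputStream in = new DataInputStream(rawIn)) {
      int count = in.readInt();
      if (index < 0 || index >= count) {
        throw new IndexOutOfBoundsException("split " + index + " of " + count);
      }
      for (int i = 0; i < index; i++) {
        int len = in.readInt();
        // Skip earlier records without keeping them; a short skip is treated as truncation.
        if (in.skipBytes(len) != len) {
          throw new EOFException("truncated record " + i);
        }
      }
      byte[] record = new byte[in.readInt()];
      in.readFully(record);
      return record;  // records after `index` are never read
    }
  }
}
```

Closing via try-with-resources addresses the leaked handle regardless of how the records themselves are parsed.
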
[~jeagles], can you please take a look at the patch and merge it at your 
convenience?


was (Author: ahussein):
I agree with [~rohini] that the implementation is not efficient.
The ideal fix is to read the object array {{TaskSplitMetaInfo[]}} only once and 
do all the validation in the AM, then pass the {{TaskSplitMetaInfo[index]}} to 
the task initializer. This may imply significant code changes.
The existing code also has significant space overhead, because each task 
creates an array of all the split meta info objects. This means the code has 
{{n^2}} space complexity overall. The patch will reduce the space complexity, 
but each task still needs to go through the entire meta file.

[~jeagles], can you please take a look at the patch and merge it at your 
convenience?

> MR split file validation should be done in the AM
> -
>
> Key: TEZ-3391
> URL: https://issues.apache.org/jira/browse/TEZ-3391
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Rohini Palaniswamy
>Assignee: Ahmed Hussein
>Priority: Major
> Attachments: TEZ-3391.001.patch, TEZ-3391.002.patch
>
>
> We had a case where the split metadata size exceeded 1000. Instead of the job 
> failing validation during initialization in the AM, as in MapReduce, each of 
> the tasks failed that validation during its own initialization.
>   





[jira] [Commented] (TEZ-3391) MR split file validation should be done in the AM

2020-01-22 Thread Ahmed Hussein (Jira)


[ 
https://issues.apache.org/jira/browse/TEZ-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17021119#comment-17021119
 ] 

Ahmed Hussein commented on TEZ-3391:


I agree with [~rohini] that the implementation is not efficient.
The ideal fix is to read the object array {{TaskSplitMetaInfo[]}} only once and 
do all the validation in the AM, then pass the {{TaskSplitMetaInfo[index]}} to 
the task initializer. This may imply significant code changes.
The existing code also has significant space overhead, because each task 
creates an array of all the split meta info objects. This means the code has 
{{n^2}} space complexity overall. The patch will reduce the space complexity, 
but each task still needs to go through the entire meta file.

[~jeagles], can you please take a look at the patch and merge it at your 
convenience?

> MR split file validation should be done in the AM
> -
>
> Key: TEZ-3391
> URL: https://issues.apache.org/jira/browse/TEZ-3391
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Rohini Palaniswamy
>Assignee: Ahmed Hussein
>Priority: Major
> Attachments: TEZ-3391.001.patch, TEZ-3391.002.patch
>
>
> We had a case where the split metadata size exceeded 1000. Instead of the job 
> failing validation during initialization in the AM, as in MapReduce, each of 
> the tasks failed that validation during its own initialization.
>   





[jira] [Updated] (TEZ-3391) MR split file validation should be done in the AM

2020-01-22 Thread Ahmed Hussein (Jira)


 [ 
https://issues.apache.org/jira/browse/TEZ-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmed Hussein updated TEZ-3391:
---
Attachment: TEZ-3391.002.patch

> MR split file validation should be done in the AM
> -
>
> Key: TEZ-3391
> URL: https://issues.apache.org/jira/browse/TEZ-3391
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Rohini Palaniswamy
>Assignee: Ahmed Hussein
>Priority: Major
> Attachments: TEZ-3391.001.patch, TEZ-3391.002.patch
>
>
> We had a case where the split metadata size exceeded 1000. Instead of the job 
> failing validation during initialization in the AM, as in MapReduce, each of 
> the tasks failed that validation during its own initialization.
>   





[jira] [Updated] (TEZ-3391) MR split file validation should be done in the AM

2020-01-21 Thread Ahmed Hussein (Jira)


 [ 
https://issues.apache.org/jira/browse/TEZ-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmed Hussein updated TEZ-3391:
---
Attachment: TEZ-3391.001.patch

> MR split file validation should be done in the AM
> -
>
> Key: TEZ-3391
> URL: https://issues.apache.org/jira/browse/TEZ-3391
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Rohini Palaniswamy
>Assignee: Ahmed Hussein
>Priority: Major
> Attachments: TEZ-3391.001.patch
>
>
> We had a case where the split metadata size exceeded 1000. Instead of the job 
> failing validation during initialization in the AM, as in MapReduce, each of 
> the tasks failed that validation during its own initialization.
>   





[jira] [Assigned] (TEZ-3391) MR split file validation should be done in the AM

2020-01-21 Thread Ahmed Hussein (Jira)


 [ 
https://issues.apache.org/jira/browse/TEZ-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmed Hussein reassigned TEZ-3391:
--

Assignee: Ahmed Hussein  (was: Nishant Dash)

> MR split file validation should be done in the AM
> -
>
> Key: TEZ-3391
> URL: https://issues.apache.org/jira/browse/TEZ-3391
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Rohini Palaniswamy
>Assignee: Ahmed Hussein
>Priority: Major
>
> We had a case where the split metadata size exceeded 1000. Instead of the job 
> failing validation during initialization in the AM, as in MapReduce, each of 
> the tasks failed that validation during its own initialization.
>   





[jira] [Commented] (TEZ-4106) Add Exponential Smooth RuntimeEstimator to the speculator

2019-12-30 Thread Ahmed Hussein (Jira)


[ 
https://issues.apache.org/jira/browse/TEZ-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005483#comment-17005483
 ] 

Ahmed Hussein commented on TEZ-4106:


[~jeagles] Can you please review the patch?

> Add Exponential Smooth RuntimeEstimator to the speculator
> -
>
> Key: TEZ-4106
> URL: https://issues.apache.org/jira/browse/TEZ-4106
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Major
> Attachments: TEZ-4106.001.patch, TEZ-4106.002.patch, 
> TEZ-4106.003.patch, TEZ-4106.004.patch
>
>
> Tez speculator implements a start-end runtime estimator. Similar to 
> [MAPREDUCE-7208|https://issues.apache.org/jira/browse/MAPREDUCE-7208], we 
> need to implement an adaptive estimator based on exponential smoothing.





[jira] [Updated] (TEZ-4106) Add Exponential Smooth RuntimeEstimator to the speculator

2019-12-30 Thread Ahmed Hussein (Jira)


 [ 
https://issues.apache.org/jira/browse/TEZ-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmed Hussein updated TEZ-4106:
---
Attachment: TEZ-4106.004.patch

> Add Exponential Smooth RuntimeEstimator to the speculator
> -
>
> Key: TEZ-4106
> URL: https://issues.apache.org/jira/browse/TEZ-4106
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Major
> Attachments: TEZ-4106.001.patch, TEZ-4106.002.patch, 
> TEZ-4106.003.patch, TEZ-4106.004.patch
>
>
> Tez speculator implements a start-end runtime estimator. Similar to 
> [MAPREDUCE-7208|https://issues.apache.org/jira/browse/MAPREDUCE-7208], we 
> need to implement an adaptive estimator based on exponential smoothing.





[jira] [Updated] (TEZ-4106) Add Exponential Smooth RuntimeEstimator to the speculator

2019-12-30 Thread Ahmed Hussein (Jira)


 [ 
https://issues.apache.org/jira/browse/TEZ-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmed Hussein updated TEZ-4106:
---
Attachment: TEZ-4106.003.patch

> Add Exponential Smooth RuntimeEstimator to the speculator
> -
>
> Key: TEZ-4106
> URL: https://issues.apache.org/jira/browse/TEZ-4106
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Major
> Attachments: TEZ-4106.001.patch, TEZ-4106.002.patch, 
> TEZ-4106.003.patch
>
>
> Tez speculator implements a start-end runtime estimator. Similar to 
> [MAPREDUCE-7208|https://issues.apache.org/jira/browse/MAPREDUCE-7208], we 
> need to implement an adaptive estimator based on exponential smoothing.





[jira] [Updated] (TEZ-4106) Add Exponential Smooth RuntimeEstimator to the speculator

2019-12-20 Thread Ahmed Hussein (Jira)


 [ 
https://issues.apache.org/jira/browse/TEZ-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmed Hussein updated TEZ-4106:
---
Attachment: TEZ-4106.002.patch

> Add Exponential Smooth RuntimeEstimator to the speculator
> -
>
> Key: TEZ-4106
> URL: https://issues.apache.org/jira/browse/TEZ-4106
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Major
> Attachments: TEZ-4106.001.patch, TEZ-4106.002.patch
>
>
> Tez speculator implements a start-end runtime estimator. Similar to 
> [MAPREDUCE-7208|https://issues.apache.org/jira/browse/MAPREDUCE-7208], we 
> need to implement an adaptive estimator based on exponential smoothing.





[jira] [Updated] (TEZ-4106) Add Exponential Smooth RuntimeEstimator to the speculator

2019-12-09 Thread Ahmed Hussein (Jira)


 [ 
https://issues.apache.org/jira/browse/TEZ-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmed Hussein updated TEZ-4106:
---
Description: Tez speculator implements a start-end runtime estimator. Similar 
to [MAPREDUCE-7208|https://issues.apache.org/jira/browse/MAPREDUCE-7208], we 
need to implement an adaptive estimator based on exponential smoothing.

> Add Exponential Smooth RuntimeEstimator to the speculator
> -
>
> Key: TEZ-4106
> URL: https://issues.apache.org/jira/browse/TEZ-4106
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Major
>
> Tez speculator implements a start-end runtime estimator. Similar to 
> [MAPREDUCE-7208|https://issues.apache.org/jira/browse/MAPREDUCE-7208], we 
> need to implement an adaptive estimator based on exponential smoothing.





[jira] [Created] (TEZ-4106) Add Exponential Smooth RuntimeEstimator to the speculator

2019-12-09 Thread Ahmed Hussein (Jira)
Ahmed Hussein created TEZ-4106:
--

 Summary: Add Exponential Smooth RuntimeEstimator to the speculator
 Key: TEZ-4106
 URL: https://issues.apache.org/jira/browse/TEZ-4106
 Project: Apache Tez
  Issue Type: Improvement
Reporter: Ahmed Hussein
Assignee: Ahmed Hussein








[jira] [Updated] (TEZ-4103) Progress in DAG, Vertex, and tasks is incorrect

2019-12-04 Thread Ahmed Hussein (Jira)


 [ 
https://issues.apache.org/jira/browse/TEZ-4103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmed Hussein updated TEZ-4103:
---
Attachment: TEZ-4103.006.patch

> Progress in DAG, Vertex, and tasks is incorrect
> ---
>
> Key: TEZ-4103
> URL: https://issues.apache.org/jira/browse/TEZ-4103
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Major
> Attachments: TEZ-4103.001.patch, TEZ-4103.002.patch, 
> TEZ-4103.003.patch, TEZ-4103.004.patch, TEZ-4103.005.patch, TEZ-4103.006.patch
>
>
> Looking at the progress code, there are a few issues that could lead to 
> problems calculating the progress.
>  There are some cases when the progress never reaches 1.0.
>  This is a list of issues that need to be fixed in the progress code:
>  * After TEZ-3982, since some progress values are skipped, in some cases the 
> progress of the DAG or a vertex may never reach 1.0f. This affects both 
> {{DAGImpl.java}} and {{ProgressHelper.java}}.
>  * {{ProgressHelper}} schedules a service to update the progress, dubbed 
> {{ProgressHelper.monitorProgress}}. According to the Java documentation:
> {quote}If any execution of the task encounters an exception,
>  subsequent executions are suppressed.
>  Otherwise, the task will only terminate via cancellation
>  or termination of the executor.
> {quote}
> In other words, if the scheduled task throws an exception, it stops silently; 
> there is no way to catch that in the code, and the progress will never be updated again.
>  * {{SimpleProcessor.inputMap}} is not thread-safe. It is 
> initialized as a {{LinkedHashMap}}, and there is no synchronization on the 
> objects in the map. This could be problematic in a concurrent context.
>  * {{VertexImpl.getProgress()}} does not check the range of the progress 
> calculated in {{VertexImpl.computeProgress()}}.
>   
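
The sketch below covers two of the fixes listed above. It assumes a hypothetical {{SafeProgressMonitor}} rather than Tez's {{ProgressHelper}}. The periodic task catches its own exceptions, so {{ScheduledExecutorService.scheduleAtFixedRate}} does not suppress later executions. Every reported value is also clamped into [0, 1].

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.DoubleSupplier;

// Hypothetical sketch, not Tez's ProgressHelper.
final class SafeProgressMonitor {
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();
  private volatile float progress;
  private volatile int failures;  // written only by the single scheduler thread

  /** Maps NaN and out-of-range values into [0, 1]. */
  static float clamp(double raw) {
    if (Double.isNaN(raw)) {
      return 0f;
    }
    return (float) Math.max(0.0, Math.min(1.0, raw));
  }

  void start(DoubleSupplier source, long periodMillis) {
    scheduler.scheduleAtFixedRate(() -> {
      try {
        progress = clamp(source.getAsDouble());
      } catch (RuntimeException e) {
        // Count and continue: an exception escaping this Runnable would
        // silently cancel all subsequent executions of the periodic task.
        failures++;
      }
    }, 0, periodMillis, TimeUnit.MILLISECONDS);
  }

  float getProgress() { return progress; }

  int getFailures() { return failures; }

  void stop() { scheduler.shutdownNow(); }
}
```

The catch inside the task is needed because the {{ScheduledFuture}} returned by {{scheduleAtFixedRate}} only surfaces the exception when {{get()}} is called on it. A fire-and-forget monitor typically never makes that call.
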





[jira] [Updated] (TEZ-4103) Progress in DAG, Vertex, and tasks is incorrect

2019-12-04 Thread Ahmed Hussein (Jira)


 [ 
https://issues.apache.org/jira/browse/TEZ-4103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmed Hussein updated TEZ-4103:
---
Attachment: TEZ-4103.005.patch

> Progress in DAG, Vertex, and tasks is incorrect
> ---
>
> Key: TEZ-4103
> URL: https://issues.apache.org/jira/browse/TEZ-4103
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Major
> Attachments: TEZ-4103.001.patch, TEZ-4103.002.patch, 
> TEZ-4103.003.patch, TEZ-4103.004.patch, TEZ-4103.005.patch


[jira] [Updated] (TEZ-4103) Progress in DAG, Vertex, and tasks is incorrect

2019-12-03 Thread Ahmed Hussein (Jira)


 [ 
https://issues.apache.org/jira/browse/TEZ-4103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmed Hussein updated TEZ-4103:
---
Attachment: TEZ-4103.004.patch

> Progress in DAG, Vertex, and tasks is incorrect
> ---
>
> Key: TEZ-4103
> URL: https://issues.apache.org/jira/browse/TEZ-4103
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Major
> Attachments: TEZ-4103.001.patch, TEZ-4103.002.patch, 
> TEZ-4103.003.patch, TEZ-4103.004.patch


[jira] [Commented] (TEZ-4103) Progress in DAG, Vertex, and tasks is incorrect

2019-12-02 Thread Ahmed Hussein (Jira)


[ 
https://issues.apache.org/jira/browse/TEZ-4103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16986442#comment-16986442
 ] 

Ahmed Hussein commented on TEZ-4103:


{quote}I can see this patch goes through great effort to centralize logging 
into the ProgressHelper. However, it adds IMHO unnecessarily complex code by 
using lambdas in log statements as well as separates the condition checking and 
logging from its origin of error. Unless I'm missing the necessity, I think 
this code becomes much simpler with a simple if isDebugEnabled() check followed 
by parameterized LOG.debug statement. Once this is done we can remove the 
logDebug helper methods.
{quote}
I thought that lambda expressions reduce the overhead because the expression 
(i.e., the parameters to the lambda expression and the string formatting) won't 
be evaluated until {{fn.apply()}} is called. I will replace the lambda with a 
simple {{isDebugEnabled()}} check. Still, we need a way to aggregate the progress 
logging to make it easy to debug. For example, if we use {{isDebugEnabled()}} in 
each class, we will need to enable logging for every class that has a 
{{getProgress()}} method. Logging from one class, on the other hand, makes it 
easy to enable/disable debugging of {{getProgress()}}.
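A minimal sketch of the single-class approach with a plain guard instead of lambdas, assuming all progress diagnostics go through one logger. Tez logs through SLF4J ({{isDebugEnabled()}}); {{java.util.logging}} is swapped in here only to keep the example dependency-free, and {{ProgressLog}}/{{outOfRange}} are hypothetical names.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/**
 * Hypothetical central place for getProgress() diagnostics.
 * isLoggable(Level.FINE) plays the role of SLF4J's isDebugEnabled().
 */
final class ProgressLog {
  // One logger name for all progress diagnostics, so a single level setting
  // enables or disables them across DAG, vertex and task code.
  static final Logger LOG = Logger.getLogger("tez.progress");

  private ProgressLog() {
  }

  static boolean debugEnabled() {
    return LOG.isLoggable(Level.FINE);
  }

  static void outOfRange(String owner, float progress) {
    // The guard avoids building the parameter array when FINE is disabled.
    if (debugEnabled()) {
      LOG.log(Level.FINE, "{0} reported out-of-range progress {1}",
          new Object[] {owner, progress});
    }
  }
}
```

Callers keep the check next to where the bad value is detected, while the on/off switch lives in one logger.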

{quote}I also wondered about the thread monitoring. Can you help me to 
understand why a catch (Throwable) wasn't sufficient. As per 
https://stackoverflow.com/a/24902026. Seems like (though I am not positive) we 
have created a thread to monitor the other thread.{quote}

I misread the Javadoc, thinking that future invocations would halt as long as 
an exception had been set on the thread in the JVM. I will simplify the code 
by removing the re-launching piece.
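A small sketch of the {{catch (Throwable)}} approach suggested in the review, using only {{java.util.concurrent}}. {{ProgressPoller}} is a hypothetical stand-in for {{ProgressHelper.monitorProgress}}; it shows that a periodic task which catches its own exceptions keeps being rescheduled, so no second thread is needed to monitor the first.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/**
 * Hypothetical sketch (names are illustrative, not Tez APIs) of a periodic
 * progress poller whose task never lets an exception escape.
 */
final class ProgressPoller {
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();
  private final AtomicInteger failures = new AtomicInteger();
  private volatile float lastProgress = 0.0f;

  ScheduledFuture<?> start(Supplier<Float> progressSource, long periodMs) {
    return scheduler.scheduleAtFixedRate(() -> {
      try {
        lastProgress = progressSource.get();
      } catch (Throwable t) {
        // If this throwable escaped, scheduleAtFixedRate would suppress every
        // later execution; catching it keeps the periodic update alive.
        failures.incrementAndGet();
      }
    }, 0L, periodMs, TimeUnit.MILLISECONDS);
  }

  float lastProgress() {
    return lastProgress;
  }

  int failures() {
    return failures.get();
  }

  void shutdown() {
    scheduler.shutdownNow();
  }
}
```

Catching {{Throwable}} also swallows {{Error}}s; a real implementation might rethrow fatal errors after logging them.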

{quote}Functionally, it isn't incorrect to use a LogicalInput that isn't 
AbstractLogicalInput. While I like logging the non-compliant class as 
speculative execution is very limited in that scenario, is it too excessive to 
log that condition every time?{quote}
I saw in the Javadoc that {{AbstractLogicalInput}} should be the base class for 
all implementations. If that is the design, then any other implementation would 
be incorrect.

{code:java}
/**
 * An abstract class which should be the base class for all implementations of LogicalInput.
 *
 * This class implements the framework facing as well as user facing methods which need to be
 * implemented by all LogicalInputs.
{code}
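If the non-compliant case is worth logging but not on every call, one option is to report each offending class only once. A hypothetical sketch ({{NonCompliantInputReporter}} is not a Tez class, and {{java.util.logging}} stands in for the real logger):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.logging.Logger;

/**
 * Hypothetical sketch: report an input class that does not extend the
 * expected base class once, instead of on every progress call.
 */
final class NonCompliantInputReporter {
  private static final Logger LOG = Logger.getLogger("tez.input");
  // Thread-safe set of class names that have already been reported.
  private final Set<String> reported = ConcurrentHashMap.newKeySet();

  /** Returns true only the first time a given non-compliant class is seen. */
  boolean reportOnce(Class<?> inputClass, Class<?> expectedBase) {
    if (expectedBase.isAssignableFrom(inputClass)) {
      return false; // compliant, nothing to report
    }
    if (reported.add(inputClass.getName())) {
      LOG.warning(inputClass.getName() + " does not extend "
          + expectedBase.getName());
      return true;
    }
    return false;
  }
}
```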



 
 

> Progress in DAG, Vertex, and tasks is incorrect
> ---
>
> Key: TEZ-4103
> URL: https://issues.apache.org/jira/browse/TEZ-4103
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Major
> Attachments: TEZ-4103.001.patch, TEZ-4103.002.patch, 
> TEZ-4103.003.patch


[jira] [Updated] (TEZ-4103) Progress in DAG, Vertex, and tasks is incorrect

2019-12-02 Thread Ahmed Hussein (Jira)


 [ 
https://issues.apache.org/jira/browse/TEZ-4103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmed Hussein updated TEZ-4103:
---
Attachment: TEZ-4103.003.patch

> Progress in DAG, Vertex, and tasks is incorrect
> ---
>
> Key: TEZ-4103
> URL: https://issues.apache.org/jira/browse/TEZ-4103
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Major
> Attachments: TEZ-4103.001.patch, TEZ-4103.002.patch, 
> TEZ-4103.003.patch


[jira] [Updated] (TEZ-4103) Progress in DAG, Vertex, and tasks is incorrect

2019-12-02 Thread Ahmed Hussein (Jira)


 [ 
https://issues.apache.org/jira/browse/TEZ-4103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmed Hussein updated TEZ-4103:
---
Attachment: TEZ-4103.002.patch

> Progress in DAG, Vertex, and tasks is incorrect
> ---
>
> Key: TEZ-4103
> URL: https://issues.apache.org/jira/browse/TEZ-4103
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Major
> Attachments: TEZ-4103.001.patch, TEZ-4103.002.patch


[jira] [Commented] (TEZ-4103) Progress in DAG, Vertex, and tasks is incorrect

2019-11-27 Thread Ahmed Hussein (Jira)


[ 
https://issues.apache.org/jira/browse/TEZ-4103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16983810#comment-16983810
 ] 

Ahmed Hussein commented on TEZ-4103:


Changing the data structure of the inputs to a thread-safe implementation will 
need lots of changes across the source code. It is better to handle that in a 
separate Jira.
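For reference, one low-impact option (not what the patch does) is to keep {{LinkedHashMap}}'s insertion order and wrap it with {{Collections.synchronizedMap}}. {{OrderedInputs}} is a hypothetical name:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical wrapper: keeps LinkedHashMap's insertion order but
 * serializes access through Collections.synchronizedMap.
 */
final class OrderedInputs<V> {
  private final Map<String, V> inputs =
      Collections.synchronizedMap(new LinkedHashMap<>());

  void put(String name, V input) {
    inputs.put(name, input);
  }

  V get(String name) {
    return inputs.get(name);
  }

  /** Iterating a synchronizedMap view requires holding the map's own lock. */
  List<V> snapshot() {
    synchronized (inputs) {
      return new ArrayList<>(inputs.values());
    }
  }
}
```

This only guards the map structure itself; the input objects stored in it would still need their own synchronization.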

> Progress in DAG, Vertex, and tasks is incorrect
> ---
>
> Key: TEZ-4103
> URL: https://issues.apache.org/jira/browse/TEZ-4103
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Major
> Attachments: TEZ-4103.001.patch


[jira] [Updated] (TEZ-4103) Progress in DAG, Vertex, and tasks is incorrect

2019-11-27 Thread Ahmed Hussein (Jira)


 [ 
https://issues.apache.org/jira/browse/TEZ-4103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmed Hussein updated TEZ-4103:
---
Attachment: TEZ-4103.001.patch

> Progress in DAG, Vertex, and tasks is incorrect
> ---
>
> Key: TEZ-4103
> URL: https://issues.apache.org/jira/browse/TEZ-4103
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Major
> Attachments: TEZ-4103.001.patch


[jira] [Created] (TEZ-4103) Progress in DAG, Vertex, and tasks is incorrect

2019-11-27 Thread Ahmed Hussein (Jira)
Ahmed Hussein created TEZ-4103:
--

 Summary: Progress in DAG, Vertex, and tasks is incorrect
 Key: TEZ-4103
 URL: https://issues.apache.org/jira/browse/TEZ-4103
 Project: Apache Tez
  Issue Type: Bug
Reporter: Ahmed Hussein
Assignee: Ahmed Hussein


Looking at the progress code, there are a few issues that could lead to 
problems calculating the progress.
 There are some cases where the progress never reaches 1.0.
 This is a list of issues that need to be fixed in the progress code:
 * After TEZ-3982, because some values are skipped, the progress of a DAG or a 
vertex may in some cases never reach 1.0f. This affects both "{{DAGImpl.java}}" 
and "{{ProgressHelper.java}}".
 * {{ProgressHelper}} schedules a service to update the progress, dubbed 
{{ProgressHelper.monitorProgress}}. According to the Java documentation:
{quote}If any execution of the task encounters an exception,
 subsequent executions are suppressed.
 Otherwise, the task will only terminate via cancellation
 or termination of the executor.
{quote}
In other words, if the service dies, there is no way to catch that in the code 
and the progress will never be updated.

 * {{SimpleProcessor.inputMap}} is not thread-safe. It is initialized as a 
{{LinkedHashMap}} and there is no synchronization on the field objects in 
the map. This could be problematic in a concurrent context.
 * {{VertexImpl.getProgress()}} does not check the range of the progress 
calculated in {{VertexImpl.computeProgress()}}.
  





[jira] [Updated] (TEZ-4067) Tez Speculation decision is calculated on each update by the dispatcher

2019-11-26 Thread Ahmed Hussein (Jira)


 [ 
https://issues.apache.org/jira/browse/TEZ-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmed Hussein updated TEZ-4067:
---
Attachment: TEZ-4067.008.patch

> Tez Speculation decision is calculated on each update by the dispatcher
> ---
>
> Key: TEZ-4067
> URL: https://issues.apache.org/jira/browse/TEZ-4067
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Minor
> Attachments: TEZ-4067.001.patch, TEZ-4067.002.patch, 
> TEZ-4067.003.patch, TEZ-4067.004.patch, TEZ-4067.005.patch, 
> TEZ-4067.006.patch, TEZ-4067.007.patch, TEZ-4067.008.patch
>
>
> LegacySpeculator is an object field in VertexImpl. Therefore, all events are 
> handled synchronously by the caller (dispatcher). This implies the following:
>  # The dispatcher spends a long time executing updateStatus as it needs to 
> check the runtime estimation of the tezAttempts within the vertex.
>  # The speculator is per stage: launching a speculation may not be the 
> optimal decision. Ideally, based on resources, speculated tasks should be the 
> ones with the slowest progress.
>  # The time between speculations is skewed because there is a big delay for 
> the dispatcher to complete a full cycle. Also, speculation will be more 
> aggressive compared to MR because MR waits for 
> "soonest.retry.after.speculate" whenever a task is speculated. On the other 
> hand, Tez speculates more tasks as it processes stages in parallel.
>  
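A minimal sketch of the direction described above, assuming the dispatcher should only record status while a separate thread decides what to speculate. {{AsyncSpeculatorSketch}} and its methods are hypothetical; picking the lowest raw progress is a stand-in for the runtime estimation that {{LegacySpeculator}} actually performs.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/**
 * Hypothetical sketch (not Tez code): the dispatcher only records the latest
 * attempt progress, and a separate timer thread makes speculation decisions.
 */
final class AsyncSpeculatorSketch {
  private final Map<String, Float> latestProgress = new ConcurrentHashMap<>();
  private final ScheduledExecutorService timer =
      Executors.newSingleThreadScheduledExecutor();

  /** Dispatcher-side hook: a cheap map update, no runtime estimation here. */
  void onStatusUpdate(String attemptId, float progress) {
    latestProgress.put(attemptId, progress);
  }

  /** Dispatcher-side hook for attempts that finished or were killed. */
  void onAttemptFinished(String attemptId) {
    latestProgress.remove(attemptId);
  }

  /** Stand-in heuristic: the running attempt with the lowest progress, or null. */
  String slowestAttempt() {
    String slowest = null;
    float min = Float.MAX_VALUE;
    for (Map.Entry<String, Float> e : latestProgress.entrySet()) {
      if (e.getValue() < min) {
        min = e.getValue();
        slowest = e.getKey();
      }
    }
    return slowest;
  }

  /** Runs the decision on the timer thread with a fixed delay between checks. */
  void start(long periodMs, Consumer<String> speculate) {
    timer.scheduleWithFixedDelay(() -> {
      try {
        String candidate = slowestAttempt();
        if (candidate != null) {
          speculate.accept(candidate);
        }
      } catch (Throwable t) {
        // Swallowed so one failed check does not cancel later ones.
      }
    }, periodMs, periodMs, TimeUnit.MILLISECONDS);
  }

  void stop() {
    timer.shutdownNow();
  }
}
```

With a fixed delay between checks, the spacing between speculation decisions no longer depends on how busy the dispatcher is.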



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (TEZ-4067) Tez Speculation decision is calculated on each update by the dispatcher

2019-11-26 Thread Ahmed Hussein (Jira)


 [ 
https://issues.apache.org/jira/browse/TEZ-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmed Hussein updated TEZ-4067:
---
Attachment: TEZ-4067.007.patch

> Tez Speculation decision is calculated on each update by the dispatcher
> ---
>
> Key: TEZ-4067
> URL: https://issues.apache.org/jira/browse/TEZ-4067
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Minor
> Attachments: TEZ-4067.001.patch, TEZ-4067.002.patch, 
> TEZ-4067.003.patch, TEZ-4067.004.patch, TEZ-4067.005.patch, 
> TEZ-4067.006.patch, TEZ-4067.007.patch


[jira] [Updated] (TEZ-4067) Tez Speculation decision is calculated on each update by the dispatcher

2019-11-25 Thread Ahmed Hussein (Jira)


 [ 
https://issues.apache.org/jira/browse/TEZ-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmed Hussein updated TEZ-4067:
---
Attachment: TEZ-4067.006.patch

> Tez Speculation decision is calculated on each update by the dispatcher
> ---
>
> Key: TEZ-4067
> URL: https://issues.apache.org/jira/browse/TEZ-4067
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Minor
> Attachments: TEZ-4067.001.patch, TEZ-4067.002.patch, 
> TEZ-4067.003.patch, TEZ-4067.004.patch, TEZ-4067.005.patch, TEZ-4067.006.patch


[jira] [Commented] (TEZ-4067) Tez Speculation decision is calculated on each update by the dispatcher

2019-11-20 Thread Ahmed Hussein (Jira)


[ 
https://issues.apache.org/jira/browse/TEZ-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16978739#comment-16978739
 ] 

Ahmed Hussein commented on TEZ-4067:


[~jeagles], I tried to refresh my memory a little bit. There was a check on the 
service state to prevent starting the service more than once.

The workflow of the {{DAGAppMaster}} works as follows (correct me if I am 
wrong):

* {{DAGAppMaster}} is created.
* Services get initialized. This is the phase when the services are added to 
the "{{DAGAppMaster.services}}" map.
* All the services are started inside {{serviceStart.startServices()}}. Note 
that the {{DAG}} is not created yet.
* {{startDag()}} and {{startDagExecution}} finally create the DAG 
"{{currentDAG}}" and its vertices.

This workflow requires that speculators are initialized and started separately 
after the DAG is created. Although we can still add them to the services map, 
we cannot assume that they will start automatically in 
{{DAGAppMaster.serviceStart()}}.

The same applies to {{DAGAppMaster.serviceStop()}}, which is called at the end 
of the execution. Therefore, a service in the "{{DAGAppMaster.services}}" map 
will stay around until the whole DAG is completed. Even after a vertex 
completes, the speculator service related to that vertex will hang around until 
the {{DAGAppMaster}} is completed.
If we add the speculators to "{{DAGAppMaster.services}}", we won't be able to 
remove the service when a vertex is completed, since a {{Vertex/DAGImpl}} does 
not have access to the "{{DAGAppMaster.services}}".

I am almost done with implementing the code based on your suggestions. If you 
think it is acceptable for speculators to stay alive until the DAG is completed, 
I will go ahead and upload the patch. Otherwise, I will work on a few changes to 
remove the speculator of a completed vertex.

Let me know WDYT.
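A rough sketch of the second option (removing a vertex's speculator when the vertex completes), assuming the speculators are tracked in their own map instead of "{{DAGAppMaster.services}}". {{VertexSpeculators}} and {{Speculator}} are hypothetical names, not Tez classes:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical sketch (not Tez code): speculators tracked per vertex, started
 * once the vertex exists and stopped as soon as the vertex completes.
 */
final class VertexSpeculators {
  /** Minimal stand-in for a speculator service lifecycle. */
  interface Speculator {
    void start();

    void stop();
  }

  private final Map<String, Speculator> byVertex = new ConcurrentHashMap<>();

  /** Called after the DAG and its vertices are created; starts each vertex once. */
  void vertexStarted(String vertexId, Speculator speculator) {
    if (byVertex.putIfAbsent(vertexId, speculator) == null) {
      speculator.start();
    }
  }

  /** Called when a vertex completes, so its speculator does not wait for the AM to stop. */
  void vertexCompleted(String vertexId) {
    Speculator speculator = byVertex.remove(vertexId);
    if (speculator != null) {
      speculator.stop();
    }
  }

  /** Called from the AM stop path for vertices that never completed. */
  void stopAll() {
    for (String vertexId : byVertex.keySet()) {
      vertexCompleted(vertexId);
    }
  }

  int active() {
    return byVertex.size();
  }
}
```

The {{putIfAbsent}} guard plays the same role as the service-state check mentioned above: a second start for the same vertex is ignored.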


> Tez Speculation decision is calculated on each update by the dispatcher
> ---
>
> Key: TEZ-4067
> URL: https://issues.apache.org/jira/browse/TEZ-4067
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Minor
> Attachments: TEZ-4067.001.patch, TEZ-4067.002.patch, 
> TEZ-4067.003.patch, TEZ-4067.004.patch, TEZ-4067.005.patch


[jira] [Updated] (TEZ-4067) Tez Speculation decision is calculated on each update by the dispatcher

2019-11-19 Thread Ahmed Hussein (Jira)


 [ 
https://issues.apache.org/jira/browse/TEZ-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmed Hussein updated TEZ-4067:
---
Attachment: TEZ-4067.005.patch

> Tez Speculation decision is calculated on each update by the dispatcher
> ---
>
> Key: TEZ-4067
> URL: https://issues.apache.org/jira/browse/TEZ-4067
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Minor
> Attachments: TEZ-4067.001.patch, TEZ-4067.002.patch, 
> TEZ-4067.003.patch, TEZ-4067.004.patch, TEZ-4067.005.patch


[jira] [Comment Edited] (TEZ-4067) Tez Speculation decision is calculated on each update by the dispatcher

2019-11-15 Thread Ahmed Hussein (Jira)


[ 
https://issues.apache.org/jira/browse/TEZ-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16975416#comment-16975416
 ] 

Ahmed Hussein edited comment on TEZ-4067 at 11/15/19 9:34 PM:
--

Thanks Jon!
Sure, I will change that and create a new patch.


was (Author: ahussein):
Thanks Jon!Sure, I will change that and create a new patch.

> Tez Speculation decision is calculated on each update by the dispatcher
> ---
>
> Key: TEZ-4067
> URL: https://issues.apache.org/jira/browse/TEZ-4067
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Minor
> Attachments: TEZ-4067.001.patch, TEZ-4067.002.patch, 
> TEZ-4067.003.patch, TEZ-4067.004.patch


[jira] [Commented] (TEZ-4067) Tez Speculation decision is calculated on each update by the dispatcher

2019-11-15 Thread Ahmed Hussein (Jira)


[ 
https://issues.apache.org/jira/browse/TEZ-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16975416#comment-16975416
 ] 

Ahmed Hussein commented on TEZ-4067:


Thanks Jon! Sure, I will change that and create a new patch.

> Tez Speculation decision is calculated on each update by the dispatcher
> ---
>
> Key: TEZ-4067
> URL: https://issues.apache.org/jira/browse/TEZ-4067
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Minor
> Attachments: TEZ-4067.001.patch, TEZ-4067.002.patch, 
> TEZ-4067.003.patch, TEZ-4067.004.patch


[jira] [Updated] (TEZ-4067) Tez Speculation decision is calculated on each update by the dispatcher

2019-11-07 Thread Ahmed Hussein (Jira)


 [ 
https://issues.apache.org/jira/browse/TEZ-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmed Hussein updated TEZ-4067:
---
Attachment: TEZ-4067.004.patch

> Tez Speculation decision is calculated on each update by the dispatcher
> ---
>
> Key: TEZ-4067
> URL: https://issues.apache.org/jira/browse/TEZ-4067
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Minor
> Attachments: TEZ-4067.001.patch, TEZ-4067.002.patch, 
> TEZ-4067.003.patch, TEZ-4067.004.patch


[jira] [Commented] (TEZ-4067) Tez Speculation decision is calculated on each update by the dispatcher

2019-11-07 Thread Ahmed Hussein (Jira)


[ 
https://issues.apache.org/jira/browse/TEZ-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969556#comment-16969556
 ] 

Ahmed Hussein commented on TEZ-4067:


Uploaded a new patch to fix the errors reported by checkstyle and findbugs.



[jira] [Comment Edited] (TEZ-4067) Tez Speculation decision is calculated on each update by the dispatcher

2019-11-07 Thread Ahmed Hussein (Jira)


[ 
https://issues.apache.org/jira/browse/TEZ-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969556#comment-16969556
 ] 

Ahmed Hussein edited comment on TEZ-4067 at 11/7/19 9:07 PM:
-

Uploaded a new patch to fix the errors reported by checkstyle and findbugs.


was (Author: ahussein):
Uploaded a new patch to fix error reported in checkstyle and windbags.



[jira] [Updated] (TEZ-4067) Tez Speculation decision is calculated on each update by the dispatcher

2019-11-07 Thread Ahmed Hussein (Jira)


 [ 
https://issues.apache.org/jira/browse/TEZ-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmed Hussein updated TEZ-4067:
---
Attachment: TEZ-4067.003.patch



[jira] [Commented] (TEZ-4074) Tez does not run with Hadoop Trunk (3.3.0-snapshot)

2019-05-31 Thread Ahmed Hussein (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-4074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16853032#comment-16853032
 ] 

Ahmed Hussein commented on TEZ-4074:


Guava 27 and Guava 11.0.2 are not source compatible.

For example, Guava 27 removed API methods such as the two-argument 
{{Futures.addCallback(ListenableFuture future, FutureCallback callback)}}; only 
the overload that also takes an {{Executor}} remains:
 * Guava 11.0.2: 
[FutureCallback|https://google.github.io/guava/releases/11.0.2/api/docs/com/google/common/util/concurrent/Futures.html#addCallback(com.google.common.util.concurrent.ListenableFuture,%20com.google.common.util.concurrent.FutureCallback)]
 * Guava 27: 
[FutureCallback|https://static.javadoc.io/com.google.guava/guava/27.0.1-jre/com/google/common/util/concurrent/Futures.html#addCallback-com.google.common.util.concurrent.ListenableFuture-com.google.common.util.concurrent.FutureCallback-java.util.concurrent.Executor-]




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (TEZ-4074) Tez does not run with Hadoop Trunk (3.3.0-snapshot)

2019-05-31 Thread Ahmed Hussein (JIRA)
Ahmed Hussein created TEZ-4074:
--

 Summary: Tez does not run with Hadoop Trunk (3.3.0-snapshot)
 Key: TEZ-4074
 URL: https://issues.apache.org/jira/browse/TEZ-4074
 Project: Apache Tez
  Issue Type: Bug
Reporter: Ahmed Hussein


Tez throws a runtime exception when compiled against Hadoop-3.3.0.

With Tez using Guava 11.0.2 and Hadoop using Guava 27.0-jre (see 
HADOOP-16210), the two Guava versions are incompatible.
{code:java}
[ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.136 s 
<<< FAILURE! - in org.apache.tez.dag.app.TestSpeculation
[ERROR] org.apache.tez.dag.app.TestSpeculation Time elapsed: 0.136 s <<< ERROR!
java.lang.NoSuchMethodError: 
com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V
at 
org.apache.tez.dag.app.TestSpeculation.setupSpeculation(TestSpeculation.java:86)
{code}
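It looks like Guava added fixed-arity overloads that take a single message 
argument (to avoid allocating a varargs array). Code compiled against these 
overloads no longer matches the {{VAR_ARGS}} signature of the older Guava, so 
even though the code is source compatible, it throws a runtime error due to 
binary incompatibility.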
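The binary incompatibility can be reproduced in miniature without Guava. The sketch below uses a hypothetical {{PreconditionsV2}} class (not Guava itself) that keeps a varargs {{checkArgument}} and adds a fixed-arity overload, mirroring the change described above. Java resolves overloads in phases and only falls back to varargs methods when no fixed-arity method applies, so a call with one message argument binds to the fixed-arity descriptor {{(ZLjava/lang/String;Ljava/lang/Object;)V}}. Bytecode compiled that way fails with {{NoSuchMethodError}} when it runs against a library version that only has the varargs method.

```java
/**
 * Hypothetical stand-in for a library class: "version 1" had only the varargs
 * method, "version 2" (this class) also has a fixed-arity overload.
 */
class PreconditionsV2 {
    // Records which overload was bound at compile time, for demonstration only.
    static String lastOverload = "";

    // Varargs form, present in both versions.
    // JVM descriptor: (ZLjava/lang/String;[Ljava/lang/Object;)V
    static void checkArgument(boolean expression, String template, Object... args) {
        lastOverload = "varargs";
        if (!expression) {
            throw new IllegalArgumentException(String.format(template, args));
        }
    }

    // Fixed-arity overload, present only in "version 2".
    // JVM descriptor: (ZLjava/lang/String;Ljava/lang/Object;)V
    static void checkArgument(boolean expression, String template, Object arg) {
        lastOverload = "fixed-arity";
        if (!expression) {
            throw new IllegalArgumentException(String.format(template, arg));
        }
    }
}

class OverloadDemo {
    /** One message argument: javac prefers the fixed-arity overload. */
    static String bindingForOneArg() {
        PreconditionsV2.checkArgument(true, "value %s", 42);
        return PreconditionsV2.lastOverload;
    }

    /** Two message arguments: only the varargs method is applicable. */
    static String bindingForTwoArgs() {
        PreconditionsV2.checkArgument(true, "values %s %s", 1, 2);
        return PreconditionsV2.lastOverload;
    }
}
```

In this sketch, a call site like the one in {{bindingForOneArg()}} is the kind that breaks when the runtime classpath carries the older, varargs-only class.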





[jira] [Issue Comment Deleted] (TEZ-4067) Tez Speculation decision is calculated on each update by the dispatcher

2019-05-30 Thread Ahmed Hussein (JIRA)


 [ 
https://issues.apache.org/jira/browse/TEZ-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmed Hussein updated TEZ-4067:
---
Comment: was deleted

(was: TEZ-1897)



[jira] [Commented] (TEZ-4067) Tez Speculation decision is calculated on each update by the dispatcher

2019-05-30 Thread Ahmed Hussein (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852045#comment-16852045
 ] 

Ahmed Hussein commented on TEZ-4067:


TEZ-1897



[jira] [Updated] (TEZ-4067) Tez Speculation decision is calculated on each update by the dispatcher

2019-05-29 Thread Ahmed Hussein (JIRA)


 [ 
https://issues.apache.org/jira/browse/TEZ-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmed Hussein updated TEZ-4067:
---
Attachment: (was: YARN-4067.002.patch)



[jira] [Updated] (TEZ-4067) Tez Speculation decision is calculated on each update by the dispatcher

2019-05-29 Thread Ahmed Hussein (JIRA)


 [ 
https://issues.apache.org/jira/browse/TEZ-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmed Hussein updated TEZ-4067:
---
Attachment: YARN-4067.002.patch



[jira] [Updated] (TEZ-4067) Tez Speculation decision is calculated on each update by the dispatcher

2019-05-29 Thread Ahmed Hussein (JIRA)


 [ 
https://issues.apache.org/jira/browse/TEZ-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmed Hussein updated TEZ-4067:
---
Attachment: TEZ-4067.002.patch



[jira] [Updated] (TEZ-4067) Tez Speculation decision is calculated on each update by the dispatcher

2019-05-29 Thread Ahmed Hussein (JIRA)


 [ 
https://issues.apache.org/jira/browse/TEZ-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmed Hussein updated TEZ-4067:
---
Attachment: YARN-9563.002.patch



[jira] [Updated] (TEZ-4067) Tez Speculation decision is calculated on each update by the dispatcher

2019-05-29 Thread Ahmed Hussein (JIRA)


 [ 
https://issues.apache.org/jira/browse/TEZ-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmed Hussein updated TEZ-4067:
---
Attachment: (was: YARN-9563.002.patch)



[jira] [Updated] (TEZ-4067) Tez Speculation decision is calculated on each update by the dispatcher

2019-05-28 Thread Ahmed Hussein (JIRA)


 [ 
https://issues.apache.org/jira/browse/TEZ-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmed Hussein updated TEZ-4067:
---
Attachment: TEZ-4067.001.patch



[jira] [Comment Edited] (TEZ-2164) Shade the guava version used by Tez and move to guava-18

2019-05-28 Thread Ahmed Hussein (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849772#comment-16849772
 ] 

Ahmed Hussein edited comment on TEZ-2164 at 5/28/19 2:29 PM:
-

Hadoop upgraded Guava to 27.0-jre (HADOOP-16210).

Tez running Guava 11.0.2 fails with runtime exceptions:
{code:java}
[ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.136 s <<< FAILURE! - in org.apache.tez.dag.app.TestSpeculation
[ERROR] org.apache.tez.dag.app.TestSpeculation Time elapsed: 0.136 s <<< ERROR!
java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V
    at org.apache.tez.dag.app.TestSpeculation.setupSpeculation(TestSpeculation.java:86)
{code}


was (Author: ahussein):
Hadoop upgraded guava to 27.0-jre 
([HADOOP-16210|https://issues.apache.org/jira/browse/HADOOP-16210]).

TEZ running 11.0.2 fails with runtime exceptions 
[ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.136 s 
<<< FAILURE! - in org.apache.tez.dag.app.TestSpeculation
[ERROR] org.apache.tez.dag.app.TestSpeculation Time elapsed: 0.136 s <<< ERROR!
java.lang.NoSuchMethodError: 
com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V
at 
org.apache.tez.dag.app.TestSpeculation.setupSpeculation(TestSpeculation.java:86)
 

 

 

> Shade the guava version used by Tez and move to guava-18
> 
>
> Key: TEZ-2164
> URL: https://issues.apache.org/jira/browse/TEZ-2164
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Siddharth Seth
>Assignee: Hitesh Shah
>Priority: Blocker
> Attachments: TEZ-2164.3.patch, TEZ-2164.4.patch, 
> TEZ-2164.wip.2.patch, allow-guava-16.0.1.patch
>
>
> Should allow us to upgrade to a newer version without shipping a guava 
> dependency.
> Would be good to do this in 0.7 so that we stop shipping guava as early as 
> possible.





[jira] [Commented] (TEZ-2164) Shade the guava version used by Tez and move to guava-18

2019-05-28 Thread Ahmed Hussein (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849772#comment-16849772
 ] 

Ahmed Hussein commented on TEZ-2164:


Hadoop upgraded Guava to 27.0-jre 
([HADOOP-16210|https://issues.apache.org/jira/browse/HADOOP-16210]).

Tez running Guava 11.0.2 fails with runtime exceptions:
{code:java}
[ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.136 s <<< FAILURE! - in org.apache.tez.dag.app.TestSpeculation
[ERROR] org.apache.tez.dag.app.TestSpeculation Time elapsed: 0.136 s <<< ERROR!
java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V
    at org.apache.tez.dag.app.TestSpeculation.setupSpeculation(TestSpeculation.java:86)
{code}
 

 

 



[jira] [Commented] (TEZ-4067) Tez Speculation decision is calculated on each update by the dispatcher

2019-05-24 Thread Ahmed Hussein (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16847860#comment-16847860
 ] 

Ahmed Hussein commented on TEZ-4067:


An earlier issue, [TEZ-3934|https://issues.apache.org/jira/browse/TEZ-3934], reported a 
race condition in the speculator code: when two task attempts update their 
progress simultaneously, the speculator may create two speculative attempts 
for the same task.

That jira was closed after adding two more checks on the hashes to verify that 
no attempt was speculated while the current thread was busy with the calculation.
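The check-then-act race described above can be illustrated with a small sketch. It is not the TEZ-3934 patch: instead of extra checks on the hashes, it swaps in a single atomic check-and-set ({{Set.add}} on a concurrent key set), so only one of several simultaneous updaters may launch a speculative attempt for a task. {{SpeculationGuard}}, {{RaceDemo}}, and the task ID are hypothetical names. Like the TEZ-3934 fix, it only guards the symptom; the speculation work itself still runs on every update.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical guard: at most one speculative attempt per task. */
class SpeculationGuard {
    private final Set<String> speculatedTasks = ConcurrentHashMap.newKeySet();

    /** Atomic check-and-set: exactly one concurrent caller per taskId gets true. */
    boolean tryMarkSpeculated(String taskId) {
        return speculatedTasks.add(taskId);
    }
}

class RaceDemo {
    /** Simulates `updaters` simultaneous progress updates for one task; returns launches. */
    static int launchesFor(int updaters) {
        SpeculationGuard guard = new SpeculationGuard();
        AtomicInteger launches = new AtomicInteger();
        CountDownLatch go = new CountDownLatch(1);
        ExecutorService pool = Executors.newFixedThreadPool(updaters);
        for (int i = 0; i < updaters; i++) {
            pool.execute(() -> {
                try {
                    go.await(); // release all updaters together to provoke the race
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                // Every updater decides the task is slow and tries to speculate it.
                if (guard.tryMarkSpeculated("task_000001")) {
                    launches.incrementAndGet();
                }
            });
        }
        go.countDown();
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return launches.get();
    }
}
```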

This does not solve the root problem, which is caused by calling maybeSpeculate() 
after every progress update. A proper fix would be to:
 * have the event handler return right after updating the taskAttempt status;
 * have a separate "speculator" thread run periodically to scan the tasks within a 
vertex and calculate the speculation decisions.
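A minimal sketch of that split, assuming a hypothetical {{PeriodicSpeculator}} class (not Tez's LegacySpeculator API): {{notifyAttemptStatusUpdate}} only records progress and returns, and a single scheduled thread performs the vertex scan. Because one thread does all scans, two simultaneous updates can no longer trigger two concurrent scans. The scan only picks the attempt with the lowest progress as a candidate; real runtime estimation is omitted.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical per-vertex speculator decoupled from the dispatcher. */
class PeriodicSpeculator {
    private final Map<String, Double> progressByAttempt = new ConcurrentHashMap<>();
    private final AtomicInteger scans = new AtomicInteger();
    private final ScheduledExecutorService scanner =
            Executors.newSingleThreadScheduledExecutor();
    private volatile String candidate;

    /** Dispatcher-side handler: records progress and returns immediately. */
    void notifyAttemptStatusUpdate(String attemptId, double progress) {
        progressByAttempt.put(attemptId, progress);
    }

    /** One pass over the vertex (stand-in for maybeSpeculate()). */
    void scanOnce() {
        scans.incrementAndGet();
        String slowest = null;
        double lowest = Double.MAX_VALUE;
        for (Map.Entry<String, Double> e : progressByAttempt.entrySet()) {
            if (e.getValue() < lowest) {
                lowest = e.getValue();
                slowest = e.getKey();
            }
        }
        // Only a candidate: a real speculator would also compare runtime estimates.
        candidate = slowest;
    }

    /** Runs scanOnce() on the single scanner thread every periodMillis. */
    void start(long periodMillis) {
        scanner.scheduleWithFixedDelay(this::scanOnce, periodMillis, periodMillis,
                TimeUnit.MILLISECONDS);
    }

    void stop() {
        scanner.shutdownNow();
    }

    int scanCount() {
        return scans.get();
    }

    String candidate() {
        return candidate;
    }
}
```

However many status updates arrive, the dispatcher does O(1) work per update; the scan cost is paid once per period. {{stop()}} should be called on shutdown because the executor's thread is not a daemon.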

 

Re-implementing the speculator as a service requires the following changes:
 # add each vertex's speculator to the list of services in the application 
master (i.e., DAGAppMaster);
 # api/DAG needs to support creating a vertex speculator as a service;
 # the test cases (TestSpeculation) may need to be rewritten because they were 
designed for a single-threaded implementation.
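For the first item, the application master would own one speculator service per vertex and manage its lifecycle. The sketch below is a hypothetical stand-in ({{SpeculatorService}}, {{SpeculatorServices}}, and {{RecordingSpeculator}} are illustrative names, not Tez or Hadoop classes) that starts the services in registration order and stops them in reverse order.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical minimal lifecycle for a per-vertex speculator. */
interface SpeculatorService {
    void start();

    void stop();
}

/** Hypothetical AM-side registry holding one speculator service per vertex. */
class SpeculatorServices {
    private final List<SpeculatorService> services = new ArrayList<>();

    void add(SpeculatorService service) {
        services.add(service);
    }

    void startAll() {
        for (SpeculatorService s : services) {
            s.start();
        }
    }

    /** Stops in reverse registration order, so later services shut down first. */
    void stopAll() {
        for (int i = services.size() - 1; i >= 0; i--) {
            services.get(i).stop();
        }
    }
}

/** Hypothetical speculator that only logs lifecycle calls, for illustration. */
class RecordingSpeculator implements SpeculatorService {
    private final String vertexName;
    private final List<String> log;

    RecordingSpeculator(String vertexName, List<String> log) {
        this.vertexName = vertexName;
        this.log = log;
    }

    @Override
    public void start() {
        log.add("start " + vertexName);
    }

    @Override
    public void stop() {
        log.add("stop " + vertexName);
    }
}
```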

 



[jira] [Commented] (TEZ-4067) Tez Speculation decision is calculated on each update by the dispatcher

2019-05-20 Thread Ahmed Hussein (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844220#comment-16844220
 ] 

Ahmed Hussein commented on TEZ-4067:


A concurrent async dispatcher was added in TEZ-1897. By default, the concurrent 
dispatcher is disabled.

To enable the concurrent dispatcher, the following property needs to be passed 
to the TezConfiguration:
{noformat}
-Dtez.am.use.concurrent-dispatcher=true  {noformat}
 

 
 # The concurrent dispatcher may not be ideal for production because each 
Task/TaskAttempt update implies a notify event on the blocking queue. For status 
updates, it may be faster to do the update within one thread rather than passing 
a new event between two threads.
 # The frequency of events could overwhelm the pool workers, so events may not 
be processed on time.
 # For both the synchronous and asynchronous dispatchers, there is no mechanism to 
prevent two different workers from scanning the vertex tasks. In that case, the 
workers would duplicate the work without any benefit.

 

Suggested fix

 
 # Keep the concurrent dispatcher disabled.
 # In LegacySpeculator, remove maybeSpeculate() from 
notifyAttemptStatusUpdate(). This will prevent the event handler from 
executing the main speculation loop.
 # Create a thread per speculator that executes maybeSpeculate() every 
soonestRetryAfterSpeculate/soonestRetryAfterNoSpeculate interval.

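The third step of the suggested fix could look roughly like the sketch below. It assumes that {{maybeSpeculate()}} reports how many speculative attempts it launched (an assumption; the return type is not stated here), and {{SpeculatorThread}} is a hypothetical name. The two delays are passed in rather than read from configuration.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.IntSupplier;

/** Hypothetical per-vertex speculator thread with an adaptive re-check interval. */
final class SpeculatorThread extends Thread {
    private final IntSupplier maybeSpeculate; // assumed: returns number of attempts launched
    private final long soonestRetryAfterSpeculate;
    private final long soonestRetryAfterNoSpeculate;
    private final AtomicBoolean running = new AtomicBoolean(true);

    SpeculatorThread(IntSupplier maybeSpeculate,
                     long soonestRetryAfterSpeculate,
                     long soonestRetryAfterNoSpeculate) {
        super("vertex-speculator");
        setDaemon(true); // does not keep the JVM alive on its own
        this.maybeSpeculate = maybeSpeculate;
        this.soonestRetryAfterSpeculate = soonestRetryAfterSpeculate;
        this.soonestRetryAfterNoSpeculate = soonestRetryAfterNoSpeculate;
    }

    /** Uses the "after speculate" delay if the last pass launched anything. */
    static long nextDelayMillis(int launched, long afterSpeculate, long afterNoSpeculate) {
        return launched > 0 ? afterSpeculate : afterNoSpeculate;
    }

    @Override
    public void run() {
        while (running.get()) {
            int launched = maybeSpeculate.getAsInt();
            try {
                Thread.sleep(nextDelayMillis(launched,
                        soonestRetryAfterSpeculate, soonestRetryAfterNoSpeculate));
            } catch (InterruptedException e) {
                return; // shutdown() interrupts the sleep
            }
        }
    }

    void shutdown() {
        running.set(false);
        interrupt();
    }
}
```

For example, with {{soonestRetryAfterSpeculate=15000}} and {{soonestRetryAfterNoSpeculate=1000}} (arbitrary values), a pass that launched attempts waits 15 s before the next pass, while a pass that launched nothing re-checks after 1 s.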
 



[jira] [Assigned] (TEZ-4067) Tez Speculation decision is calculated on each update by the dispatcher

2019-05-09 Thread Ahmed Hussein (JIRA)


 [ 
https://issues.apache.org/jira/browse/TEZ-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmed Hussein reassigned TEZ-4067:
--

Assignee: Ahmed Hussein



[jira] [Created] (TEZ-4067) Tez Speculation decision is calculated on each update by the dispatcher

2019-05-08 Thread Ahmed Hussein (JIRA)
Ahmed Hussein created TEZ-4067:
--

 Summary: Tez Speculation decision is calculated on each update by 
the dispatcher
 Key: TEZ-4067
 URL: https://issues.apache.org/jira/browse/TEZ-4067
 Project: Apache Tez
  Issue Type: Improvement
Reporter: Ahmed Hussein


LegacySpeculator is an object field in VertexImpl. Therefore, all events are 
handled synchronously by the caller (dispatcher). This implies the following:
 # the dispatcher spends a long time executing updateStatus because it needs to 
check the runtime estimation of the tezAttempts within the vertex.
 # the speculator is per stage: launching a speculation may not be the optimum 
decision. Ideally, based on resources, speculated tasks should be the ones with 
the slowest progress.
 # the time between speculations is skewed because there is a big delay for the 
dispatcher to complete a full cycle. Also, speculation will be more aggressive 
compared to MR because MR waits for "soonest.retry.after.speculate" whenever a 
task is speculated. Tez, on the other hand, speculates more tasks as it 
processes stages in parallel.

 


