[jira] [Commented] (TEZ-4349) DAGClient gets stuck with invalid cached DAGStatus
[ https://issues.apache.org/jira/browse/TEZ-4349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17470140#comment-17470140 ]

Ahmed Hussein commented on TEZ-4349:

Thanks [~abstractdog] for your feedback and for committing the changes!

> DAGClient gets stuck with invalid cached DAGStatus
> --
>
> Key: TEZ-4349
> URL: https://issues.apache.org/jira/browse/TEZ-4349
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Ahmed Hussein
> Assignee: Ahmed Hussein
> Priority: Major
> Fix For: 0.10.2
>
> Time Spent: 2h 20m
> Remaining Estimate: 0h
>
> I found that some Oozie launchers get stuck waiting for the job to complete.
> After investigation, I found that {{dagClient.getDAGStatus(null)}} calls the
> overload {{dagClient.getDAGStatus(null, 0)}}, which then calls
> {{getDAGStatusInternal}}, making use of the {{cachedDagStatus}} field.
> The {{cachedDagStatus}} is never updated, causing the launcher to wait
> indefinitely.
>
> [https://github.com/apache/tez/blob/master/tez-api/src/main/java/org/apache/tez/dag/api/client/DAGClientImpl.java#L212]
> {code:java}
> if (!dagCompleted) {
>   if (dagStatus != null) {
>     cachedDagStatus = dagStatus;
>     return dagStatus;
>   }
>   if (cachedDagStatus != null) {
>     // could not get from AM (not reachable/ was killed). return cached status.
>     return cachedDagStatus;
>   }
> }
> {code}
> +To Fix:+
> The {{cachedDagStatus}} should be valid for a certain amount of time or a
> certain number of retries.
> When the {{cachedDagStatus}} expires, the DAGClient tries to pull the status
> from the AM or the RM.
> If fetching the status from both the AM and the RM fails, null is returned to
> the caller.

--
This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (TEZ-4349) DAGClient gets stuck with invalid cached DAGStatus
Ahmed Hussein created TEZ-4349:
--
Summary: DAGClient gets stuck with invalid cached DAGStatus
Key: TEZ-4349
URL: https://issues.apache.org/jira/browse/TEZ-4349
Project: Apache Tez
Issue Type: Bug
Reporter: Ahmed Hussein
Assignee: Ahmed Hussein

I found that some Oozie launchers get stuck waiting for the job to complete.
After investigation, I found that {{dagClient.getDAGStatus(null)}} calls the overload {{dagClient.getDAGStatus(null, 0)}}, which then calls {{getDAGStatusInternal}}, making use of the {{cachedDagStatus}} field.
The {{cachedDagStatus}} is never updated, causing the launcher to wait indefinitely.

[https://github.com/apache/tez/blob/master/tez-api/src/main/java/org/apache/tez/dag/api/client/DAGClientImpl.java#L212]
{code:java}
if (!dagCompleted) {
  if (dagStatus != null) {
    cachedDagStatus = dagStatus;
    return dagStatus;
  }
  if (cachedDagStatus != null) {
    // could not get from AM (not reachable/ was killed). return cached status.
    return cachedDagStatus;
  }
}
{code}
+To Fix:+
The {{cachedDagStatus}} should be valid for a certain amount of time or a certain number of retries.
When the {{cachedDagStatus}} expires, the DAGClient tries to pull the status from the AM or the RM.
If fetching the status from both the AM and the RM fails, null is returned to the caller.

--
This message was sent by Atlassian Jira (v8.20.1#820001)
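The expiry idea in the "To Fix" note can be sketched roughly as follows. This is only an illustration under stated assumptions, not the TEZ-4349 patch: {{ExpiringStatusCache}}, its {{get}} method, and the injected clock are hypothetical names that do not exist in {{DAGClientImpl}}, and the sketch uses a time-based limit only (the note also allows a retry count).

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/**
 * Hypothetical sketch: a cached status that is served only while it is
 * younger than a time-to-live. None of these names come from DAGClientImpl.
 */
class ExpiringStatusCache<T> {
  private final long ttlMillis;
  private final LongSupplier clock; // injected so that expiry can be tested
  private T cached;
  private long cachedAtMillis;

  ExpiringStatusCache(long ttlMillis, LongSupplier clock) {
    this.ttlMillis = ttlMillis;
    this.clock = clock;
  }

  /**
   * Tries each source in order (for example the AM first, then the RM).
   * A fresh value replaces the cache. If every source fails, the cached value
   * is returned only while it is younger than the TTL; after that the result
   * is null, so the caller is no longer handed a stale status forever.
   */
  @SafeVarargs
  final T get(Supplier<T>... sources) {
    for (Supplier<T> source : sources) {
      T fresh = null;
      try {
        fresh = source.get();
      } catch (RuntimeException e) {
        // Assumption: a failing source is treated like an unreachable one.
      }
      if (fresh != null) {
        cached = fresh;
        cachedAtMillis = clock.getAsLong();
        return fresh;
      }
    }
    if (cached != null && clock.getAsLong() - cachedAtMillis < ttlMillis) {
      return cached;
    }
    cached = null; // expired: stop serving the stale value
    return null;
  }
}
```

A caller that polls in a loop would then see null once the cached status has expired and could fail or retry instead of waiting indefinitely.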
[jira] [Updated] (TEZ-4252) Request for tuning of data-skew scenarios when Tez is used as the compute engine
[ https://issues.apache.org/jira/browse/TEZ-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ahmed Hussein updated TEZ-4252:
---
Release Note: Please add a clear description of what the issue is about. It is not recommended that Jiras have empty descriptions. Also, can you please change the title and use a translation so that the Jira will be searchable?

was: [~yang1] Can you please add a clear description of what the issue is about? It is not recommended that Jiras have empty descriptions. Also, can you please change the title and use a translation so that the Jira will be searchable?

> Request for tuning of data-skew scenarios when Tez is used as the compute engine
> -
>
> Key: TEZ-4252
> URL: https://issues.apache.org/jira/browse/TEZ-4252
> Project: Apache Tez
> Issue Type: Wish
> Reporter: yang
> Priority: Minor

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (TEZ-4252) Request for tuning of data-skew scenarios when Tez is used as the compute engine
[ https://issues.apache.org/jira/browse/TEZ-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ahmed Hussein updated TEZ-4252:
---
Release Note: [~yang1] Can you please add a clear description of what the issue is about? It is not recommended that Jiras have empty descriptions. Also, can you please change the title and use a translation so that the Jira will be searchable?

was: [~yang1] Can you please add a clear description of what the issue is about? It is not recommended that Jiras have empty descriptions. Also, can you please change the title and use a translation so that the Jira will be searchable?

> Request for tuning of data-skew scenarios when Tez is used as the compute engine
> -
>
> Key: TEZ-4252
> URL: https://issues.apache.org/jira/browse/TEZ-4252
> Project: Apache Tez
> Issue Type: Wish
> Reporter: yang
> Priority: Minor

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (TEZ-4252) Request for tuning of data-skew scenarios when Tez is used as the compute engine
[ https://issues.apache.org/jira/browse/TEZ-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ahmed Hussein resolved TEZ-4252.
Release Note: [~yang1] Can you please add a clear description of what the issue is about? It is not recommended that Jiras have empty descriptions. Also, can you please change the title and use a translation so that the Jira will be searchable?
Resolution: Invalid

> Request for tuning of data-skew scenarios when Tez is used as the compute engine
> -
>
> Key: TEZ-4252
> URL: https://issues.apache.org/jira/browse/TEZ-4252
> Project: Apache Tez
> Issue Type: Wish
> Reporter: yang
> Priority: Minor

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (TEZ-4119) TestSpeculation is flaky
[ https://issues.apache.org/jira/browse/TEZ-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17049306#comment-17049306 ]

Ahmed Hussein edited comment on TEZ-4119 at 3/2/20 3:18 PM:

Thanks [~abstractdog]. This was very helpful. I had a look at the test case. The main problem with that test case is that it was designed without taking into consideration that the speculator can run as a background service. Once I changed the implementation to make the speculator run in parallel, the test case became flaky. It will take me some time to reimplement the JUnit test according to the new speculator design.

was (Author: ahussein): Thanks [~abstractdog]. This was very helpful. I had a look at the test case. The main problem with that test case is that it was designed without taking into consideration that the speculator can run as a background service.

> TestSpeculation is flaky
>
> Key: TEZ-4119
> URL: https://issues.apache.org/jira/browse/TEZ-4119
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: László Bodor
> Assignee: Ahmed Hussein
> Priority: Major
> Attachments: jstack.log, jstack4.log, jstack6.log, org.apache.tez.dag.app.TestSpeculation-output.txt

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TEZ-4119) TestSpeculation is flaky
[ https://issues.apache.org/jira/browse/TEZ-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17049306#comment-17049306 ] Ahmed Hussein commented on TEZ-4119: Thanks [~abstractdog]. This was very helpful. I had a look at the test case. The main problem with that test case is that it was designed without taking into consideration that the speculator can run as a background service. > TestSpeculation is flaky > > > Key: TEZ-4119 > URL: https://issues.apache.org/jira/browse/TEZ-4119 > Project: Apache Tez > Issue Type: Improvement >Reporter: László Bodor >Assignee: Ahmed Hussein >Priority: Major > Attachments: jstack.log, jstack4.log, jstack6.log, > org.apache.tez.dag.app.TestSpeculation-output.txt > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TEZ-4119) TestSpeculation is flaky
[ https://issues.apache.org/jira/browse/TEZ-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17045614#comment-17045614 ] Ahmed Hussein commented on TEZ-4119: Hey [~abstractdog], Do you still see this error? > TestSpeculation is flaky > > > Key: TEZ-4119 > URL: https://issues.apache.org/jira/browse/TEZ-4119 > Project: Apache Tez > Issue Type: Improvement >Reporter: László Bodor >Assignee: Ahmed Hussein >Priority: Major > Attachments: jstack.log, jstack4.log, jstack6.log, > org.apache.tez.dag.app.TestSpeculation-output.txt > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (TEZ-4106) Add Exponential Smooth RuntimeEstimator to the speculator
[ https://issues.apache.org/jira/browse/TEZ-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ahmed Hussein updated TEZ-4106: --- Attachment: TEZ-4106.006.patch > Add Exponential Smooth RuntimeEstimator to the speculator > - > > Key: TEZ-4106 > URL: https://issues.apache.org/jira/browse/TEZ-4106 > Project: Apache Tez > Issue Type: Improvement >Reporter: Ahmed Hussein >Assignee: Ahmed Hussein >Priority: Major > Attachments: TEZ-4106.001.patch, TEZ-4106.002.patch, > TEZ-4106.003.patch, TEZ-4106.004.patch, TEZ-4106.005.patch, TEZ-4106.006.patch > > > Tez speculator implements start-end runtime estimator. Similar to > [MAPREDUCE-7208|https://issues.apache.org/jira/browse/MAPREDUCE-7208], we > need to implement an adaptive estimator based on smooth Exponential -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TEZ-4119) TestSpeculation is flaky
[ https://issues.apache.org/jira/browse/TEZ-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17027833#comment-17027833 ] Ahmed Hussein commented on TEZ-4119: Thanks [~abstractdog], I will take a look. > TestSpeculation is flaky > > > Key: TEZ-4119 > URL: https://issues.apache.org/jira/browse/TEZ-4119 > Project: Apache Tez > Issue Type: Improvement >Reporter: László Bodor >Assignee: László Bodor >Priority: Major > Attachments: jstack.log, jstack4.log, jstack6.log, > org.apache.tez.dag.app.TestSpeculation-output.txt > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (TEZ-4119) TestSpeculation is flaky
[ https://issues.apache.org/jira/browse/TEZ-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ahmed Hussein reassigned TEZ-4119: -- Assignee: Ahmed Hussein (was: László Bodor) > TestSpeculation is flaky > > > Key: TEZ-4119 > URL: https://issues.apache.org/jira/browse/TEZ-4119 > Project: Apache Tez > Issue Type: Improvement >Reporter: László Bodor >Assignee: Ahmed Hussein >Priority: Major > Attachments: jstack.log, jstack4.log, jstack6.log, > org.apache.tez.dag.app.TestSpeculation-output.txt > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (TEZ-4106) Add Exponential Smooth RuntimeEstimator to the speculator
[ https://issues.apache.org/jira/browse/TEZ-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ahmed Hussein updated TEZ-4106: --- Attachment: TEZ-4106.005.patch > Add Exponential Smooth RuntimeEstimator to the speculator > - > > Key: TEZ-4106 > URL: https://issues.apache.org/jira/browse/TEZ-4106 > Project: Apache Tez > Issue Type: Improvement >Reporter: Ahmed Hussein >Assignee: Ahmed Hussein >Priority: Major > Attachments: TEZ-4106.001.patch, TEZ-4106.002.patch, > TEZ-4106.003.patch, TEZ-4106.004.patch, TEZ-4106.005.patch > > > Tez speculator implements start-end runtime estimator. Similar to > [MAPREDUCE-7208|https://issues.apache.org/jira/browse/MAPREDUCE-7208], we > need to implement an adaptive estimator based on smooth Exponential -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TEZ-4106) Add Exponential Smooth RuntimeEstimator to the speculator
[ https://issues.apache.org/jira/browse/TEZ-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17027822#comment-17027822 ]

Ahmed Hussein commented on TEZ-4106:

Thanks [~jeagles] for the feedback.
{quote}Let's clean up the TezConfiguration names if possible. Does it make sense to put them under a top level tez.am.speculation name? Right now there is speculator, speculative, speculation, so it may not be possible to match perfectly with old configurations.
{quote}
Sure thing.
{quote}Also, TEZ-4119 has been filed to address the flaky tests in TestSpeculation. Do we need to change the patch to account for this?
{quote}
I will address TEZ-4119 separately without changing the current patch.

> Add Exponential Smooth RuntimeEstimator to the speculator
> -
>
> Key: TEZ-4106
> URL: https://issues.apache.org/jira/browse/TEZ-4106
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Ahmed Hussein
> Assignee: Ahmed Hussein
> Priority: Major
> Attachments: TEZ-4106.001.patch, TEZ-4106.002.patch, TEZ-4106.003.patch, TEZ-4106.004.patch
>
> Tez speculator implements a start-end runtime estimator. Similar to
> [MAPREDUCE-7208|https://issues.apache.org/jira/browse/MAPREDUCE-7208], we
> need to implement an adaptive estimator based on exponential smoothing.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TEZ-4119) TestSpeculation is flaky
[ https://issues.apache.org/jira/browse/TEZ-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17027808#comment-17027808 ]

Ahmed Hussein commented on TEZ-4119:

Hi [~abstractdog], thanks for reporting this issue. I recently worked on a similar flaky test case in MAPREDUCE-7259 (testSpeculateSuccessfulWithUpdateEvents fails intermittently). I agree with you that this could be caused by timing issues that make the blocking thread miss the speculator thread. Have you been able to make any progress on that? If not, let me know if you want me to take over.

> TestSpeculation is flaky
>
> Key: TEZ-4119
> URL: https://issues.apache.org/jira/browse/TEZ-4119
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: László Bodor
> Assignee: László Bodor
> Priority: Major
> Attachments: jstack.log, jstack4.log, jstack6.log, org.apache.tez.dag.app.TestSpeculation-output.txt

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (TEZ-3391) Optimize single split MR split reader
[ https://issues.apache.org/jira/browse/TEZ-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ahmed Hussein updated TEZ-3391:
---
Description:
During initialization, each task creates an array of objects {{TaskSplitMetaInfo[]}}. This is unnecessary space and time overhead, as each task needs only its corresponding split object. Besides having {{n^2}} space complexity, the current implementation leaks the input stream.
We need to optimize that implementation by returning only a single object instead of an entire array.

[~rohini] suggested the following:
{quote}
In the vertex construct TaskSplitMetaInfo only for the split of that task instead of constructing for all splits, i.e. change
public static TaskSplitMetaInfo[] readSplitMetaInfo(Configuration conf, FileSystem fs)
to
public static TaskSplitMetaInfo getSplitMetaInfo(Configuration conf, FileSystem fs, int index)
and skip reading splits below the index. If there are 1000 splits, the first task will read 1 split, the second task will read 2 splits, and so on, instead of each task reading all the 1000 splits as is happening now.
{quote}

was:
We had a case where Split metadata size exceeded 1000. Instead of job failing from validation during initialization in AM like mapreduce, each of the tasks failed doing that validation during initialization.

Summary: Optimize single split MR split reader (was: MR split file validation should be done in the AM)

> Optimize single split MR split reader
> -
>
> Key: TEZ-3391
> URL: https://issues.apache.org/jira/browse/TEZ-3391
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Rohini Palaniswamy
> Assignee: Ahmed Hussein
> Priority: Major
> Attachments: TEZ-3391.001.patch, TEZ-3391.002.patch
>
> During initialization, each task creates an array of objects
> {{TaskSplitMetaInfo[]}}. This is unnecessary space and time overhead,
> as each task needs only its corresponding split object. Besides having
> {{n^2}} space complexity, the current implementation leaks the input stream.
> We need to optimize that implementation by returning only a single object
> instead of an entire array.
> [~rohini] suggested the following:
> {quote}
> In the vertex construct TaskSplitMetaInfo only for the split of that task
> instead of constructing for all splits, i.e. change
> public static TaskSplitMetaInfo[] readSplitMetaInfo(Configuration conf, FileSystem fs)
> to
> public static TaskSplitMetaInfo getSplitMetaInfo(Configuration conf, FileSystem fs, int index)
> and skip reading splits below the index. If there are 1000 splits, the first
> task will read 1 split, the second task will read 2 splits, and so on, instead
> of each task reading all the 1000 splits as is happening now.
> {quote}

--
This message was sent by Atlassian Jira (v8.3.4#803005)
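The suggestion quoted above (read only as far as the task's own split, return a single object, and always close the stream) can be sketched as follows. This is a rough illustration under assumptions, not the TEZ-3391 patch and not the real split meta file format: {{SingleSplitReader}}, {{SplitMeta}}, and the record layout (an int count followed by a UTF location and a long offset per split) are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical layout: an int count, then (UTF location, long offset) per split. */
class SingleSplitReader {

  /** Minimal stand-in for a split metadata record. */
  static final class SplitMeta {
    final String location;
    final long startOffset;

    SplitMeta(String location, long startOffset) {
      this.location = location;
      this.startOffset = startOffset;
    }
  }

  /**
   * Returns only the record at the given index. Records before it are read and
   * discarded, records after it are never read, and try-with-resources closes
   * the stream even when an exception is thrown.
   */
  static SplitMeta readSplitMeta(InputStream raw, int index) throws IOException {
    try (DataInputStream in = new DataInputStream(raw)) {
      int numSplits = in.readInt();
      if (index < 0 || index >= numSplits) {
        throw new IOException("Split index " + index + " out of range [0, " + numSplits + ")");
      }
      for (int i = 0; i < index; i++) {
        in.readUTF();  // skip the location of an earlier split
        in.readLong(); // skip its offset
      }
      return new SplitMeta(in.readUTF(), in.readLong());
    }
  }

  /** Helper that produces sample data in the hypothetical layout. */
  static byte[] writeSplitMetas(String[] locations, long[] offsets) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      out.writeInt(locations.length);
      for (int i = 0; i < locations.length; i++) {
        out.writeUTF(locations[i]);
        out.writeLong(offsets[i]);
      }
    }
    return bytes.toByteArray();
  }
}
```

With this shape each task holds one record instead of an array of all of them, at the cost of still scanning the records that precede its own.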
[jira] [Comment Edited] (TEZ-3391) MR split file validation should be done in the AM
[ https://issues.apache.org/jira/browse/TEZ-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17021119#comment-17021119 ]

Ahmed Hussein edited comment on TEZ-3391 at 1/22/20 2:38 PM:
-
I agree with [~rohini] that the implementation is not efficient. The ideal fix is to read the object array {{TaskSplitMetaInfo[]}} only once and do all the validation in the AM, then pass the {{TaskSplitMetaInfo[index]}} to the task initializer. This may imply significant code changes.
The existing code also has significant space overhead: because each task creates an array of split metadata, the code has {{n^2}} space complexity. The patch will reduce the space complexity, but each task still needs to go through the entire meta file.
Finally, the code was not closing the InputStream properly. An exception would leak the handle.
[~jeagles], can you please take a look at the patch and merge it at your convenience?

was (Author: ahussein):
I agree with [~rohini] that the implementation is not efficient. The ideal fix is to read the object array {{TaskSplitMetaInfo[]}} only once and do all the validation in the AM, then pass the {{TaskSplitMetaInfo[index]}} to the task initializer. This may imply significant code changes.
The existing code also has significant space overhead: because each task creates an array of split metadata, the code has {{n^2}} space complexity. The patch will reduce the space complexity, but each task still needs to go through the entire meta file.
[~jeagles], can you please take a look at the patch and merge it at your convenience?

> MR split file validation should be done in the AM
> -
>
> Key: TEZ-3391
> URL: https://issues.apache.org/jira/browse/TEZ-3391
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Rohini Palaniswamy
> Assignee: Ahmed Hussein
> Priority: Major
> Attachments: TEZ-3391.001.patch, TEZ-3391.002.patch
>
> We had a case where Split metadata size exceeded 1000. Instead of job
> failing from validation during initialization in AM like mapreduce, each of
> the tasks failed doing that validation during initialization.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TEZ-3391) MR split file validation should be done in the AM
[ https://issues.apache.org/jira/browse/TEZ-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17021119#comment-17021119 ]

Ahmed Hussein commented on TEZ-3391:

I agree with [~rohini] that the implementation is not efficient. The ideal fix is to read the object array {{TaskSplitMetaInfo[]}} only once and do all the validation in the AM, then pass the {{TaskSplitMetaInfo[index]}} to the task initializer. This may imply significant code changes.
The existing code also has significant space overhead: because each task creates an array of split metadata, the code has {{n^2}} space complexity. The patch will reduce the space complexity, but each task still needs to go through the entire meta file.
[~jeagles], can you please take a look at the patch and merge it at your convenience?

> MR split file validation should be done in the AM
> -
>
> Key: TEZ-3391
> URL: https://issues.apache.org/jira/browse/TEZ-3391
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Rohini Palaniswamy
> Assignee: Ahmed Hussein
> Priority: Major
> Attachments: TEZ-3391.001.patch, TEZ-3391.002.patch
>
> We had a case where Split metadata size exceeded 1000. Instead of job
> failing from validation during initialization in AM like mapreduce, each of
> the tasks failed doing that validation during initialization.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (TEZ-3391) MR split file validation should be done in the AM
[ https://issues.apache.org/jira/browse/TEZ-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ahmed Hussein updated TEZ-3391: --- Attachment: TEZ-3391.002.patch > MR split file validation should be done in the AM > - > > Key: TEZ-3391 > URL: https://issues.apache.org/jira/browse/TEZ-3391 > Project: Apache Tez > Issue Type: Bug >Reporter: Rohini Palaniswamy >Assignee: Ahmed Hussein >Priority: Major > Attachments: TEZ-3391.001.patch, TEZ-3391.002.patch > > > We had a case where Split metadata size exceeded 1000. Instead of job > failing from validation during initialization in AM like mapreduce, each of > the tasks failed doing that validation during initialization. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (TEZ-3391) MR split file validation should be done in the AM
[ https://issues.apache.org/jira/browse/TEZ-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ahmed Hussein updated TEZ-3391: --- Attachment: TEZ-3391.001.patch > MR split file validation should be done in the AM > - > > Key: TEZ-3391 > URL: https://issues.apache.org/jira/browse/TEZ-3391 > Project: Apache Tez > Issue Type: Bug >Reporter: Rohini Palaniswamy >Assignee: Ahmed Hussein >Priority: Major > Attachments: TEZ-3391.001.patch > > > We had a case where Split metadata size exceeded 1000. Instead of job > failing from validation during initialization in AM like mapreduce, each of > the tasks failed doing that validation during initialization. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (TEZ-3391) MR split file validation should be done in the AM
[ https://issues.apache.org/jira/browse/TEZ-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ahmed Hussein reassigned TEZ-3391: -- Assignee: Ahmed Hussein (was: Nishant Dash) > MR split file validation should be done in the AM > - > > Key: TEZ-3391 > URL: https://issues.apache.org/jira/browse/TEZ-3391 > Project: Apache Tez > Issue Type: Bug >Reporter: Rohini Palaniswamy >Assignee: Ahmed Hussein >Priority: Major > > We had a case where Split metadata size exceeded 1000. Instead of job > failing from validation during initialization in AM like mapreduce, each of > the tasks failed doing that validation during initialization. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TEZ-4106) Add Exponential Smooth RuntimeEstimator to the speculator
[ https://issues.apache.org/jira/browse/TEZ-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005483#comment-17005483 ] Ahmed Hussein commented on TEZ-4106: [~jeagles] Can you please review the patch? > Add Exponential Smooth RuntimeEstimator to the speculator > - > > Key: TEZ-4106 > URL: https://issues.apache.org/jira/browse/TEZ-4106 > Project: Apache Tez > Issue Type: Improvement >Reporter: Ahmed Hussein >Assignee: Ahmed Hussein >Priority: Major > Attachments: TEZ-4106.001.patch, TEZ-4106.002.patch, > TEZ-4106.003.patch, TEZ-4106.004.patch > > > Tez speculator implements start-end runtime estimator. Similar to > [MAPREDUCE-7208|https://issues.apache.org/jira/browse/MAPREDUCE-7208], we > need to implement an adaptive estimator based on smooth Exponential -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (TEZ-4106) Add Exponential Smooth RuntimeEstimator to the speculator
[ https://issues.apache.org/jira/browse/TEZ-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ahmed Hussein updated TEZ-4106: --- Attachment: TEZ-4106.004.patch > Add Exponential Smooth RuntimeEstimator to the speculator > - > > Key: TEZ-4106 > URL: https://issues.apache.org/jira/browse/TEZ-4106 > Project: Apache Tez > Issue Type: Improvement >Reporter: Ahmed Hussein >Assignee: Ahmed Hussein >Priority: Major > Attachments: TEZ-4106.001.patch, TEZ-4106.002.patch, > TEZ-4106.003.patch, TEZ-4106.004.patch > > > Tez speculator implements start-end runtime estimator. Similar to > [MAPREDUCE-7208|https://issues.apache.org/jira/browse/MAPREDUCE-7208], we > need to implement an adaptive estimator based on smooth Exponential -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (TEZ-4106) Add Exponential Smooth RuntimeEstimator to the speculator
[ https://issues.apache.org/jira/browse/TEZ-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ahmed Hussein updated TEZ-4106: --- Attachment: TEZ-4106.003.patch > Add Exponential Smooth RuntimeEstimator to the speculator > - > > Key: TEZ-4106 > URL: https://issues.apache.org/jira/browse/TEZ-4106 > Project: Apache Tez > Issue Type: Improvement >Reporter: Ahmed Hussein >Assignee: Ahmed Hussein >Priority: Major > Attachments: TEZ-4106.001.patch, TEZ-4106.002.patch, > TEZ-4106.003.patch > > > Tez speculator implements start-end runtime estimator. Similar to > [MAPREDUCE-7208|https://issues.apache.org/jira/browse/MAPREDUCE-7208], we > need to implement an adaptive estimator based on smooth Exponential -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (TEZ-4106) Add Exponential Smooth RuntimeEstimator to the speculator
[ https://issues.apache.org/jira/browse/TEZ-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ahmed Hussein updated TEZ-4106: --- Attachment: TEZ-4106.002.patch > Add Exponential Smooth RuntimeEstimator to the speculator > - > > Key: TEZ-4106 > URL: https://issues.apache.org/jira/browse/TEZ-4106 > Project: Apache Tez > Issue Type: Improvement >Reporter: Ahmed Hussein >Assignee: Ahmed Hussein >Priority: Major > Attachments: TEZ-4106.001.patch, TEZ-4106.002.patch > > > Tez speculator implements start-end runtime estimator. Similar to > [MAPREDUCE-7208|https://issues.apache.org/jira/browse/MAPREDUCE-7208], we > need to implement an adaptive estimator based on smooth Exponential -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (TEZ-4106) Add Exponential Smooth RuntimeEstimator to the speculator
[ https://issues.apache.org/jira/browse/TEZ-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ahmed Hussein updated TEZ-4106:
---
Description: Tez speculator implements a start-end runtime estimator. Similar to [MAPREDUCE-7208|https://issues.apache.org/jira/browse/MAPREDUCE-7208], we need to implement an adaptive estimator based on exponential smoothing.

> Add Exponential Smooth RuntimeEstimator to the speculator
> -
>
> Key: TEZ-4106
> URL: https://issues.apache.org/jira/browse/TEZ-4106
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Ahmed Hussein
> Assignee: Ahmed Hussein
> Priority: Major
>
> Tez speculator implements a start-end runtime estimator. Similar to
> [MAPREDUCE-7208|https://issues.apache.org/jira/browse/MAPREDUCE-7208], we
> need to implement an adaptive estimator based on exponential smoothing.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
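For readers unfamiliar with the term, simple exponential smoothing keeps a running estimate s and folds in each new sample x as s = alpha * x + (1 - alpha) * s. The sketch below shows only that textbook update; it is not the estimator from the TEZ-4106 patches, and {{ExponentialSmoother}} and its methods are hypothetical names.

```java
/** Hypothetical sketch of simple exponential smoothing; not the TEZ-4106 estimator. */
class ExponentialSmoother {
  private final double alpha; // weight of the newest sample, in (0, 1]
  private double estimate;
  private boolean initialized;

  ExponentialSmoother(double alpha) {
    if (!(alpha > 0.0 && alpha <= 1.0)) {
      throw new IllegalArgumentException("alpha must be in (0, 1], got " + alpha);
    }
    this.alpha = alpha;
  }

  /** Folds one new sample into the estimate and returns the updated value. */
  double update(double sample) {
    if (!initialized) {
      estimate = sample; // the first sample seeds the estimate
      initialized = true;
    } else {
      estimate = alpha * sample + (1.0 - alpha) * estimate;
    }
    return estimate;
  }

  /** Returns the current estimate (0.0 before any sample has been seen). */
  double estimate() {
    return estimate;
  }
}
```

A larger alpha makes the estimate follow recent samples more closely, while a smaller alpha damps short-lived spikes.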
[jira] [Created] (TEZ-4106) Add Exponential Smooth RuntimeEstimator to the speculator
Ahmed Hussein created TEZ-4106: -- Summary: Add Exponential Smooth RuntimeEstimator to the speculator Key: TEZ-4106 URL: https://issues.apache.org/jira/browse/TEZ-4106 Project: Apache Tez Issue Type: Improvement Reporter: Ahmed Hussein Assignee: Ahmed Hussein -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (TEZ-4103) Progress in DAG, Vertex, and tasks is incorrect
[ https://issues.apache.org/jira/browse/TEZ-4103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ahmed Hussein updated TEZ-4103:
---
Attachment: TEZ-4103.006.patch

> Progress in DAG, Vertex, and tasks is incorrect
> ---
>
> Key: TEZ-4103
> URL: https://issues.apache.org/jira/browse/TEZ-4103
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Ahmed Hussein
> Assignee: Ahmed Hussein
> Priority: Major
> Attachments: TEZ-4103.001.patch, TEZ-4103.002.patch, TEZ-4103.003.patch, TEZ-4103.004.patch, TEZ-4103.005.patch, TEZ-4103.006.patch
>
> Looking at the progress code, there are a few issues that could lead to
> problems calculating the progress.
> There are some cases in which the progress never reaches 1.0.
> This is a list of issues that need to be fixed in the progress code:
> * After TEZ-3982, some values are skipped, so in some cases the progress of
> a DAG or a vertex may never reach 1.0f. This affects both {{DAGImpl.java}}
> and {{ProgressHelper.java}}.
> * {{ProgressHelper}} schedules a service to update the progress, dubbed
> {{ProgressHelper.monitorProgress}}. According to the Java documentation:
> {quote}If any execution of the task encounters an exception,
> subsequent executions are suppressed.
> Otherwise, the task will only terminate via cancellation
> or termination of the executor.
> {quote}
> In other words, if the service dies, there is no way to catch that in the
> code and the progress will never be updated.
> * {{SimpleProcessor.inputMap}} is not thread-safe. The map is initialized as
> a {{LinkedHashMap}} and there is no synchronization on the field objects in
> the map. This could be problematic in a concurrent context.
> * {{VertexImpl.getProgress()}} does not check the range of the progress
> calculated in {{VertexImpl.computeProgress()}}.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
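The executor behavior quoted in the description (an exception in one run suppresses all later runs) can be worked around by not letting any exception escape the periodic task. The sketch below illustrates that general technique only; it is not the TEZ-4103 patch, and {{GuardedPeriodicTask}}, {{guarded}}, and {{scheduleGuarded}} are hypothetical names. The only library call it relies on is {{ScheduledExecutorService.scheduleAtFixedRate}}.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical helper: keeps a periodic task alive when one run fails. */
class GuardedPeriodicTask {

  /**
   * Wraps a task so that no Throwable reaches the executor. Catching Throwable
   * also swallows Errors, so the handler should at least log what it receives.
   */
  static Runnable guarded(Runnable task, Consumer<Throwable> onError) {
    return () -> {
      try {
        task.run();
      } catch (Throwable t) {
        onError.accept(t);
      }
    };
  }

  /** Schedules the guarded task at a fixed rate, starting immediately. */
  static ScheduledFuture<?> scheduleGuarded(
      ScheduledExecutorService executor,
      Runnable task,
      Consumer<Throwable> onError,
      long periodMillis) {
    return executor.scheduleAtFixedRate(
        guarded(task, onError), 0L, periodMillis, TimeUnit.MILLISECONDS);
  }
}
```

Because the wrapper returns normally after a failure, the executor keeps scheduling later runs and no second monitoring thread is needed.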
[jira] [Updated] (TEZ-4103) Progress in DAG, Vertex, and tasks is incorrect
[ https://issues.apache.org/jira/browse/TEZ-4103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ahmed Hussein updated TEZ-4103:
---
Attachment: TEZ-4103.005.patch

> Progress in DAG, Vertex, and tasks is incorrect
> ---
>
> Key: TEZ-4103
> URL: https://issues.apache.org/jira/browse/TEZ-4103
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Ahmed Hussein
> Assignee: Ahmed Hussein
> Priority: Major
> Attachments: TEZ-4103.001.patch, TEZ-4103.002.patch, TEZ-4103.003.patch, TEZ-4103.004.patch, TEZ-4103.005.patch
>
> Looking at the progress code, there are a few issues that could lead to
> problems calculating the progress.
> There are some cases in which the progress never reaches 1.0.
> This is a list of issues that need to be fixed in the progress code:
> * After TEZ-3982, some values are skipped, so in some cases the progress of
> a DAG or a vertex may never reach 1.0f. This affects both {{DAGImpl.java}}
> and {{ProgressHelper.java}}.
> * {{ProgressHelper}} schedules a service to update the progress, dubbed
> {{ProgressHelper.monitorProgress}}. According to the Java documentation:
> {quote}If any execution of the task encounters an exception,
> subsequent executions are suppressed.
> Otherwise, the task will only terminate via cancellation
> or termination of the executor.
> {quote}
> In other words, if the service dies, there is no way to catch that in the
> code and the progress will never be updated.
> * {{SimpleProcessor.inputMap}} is not thread-safe. The map is initialized as
> a {{LinkedHashMap}} and there is no synchronization on the field objects in
> the map. This could be problematic in a concurrent context.
> * {{VertexImpl.getProgress()}} does not check the range of the progress
> calculated in {{VertexImpl.computeProgress()}}.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (TEZ-4103) Progress in DAG, Vertex, and tasks is incorrect
[ https://issues.apache.org/jira/browse/TEZ-4103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ahmed Hussein updated TEZ-4103: --- Attachment: TEZ-4103.004.patch -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TEZ-4103) Progress in DAG, Vertex, and tasks is incorrect
[ https://issues.apache.org/jira/browse/TEZ-4103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16986442#comment-16986442 ]

Ahmed Hussein commented on TEZ-4103:
---
{quote}I can see this patch goes through great effort to centralize logging into the ProgressHelper. However, it adds IMHO unnecessarily complex code by using lambdas in log statements as well as separates the condition checking and logging from its origin of error. Unless I'm missing the necessity, I think this code becomes much simpler with a simple if isDebugEnabled() check followed by parameterized LOG.debug statement. Once this is done we can remove the logDebug helper methods.
{quote}
I thought that lambda expressions reduce the overhead, because the expression (i.e., the parameters to the lambda expression and the string formatting) is not evaluated until {{fn.apply()}} is called. I will replace the lambda with a simple {{isDebugEnabled()}} check.
We still need a way to aggregate the progress logging to make it easy to debug. For example, when we use {{isDebugEnabled()}} we need to enable logging for every class that has a {{getProgress()}} method. Logging in one class, on the other hand, makes it easy to enable or disable the debugging of {{getProgress()}}.
{quote}I also wondered about the thread monitoring. Can you help me to understand why a catch (Throwable) wasn't sufficient. As per https://stackoverflow.com/a/24902026. Seems like (though I am not positive) we have created a thread to monitor the other thread.{quote}
I misread the Javadoc and thought that future invocations would halt whenever an exception had been set on the thread in the JVM. I will simplify the code by removing the re-launching piece.
{quote}Functionally, it isn't incorrect to use a LogicalInput that isn't AbstractLogicalInput. While I like logging the non-compliant class as speculative execution is very limited in that scenario, is it too excessive to log that condition every time?{quote}
I saw in the Javadoc that {{AbstractLogicalInput}} has to be the base class for all implementations. If that is the design, then having other implementations should be considered incorrect.
{code:java}
/**
 * An abstract class which should be the base class for all implementations of LogicalInput.
 *
 * This class implements the framework facing as well as user facing methods which need to be
 * implemented by all LogicalInputs.
{code}

--
This message was sent by Atlassian Jira (v8.3.4#803005)
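The logging trade-off discussed in the comment (an explicit level check versus a lambda that defers message construction) can be sketched as follows. To keep the example self-contained it uses `java.util.logging` rather than the logging API Tez itself uses, which is an assumption of this sketch; `LazyLogDemo`, `expensiveMessage`, and `run` are hypothetical names. In both styles the message is only built when the level is enabled.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;
import java.util.logging.Level;
import java.util.logging.Logger;

final class LazyLogDemo {
  /** Counts how often the (pretend-expensive) message is actually built. */
  static final AtomicInteger FORMAT_CALLS = new AtomicInteger();

  static String expensiveMessage(float progress) {
    FORMAT_CALLS.incrementAndGet();
    return String.format("progress=%.2f", progress);
  }

  /** Logs once per style and returns how many times the message was built. */
  static int run(Logger log, float progress) {
    FORMAT_CALLS.set(0);

    // Style 1: explicit guard, as suggested in the review.
    if (log.isLoggable(Level.FINE)) {
      log.fine(expensiveMessage(progress));
    }

    // Style 2: supplier (lambda); the logger evaluates it only if FINE is enabled.
    Supplier<String> message = () -> expensiveMessage(progress);
    log.fine(message);

    return FORMAT_CALLS.get();
  }

  public static void main(String[] args) {
    Logger log = Logger.getLogger("lazy-log-demo");
    log.setUseParentHandlers(false); // keep the demo quiet
    log.setLevel(Level.INFO);
    System.out.println("builds with FINE disabled: " + run(log, 0.5f));
    log.setLevel(Level.FINE);
    System.out.println("builds with FINE enabled: " + run(log, 0.5f));
  }
}
```

Either style avoids the formatting cost when debug output is off, so the choice is mostly about readability and about where the logging code lives.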
[jira] [Updated] (TEZ-4103) Progress in DAG, Vertex, and tasks is incorrect
[ https://issues.apache.org/jira/browse/TEZ-4103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ahmed Hussein updated TEZ-4103: --- Attachment: TEZ-4103.003.patch -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (TEZ-4103) Progress in DAG, Vertex, and tasks is incorrect
[ https://issues.apache.org/jira/browse/TEZ-4103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ahmed Hussein updated TEZ-4103: --- Attachment: TEZ-4103.002.patch -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TEZ-4103) Progress in DAG, Vertex, and tasks is incorrect
[ https://issues.apache.org/jira/browse/TEZ-4103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16983810#comment-16983810 ] Ahmed Hussein commented on TEZ-4103: Changing the data structure of the inputs into a thread-safe implementation will need lots of changes across the source code. It is better to keep that in a separate Jira. -- This message was sent by Atlassian Jira (v8.3.4#803005)
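For context on what a thread-safe replacement for an insertion-ordered map might look like, here is a minimal sketch. It is not the change deferred to the separate Jira; `SyncInputMapDemo`, `register`, and `snapshotNames` are hypothetical names. Note that the wrapper only protects the map structure itself, not the objects stored in it, which the issue description also flags.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SyncInputMapDemo {
  // LinkedHashMap keeps insertion order; the wrapper serializes individual calls.
  private final Map<String, Object> inputs =
      Collections.synchronizedMap(new LinkedHashMap<>());

  void register(String name, Object input) {
    inputs.put(name, input);
  }

  /** Iteration over a synchronized wrapper must hold the wrapper's lock manually. */
  List<String> snapshotNames() {
    synchronized (inputs) {
      return new ArrayList<>(inputs.keySet());
    }
  }

  public static void main(String[] args) {
    SyncInputMapDemo demo = new SyncInputMapDemo();
    demo.register("input-a", new Object());
    demo.register("input-b", new Object());
    System.out.println(demo.snapshotNames());
  }
}
```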
[jira] [Updated] (TEZ-4103) Progress in DAG, Vertex, and tasks is incorrect
[ https://issues.apache.org/jira/browse/TEZ-4103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ahmed Hussein updated TEZ-4103: --- Attachment: TEZ-4103.001.patch -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (TEZ-4103) Progress in DAG, Vertex, and tasks is incorrect
Ahmed Hussein created TEZ-4103:
--
Summary: Progress in DAG, Vertex, and tasks is incorrect
Key: TEZ-4103
URL: https://issues.apache.org/jira/browse/TEZ-4103
Project: Apache Tez
Issue Type: Bug
Reporter: Ahmed Hussein
Assignee: Ahmed Hussein

Looking at the progress code, there are a few issues that could lead to problems calculating the progress.
In some cases the progress never reaches 1.0.
This is a list of issues that need to be fixed in the progress code:
* After TEZ-3982, values are skipped in some cases, so the progress of a DAG or a vertex may never reach 1.0f. This is in both "{{DAGImpl.java}}" and "{{ProgressHelper.java}}".
* {{ProgressHelper}} schedules a service to update the progress, dubbed `{{ProgressHelper.monitorProgress}}`. According to the Java documentation:
{quote}If any execution of the task encounters an exception, subsequent executions are suppressed. Otherwise, the task will only terminate via cancellation or termination of the executor.
{quote}
In other words, if the service dies, there is no way to catch that in the code and the progress will never be updated.
* The `{{SimpleProcessor.inputMap}}` is not thread-safe. It is initialized as a `{{LinkedHashMap}}` and there is no synchronization on the field objects in the map. This could be problematic in a concurrent context.
* `{{VertexImpl.getProgress()}}` does not check the range of the progress calculated in `{{VertexImpl.computeProgress()}}`.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
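The scheduled-executor behavior quoted in the description (an exception in one run suppresses all later runs) can be worked around by catching everything inside the periodic task. The sketch below shows that idea with the standard `ScheduledExecutorService`; `GuardedMonitorDemo`, `guarded`, and `survivesFailure` are hypothetical names and this is not the code from the attached patches.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class GuardedMonitorDemo {
  /** Wraps a periodic task so that a failure in one run does not cancel later runs. */
  static Runnable guarded(Runnable task) {
    return () -> {
      try {
        task.run();
      } catch (Throwable t) {
        // Report and swallow; rethrowing would suppress all subsequent executions.
        System.err.println("progress update failed: " + t);
      }
    };
  }

  /** Schedules a task that fails on its first run; true if it still ran wantedRuns times. */
  static boolean survivesFailure(int wantedRuns) {
    ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
    AtomicInteger runs = new AtomicInteger();
    CountDownLatch done = new CountDownLatch(wantedRuns);
    Runnable flaky = () -> {
      done.countDown();
      if (runs.incrementAndGet() == 1) {
        throw new IllegalStateException("simulated failure in first run");
      }
    };
    try {
      executor.scheduleWithFixedDelay(guarded(flaky), 0, 10, TimeUnit.MILLISECONDS);
      return done.await(5, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    } finally {
      executor.shutdownNow();
    }
  }

  public static void main(String[] args) {
    System.out.println("kept running after failure: " + survivesFailure(3));
  }
}
```

Scheduling `flaky` directly, without the `guarded` wrapper, would stop after the first run, which is the failure mode the description warns about.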
[jira] [Updated] (TEZ-4067) Tez Speculation decision is calculated on each update by the dispatcher
[ https://issues.apache.org/jira/browse/TEZ-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ahmed Hussein updated TEZ-4067:
---
Attachment: TEZ-4067.008.patch

> Tez Speculation decision is calculated on each update by the dispatcher
> ---
>
> Key: TEZ-4067
> URL: https://issues.apache.org/jira/browse/TEZ-4067
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Ahmed Hussein
> Assignee: Ahmed Hussein
> Priority: Minor
> Attachments: TEZ-4067.001.patch, TEZ-4067.002.patch, TEZ-4067.003.patch, TEZ-4067.004.patch, TEZ-4067.005.patch, TEZ-4067.006.patch, TEZ-4067.007.patch, TEZ-4067.008.patch
>
> LegacySpeculator is an object field in VertexImpl. Therefore, all events are handled synchronously by the caller (dispatcher). This implies the following:
> # The dispatcher spends a long time executing updateStatus, as it needs to check the runtime estimation of the tezAttempts within the vertex.
> # The speculator is per stage: launching a speculation may not be the optimal decision. Ideally, based on resources, the speculated tasks should be the ones with the slowest progress.
> # The time between speculations is skewed because there is a big delay for the dispatcher to complete a full cycle. Also, speculation will be more aggressive compared to MR, because MR waits for "soonest.retry.after.speculate" whenever a task is speculated. Tez, on the other hand, speculates more tasks as it processes stages in parallel.
>

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (TEZ-4067) Tez Speculation decision is calculated on each update by the dispatcher
[ https://issues.apache.org/jira/browse/TEZ-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ahmed Hussein updated TEZ-4067: --- Attachment: TEZ-4067.007.patch -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (TEZ-4067) Tez Speculation decision is calculated on each update by the dispatcher
[ https://issues.apache.org/jira/browse/TEZ-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ahmed Hussein updated TEZ-4067: --- Attachment: TEZ-4067.006.patch -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TEZ-4067) Tez Speculation decision is calculated on each update by the dispatcher
[ https://issues.apache.org/jira/browse/TEZ-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16978739#comment-16978739 ]

Ahmed Hussein commented on TEZ-4067:
---
[~jeagles], I tried to refresh my memory a little bit. There was a check on the service state to prevent starting the service more than once.
The workflow of the {{DAGAppMaster}} is as follows (correct me if I am wrong):
* {{DAGAppMaster}} is created.
* Services get initialized. This is the phase when the services are added to the "{{DAGAppMaster.services}}" map.
* All the services are started inside {{serviceStart.startServices()}}. Note that the {{DAG}} is not created yet.
* {{startDag()}} and {{startDagExecution}} finally create the DAG "{{currentDAG}}" and its vertices.

This workflow requires that speculators are initialized and started separately, after the DAG is created. Although we can still add them to the services map, we cannot assume that they will start automatically in {{DAGAppMaster.serviceStart()}}.
The same applies to {{DAGAppMaster.serviceStop()}}, which is called at the end of the execution. A service in the "{{DAGAppMaster.services}}" map will therefore stay around until the whole DAG is completed. Since a vertex can complete earlier, the speculator service for that vertex will hang around until the {{DAGAppMaster}} is completed.
If we add the speculators to "{{DAGAppMaster.services}}", we won't be able to remove the service when a vertex is completed, since a {{Vertex/DAGImpl}} does not have access to "{{DAGAppMaster.services}}".
I am almost done implementing the code based on your suggestions. If you think it is acceptable to have speculators stay alive until the DAG is completed, I will go ahead and upload the patch. Otherwise, I will make a few changes to remove the speculator of a completed vertex. Let me know what you think.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
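The lifecycle question raised in the comment (start the speculator only once the vertex exists, stop it when the vertex completes, and keep the dispatcher from doing estimation work) can be sketched with a per-vertex worker thread fed by a queue. This is an illustration under assumptions, not the patch: `VertexSpeculatorSketch` and its methods are hypothetical names, and it deliberately does not use Hadoop's service classes.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class VertexSpeculatorSketch {
  private final BlockingQueue<String> updates = new LinkedBlockingQueue<>();
  private final AtomicInteger processed = new AtomicInteger();
  private volatile boolean running;
  private Thread worker;

  /** Called after the vertex has been created; a second call is a no-op. */
  synchronized void start() {
    if (running) {
      return;
    }
    running = true;
    worker = new Thread(this::loop, "speculator-sketch");
    worker.setDaemon(true);
    worker.start();
  }

  /** Dispatcher side: a cheap hand-off, no estimation work on the caller's thread. */
  void handleStatusUpdate(String attemptId) {
    updates.offer(attemptId);
  }

  private void loop() {
    while (running) {
      try {
        String attemptId = updates.poll(50, TimeUnit.MILLISECONDS);
        if (attemptId != null) {
          // A real speculator would refresh its runtime estimates here.
          processed.incrementAndGet();
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
    }
  }

  /** Called when the vertex completes, so the thread does not outlive the vertex. */
  synchronized void stop() {
    running = false;
    if (worker != null) {
      worker.interrupt();
      try {
        worker.join(1000);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
  }

  int processedCount() {
    return processed.get();
  }

  boolean isAlive() {
    return worker != null && worker.isAlive();
  }
}
```

With this shape the owner of the vertex controls both `start()` and `stop()`, which avoids depending on a central services map that the vertex cannot reach.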
[jira] [Updated] (TEZ-4067) Tez Speculation decision is calculated on each update by the dispatcher
[ https://issues.apache.org/jira/browse/TEZ-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ahmed Hussein updated TEZ-4067: --- Attachment: TEZ-4067.005.patch -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (TEZ-4067) Tez Speculation decision is calculated on each update by the dispatcher
[ https://issues.apache.org/jira/browse/TEZ-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16975416#comment-16975416 ] Ahmed Hussein edited comment on TEZ-4067 at 11/15/19 9:34 PM: -- Thanks Jon! Sure, I will change that and create a new patch. was (Author: ahussein): Thanks Jon!Sure, I will change that and create a new patch. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TEZ-4067) Tez Speculation decision is calculated on each update by the dispatcher
[ https://issues.apache.org/jira/browse/TEZ-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16975416#comment-16975416 ] Ahmed Hussein commented on TEZ-4067: Thanks Jon!Sure, I will change that and create a new patch. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (TEZ-4067) Tez Speculation decision is calculated on each update by the dispatcher
[ https://issues.apache.org/jira/browse/TEZ-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ahmed Hussein updated TEZ-4067: --- Attachment: TEZ-4067.004.patch -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TEZ-4067) Tez Speculation decision is calculated on each update by the dispatcher
[ https://issues.apache.org/jira/browse/TEZ-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969556#comment-16969556 ] Ahmed Hussein commented on TEZ-4067: Uploaded a new patch to fix error reported in checkstyle and windbags. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (TEZ-4067) Tez Speculation decision is calculated on each update by the dispatcher
[ https://issues.apache.org/jira/browse/TEZ-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969556#comment-16969556 ] Ahmed Hussein edited comment on TEZ-4067 at 11/7/19 9:07 PM: - Uploaded a new patch to fix error reported in checkstyle and findbugs. was (Author: ahussein): Uploaded a new patch to fix error reported in checkstyle and windbags. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (TEZ-4067) Tez Speculation decision is calculated on each update by the dispatcher
[ https://issues.apache.org/jira/browse/TEZ-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ahmed Hussein updated TEZ-4067:
---
Attachment: TEZ-4067.003.patch
[jira] [Commented] (TEZ-4074) Tez does not run with Hadoop Trunk (3.3.0-snapshot)
[ https://issues.apache.org/jira/browse/TEZ-4074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16853032#comment-16853032 ]

Ahmed Hussein commented on TEZ-4074:

Guava 27 and Guava 11.0.2 are not source compatible. For example, Guava 27 removed API methods such as {{Futures.addCallback(ListenableFuture future, FutureCallback callback)}}; the method linked for Guava 27 below takes an additional {{Executor}} argument.
* Guava 11.0.2: [FutureCallback|https://google.github.io/guava/releases/11.0.2/api/docs/com/google/common/util/concurrent/Futures.html#addCallback(com.google.common.util.concurrent.ListenableFuture,%20com.google.common.util.concurrent.FutureCallback)]
* Guava 27: [FutureCallback|https://static.javadoc.io/com.google.guava/guava/27.0.1-jre/com/google/common/util/concurrent/Futures.html#addCallback-com.google.common.util.concurrent.ListenableFuture-com.google.common.util.concurrent.FutureCallback-java.util.concurrent.Executor-]

> Tez does not run with Hadoop Trunk (3.3.0-snapshot)
> ---
>
> Key: TEZ-4074
> URL: https://issues.apache.org/jira/browse/TEZ-4074
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Ahmed Hussein
> Priority: Major
>
> Tez throws a runtime exception when compiled against Hadoop-3.3.0.
> With Tez running Guava 11.0.2 and Hadoop running Guava 27.0-jre (see HADOOP-16210), the two Guava library versions are incompatible.
> {code:java}
> [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.136 s <<< FAILURE! - in org.apache.tez.dag.app.TestSpeculation
> [ERROR] org.apache.tez.dag.app.TestSpeculation Time elapsed: 0.136 s <<< ERROR!
> java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V
> at org.apache.tez.dag.app.TestSpeculation.setupSpeculation(TestSpeculation.java:86)
> {code}
> It looks like Guava added single-parameter optimizations, which break compatibility with {{VAR_ARGS}}. So, even though it shows source compatibility, it throws a runtime error due to binary incompatibility.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (TEZ-4074) Tez does not run with Hadoop Trunk (3.3.0-snapshot)
Ahmed Hussein created TEZ-4074:
--
Summary: Tez does not run with Hadoop Trunk (3.3.0-snapshot)
Key: TEZ-4074
URL: https://issues.apache.org/jira/browse/TEZ-4074
Project: Apache Tez
Issue Type: Bug
Reporter: Ahmed Hussein

Tez throws a runtime exception when compiled against Hadoop-3.3.0.
With Tez running Guava 11.0.2 and Hadoop running Guava 27.0-jre (see HADOOP-16210), the two Guava library versions are incompatible.
{code:java}
[ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.136 s <<< FAILURE! - in org.apache.tez.dag.app.TestSpeculation
[ERROR] org.apache.tez.dag.app.TestSpeculation Time elapsed: 0.136 s <<< ERROR!
java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V
at org.apache.tez.dag.app.TestSpeculation.setupSpeculation(TestSpeculation.java:86)
{code}
It looks like Guava added single-parameter optimizations, which break compatibility with {{VAR_ARGS}}. So, even though it shows source compatibility, it throws a runtime error due to binary incompatibility.
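The binary incompatibility described above can be reproduced without Guava. The following is a minimal sketch under stated assumptions: {{OldPreconditions}} and {{NewPreconditions}} are hypothetical stand-ins for an older and a newer library release, not Guava classes. When a caller is compiled against the newer stand-in, javac prefers the fixed-arity overload over the varargs one, so the call site records the descriptor {{(ZLjava/lang/String;Ljava/lang/Object;)V}} seen in the stack trace. The reflection lookup below mirrors what the JVM does at link time.

```java
/** Checks whether a library class carries the fixed-arity overload a caller was linked against. */
final class BinaryCompatDemo {
    /** True if the class declares checkArgument(boolean, String, Object), i.e. (ZLjava/lang/String;Ljava/lang/Object;)V. */
    static boolean hasSingleArgOverload(Class<?> library) {
        try {
            library.getDeclaredMethod("checkArgument", boolean.class, String.class, Object.class);
            return true;
        } catch (NoSuchMethodException e) {
            // At link time the JVM reports the same situation as NoSuchMethodError.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("old release has overload: " + hasSingleArgOverload(OldPreconditions.class));
        System.out.println("new release has overload: " + hasSingleArgOverload(NewPreconditions.class));
    }
}

/** Hypothetical stand-in for an older release: only the varargs overload exists. */
final class OldPreconditions {
    static void checkArgument(boolean ok, String template, Object... args) {
        if (!ok) {
            throw new IllegalArgumentException(String.format(template, args));
        }
    }
}

/** Hypothetical stand-in for a newer release: a fixed-arity overload sits next to the varargs one. */
final class NewPreconditions {
    static void checkArgument(boolean ok, String template, Object... args) {
        if (!ok) {
            throw new IllegalArgumentException(String.format(template, args));
        }
    }

    // A call with exactly one extra argument binds here at compile time, not to the varargs method.
    static void checkArgument(boolean ok, String template, Object arg) {
        if (!ok) {
            throw new IllegalArgumentException(String.format(template, arg));
        }
    }
}
```

The same source line compiles against either stand-in, which is why the problem only shows up at runtime when the older jar is the one on the classpath.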
[jira] [Issue Comment Deleted] (TEZ-4067) Tez Speculation decision is calculated on each update by the dispatcher
[ https://issues.apache.org/jira/browse/TEZ-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ahmed Hussein updated TEZ-4067:
---
Comment: was deleted

(was: TEZ-1897)
[jira] [Commented] (TEZ-4067) Tez Speculation decision is calculated on each update by the dispatcher
[ https://issues.apache.org/jira/browse/TEZ-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852045#comment-16852045 ]

Ahmed Hussein commented on TEZ-4067:

TEZ-1897
[jira] [Updated] (TEZ-4067) Tez Speculation decision is calculated on each update by the dispatcher
[ https://issues.apache.org/jira/browse/TEZ-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ahmed Hussein updated TEZ-4067:
---
Attachment: (was: YARN-4067.002.patch)
[jira] [Updated] (TEZ-4067) Tez Speculation decision is calculated on each update by the dispatcher
[ https://issues.apache.org/jira/browse/TEZ-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ahmed Hussein updated TEZ-4067:
---
Attachment: YARN-4067.002.patch
[jira] [Updated] (TEZ-4067) Tez Speculation decision is calculated on each update by the dispatcher
[ https://issues.apache.org/jira/browse/TEZ-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ahmed Hussein updated TEZ-4067:
---
Attachment: TEZ-4067.002.patch
[jira] [Updated] (TEZ-4067) Tez Speculation decision is calculated on each update by the dispatcher
[ https://issues.apache.org/jira/browse/TEZ-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ahmed Hussein updated TEZ-4067:
---
Attachment: YARN-9563.002.patch
[jira] [Updated] (TEZ-4067) Tez Speculation decision is calculated on each update by the dispatcher
[ https://issues.apache.org/jira/browse/TEZ-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ahmed Hussein updated TEZ-4067:
---
Attachment: (was: YARN-9563.002.patch)
[jira] [Updated] (TEZ-4067) Tez Speculation decision is calculated on each update by the dispatcher
[ https://issues.apache.org/jira/browse/TEZ-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ahmed Hussein updated TEZ-4067:
---
Attachment: TEZ-4067.001.patch
[jira] [Comment Edited] (TEZ-2164) Shade the guava version used by Tez and move to guava-18
[ https://issues.apache.org/jira/browse/TEZ-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849772#comment-16849772 ]

Ahmed Hussein edited comment on TEZ-2164 at 5/28/19 2:29 PM:

Hadoop upgraded guava to 27.0-jre (HADOOP-16210).
{code:java}
TEZ running 11.0.2 fails with runtime exceptions
[ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.136 s <<< FAILURE! - in org.apache.tez.dag.app.TestSpeculation
[ERROR] org.apache.tez.dag.app.TestSpeculation Time elapsed: 0.136 s <<< ERROR!
java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V
at org.apache.tez.dag.app.TestSpeculation.setupSpeculation(TestSpeculation.java:86)
{code}

was (Author: ahussein):
Hadoop upgraded guava to 27.0-jre ([HADOOP-16210|https://issues.apache.org/jira/browse/HADOOP-16210]).
TEZ running 11.0.2 fails with runtime exceptions
[ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.136 s <<< FAILURE! - in org.apache.tez.dag.app.TestSpeculation
[ERROR] org.apache.tez.dag.app.TestSpeculation Time elapsed: 0.136 s <<< ERROR!
java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V
at org.apache.tez.dag.app.TestSpeculation.setupSpeculation(TestSpeculation.java:86)

> Shade the guava version used by Tez and move to guava-18
>
> Key: TEZ-2164
> URL: https://issues.apache.org/jira/browse/TEZ-2164
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Siddharth Seth
> Assignee: Hitesh Shah
> Priority: Blocker
> Attachments: TEZ-2164.3.patch, TEZ-2164.4.patch, TEZ-2164.wip.2.patch, allow-guava-16.0.1.patch
>
> Should allow us to upgrade to a newer version without shipping a guava dependency.
> Would be good to do this in 0.7 so that we stop shipping guava as early as possible.
[jira] [Commented] (TEZ-2164) Shade the guava version used by Tez and move to guava-18
[ https://issues.apache.org/jira/browse/TEZ-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849772#comment-16849772 ]

Ahmed Hussein commented on TEZ-2164:

Hadoop upgraded guava to 27.0-jre ([HADOOP-16210|https://issues.apache.org/jira/browse/HADOOP-16210]).
TEZ running 11.0.2 fails with runtime exceptions
[ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.136 s <<< FAILURE! - in org.apache.tez.dag.app.TestSpeculation
[ERROR] org.apache.tez.dag.app.TestSpeculation Time elapsed: 0.136 s <<< ERROR!
java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V
at org.apache.tez.dag.app.TestSpeculation.setupSpeculation(TestSpeculation.java:86)
[jira] [Commented] (TEZ-4067) Tez Speculation decision is calculated on each update by the dispatcher
[ https://issues.apache.org/jira/browse/TEZ-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16847860#comment-16847860 ]

Ahmed Hussein commented on TEZ-4067:

An older issue, [TEZ-3934|https://issues.apache.org/jira/browse/TEZ-3934], reported the race condition in the speculator code. When two taskAttempts update their progress simultaneously, the speculator may create two speculative attempts for the same task. That jira was closed after adding two more checks on the hashes to verify that no attempt was speculated while the current thread was busy with the calculation. This does not solve the root problem, which is caused by calling maybeSpeculate() after updating the progress.

A proper fix would be:
* The event handler returns after updating the taskAttempt status.
* A separate "speculator" thread runs periodically to scan the tasks within a vertex and calculate the speculation.

Re-implementing the speculator as a service requires the following changes:
# Add each vertex's speculator to the list of services in the application master (i.e., DAGAppMaster).
# api/DAG needs to support creating the vertex speculator as a service.
# Test cases (TestSpeculation) may need to be rewritten because they were designed for a single-threaded implementation.
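To make the race described in this thread concrete (two attempts reporting progress at the same time, each leading to a speculative attempt for the same task), here is a minimal sketch of an atomic claim. It is not the TEZ-3934 patch and not Tez code: {{SpeculationGuard}}, {{tryClaim}} and {{release}} are hypothetical names, and task ids are plain strings for illustration.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Tracks which tasks already have a speculative attempt, using one atomic step per claim. */
final class SpeculationGuard {
    // Concurrent set: add() is atomic, so check-and-insert cannot interleave between threads.
    private final Set<String> tasksWithSpeculation = ConcurrentHashMap.newKeySet();

    /** Returns true for exactly one caller per task id, even when several threads call at once. */
    boolean tryClaim(String taskId) {
        return tasksWithSpeculation.add(taskId);
    }

    /** Lets the task be considered again, for example after its speculative attempt has ended. */
    void release(String taskId) {
        tasksWithSpeculation.remove(taskId);
    }

    public static void main(String[] args) {
        SpeculationGuard guard = new SpeculationGuard();
        System.out.println(guard.tryClaim("task_1")); // first claim succeeds
        System.out.println(guard.tryClaim("task_1")); // second claim for the same task is refused
    }
}
```

A guard like this only removes the duplicate launch. It does not address the cost of running the scan inside the event handler, which is what the separate speculator thread proposed in this thread is meant to fix.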
[jira] [Commented] (TEZ-4067) Tez Speculation decision is calculated on each update by the dispatcher
[ https://issues.apache.org/jira/browse/TEZ-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844220#comment-16844220 ]

Ahmed Hussein commented on TEZ-4067:

A concurrent async dispatcher was added in TEZ-1897. By default, the AsyncDispatcher is disabled. To enable the concurrent dispatcher, the following needs to be passed to the TezConfiguration:
{noformat}
-Dtez.am.use.concurrent-dispatcher=true
{noformat}
# The AsyncDispatcher may not be ideal for production because each Task/TaskAttempt implies a notify event on the blocking queue. For status updates, it may be faster to do the update within one thread rather than passing a new event between two threads.
# The frequency of events could overwhelm the pool workers, and events won't be processed on time.
# For both the synchronous and asynchronous dispatchers, there is no mechanism to prevent two different workers from scanning the vertex tasks. In that case, the workers would duplicate the work without any gain in productivity.

Suggested fix:
# Keep the asyncDispatcher disabled.
# In LegacySpeculator, remove "maybeSpeculate" from "notifyAttemptStatusUpdate()". This will prevent the event handler from executing the main speculation loop.
# Create a thread per speculator to execute "maybeSpeculate" every "soonestRetryAfterSpeculate/soonestRetryAfterNoSpeculate".
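The suggested fix in this thread (the event handler only records the status, and a dedicated thread runs the scan on a delay that depends on whether anything was speculated) could look roughly like the sketch below. This is an illustration under stated assumptions, not the attached patch: {{PeriodicSpeculator}}, {{onStatusUpdate}}, {{runOnce}} and the {{scan}} callback are hypothetical names, and the two delays stand in for soonestRetryAfterSpeculate and soonestRetryAfterNoSpeculate.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.ToIntFunction;

/** Runs the speculation scan on its own thread instead of inside the event handler. */
final class PeriodicSpeculator implements AutoCloseable {
    private final Map<String, Float> latestProgress = new ConcurrentHashMap<>();
    // Hypothetical scan callback: inspects the progress map, returns how many attempts it launched.
    private final ToIntFunction<Map<String, Float>> scan;
    private final long retryAfterSpeculateMs;
    private final long retryAfterNoSpeculateMs;
    private final ScheduledExecutorService executor =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "speculator");
            t.setDaemon(true);
            return t;
        });

    PeriodicSpeculator(ToIntFunction<Map<String, Float>> scan,
                       long retryAfterSpeculateMs, long retryAfterNoSpeculateMs) {
        this.scan = scan;
        this.retryAfterSpeculateMs = retryAfterSpeculateMs;
        this.retryAfterNoSpeculateMs = retryAfterNoSpeculateMs;
    }

    /** Called from the dispatcher: records the status and returns without scanning. */
    void onStatusUpdate(String attemptId, float progress) {
        latestProgress.put(attemptId, progress);
    }

    /** One scan; returns the delay in milliseconds to wait before the next scan. */
    long runOnce() {
        int launched = scan.applyAsInt(latestProgress);
        return launched > 0 ? retryAfterSpeculateMs : retryAfterNoSpeculateMs;
    }

    /** Starts the background loop; each scan schedules the next one. */
    void start() {
        scheduleNext(0L);
    }

    private void scheduleNext(long delayMs) {
        try {
            executor.schedule(() -> scheduleNext(runOnce()), delayMs, TimeUnit.MILLISECONDS);
        } catch (RejectedExecutionException e) {
            // close() was called; stop rescheduling.
        }
    }

    @Override
    public void close() {
        executor.shutdownNow();
    }
}
```

Because a single thread owns the scan, two workers can no longer scan the same vertex at the same time, and the delay between scans no longer depends on how long the dispatcher takes to complete a cycle. In this sketch an exception thrown by the scan callback ends the loop; a real service would need to decide how to handle that.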
[jira] [Assigned] (TEZ-4067) Tez Speculation decision is calculated on each update by the dispatcher
[ https://issues.apache.org/jira/browse/TEZ-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ahmed Hussein reassigned TEZ-4067:
--
Assignee: Ahmed Hussein
[jira] [Created] (TEZ-4067) Tez Speculation decision is calculated on each update by the dispatcher
Ahmed Hussein created TEZ-4067:
--
Summary: Tez Speculation decision is calculated on each update by the dispatcher
Key: TEZ-4067
URL: https://issues.apache.org/jira/browse/TEZ-4067
Project: Apache Tez
Issue Type: Improvement
Reporter: Ahmed Hussein

LegacySpeculator is an object field in VertexImpl. Therefore, all events are handled synchronously by the caller (dispatcher). This implies the following:
# The dispatcher spends a long time executing updateStatus because it needs to check the runtime estimation of the tezAttempts within the vertex.
# The speculator is per stage: launching a speculation may not be the optimum decision. Ideally, based on resources, speculated tasks should be the ones with the slowest progress.
# The time between speculations is skewed because there is a big delay for the dispatcher to complete a full cycle. Also, speculation will be more aggressive compared to MR because MR waits for "soonest.retry.after.speculate" whenever a task is speculated. On the other hand, Tez speculates more tasks as it processes stages in parallel.
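The second point in the description (with limited resources, speculate the tasks with the slowest progress) can be illustrated with a small selection routine. This is only a sketch: {{SlowestFirst}}, {{pick}} and the use of a plain progress fraction per task id are assumptions for illustration, and the existing speculator works from runtime estimates rather than raw progress.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Picks speculation candidates across all tasks, slowest progress first. */
final class SlowestFirst {
    /**
     * Returns at most 'budget' task ids, ordered from lowest to highest progress.
     * The budget stands in for whatever resource limit is available for speculation.
     */
    static List<String> pick(Map<String, Double> progressByTask, int budget) {
        return progressByTask.entrySet().stream()
            .sorted(Map.Entry.comparingByValue())
            .limit(Math.max(0, budget))
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }
}
```

Ranking candidates in one place like this is easier when a single speculator thread sees the tasks together than when each status update triggers its own decision.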