[jira] [Commented] (TEZ-4179) [Kubernetes] Extend NodeId in tez to support unique worker identity

2020-05-18 Thread Rajesh Balamohan (Jira)


[ 
https://issues.apache.org/jira/browse/TEZ-4179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17110767#comment-17110767
 ] 

Rajesh Balamohan commented on TEZ-4179:
---

Minor change before commit: Mark the fields as final (e.g. ExtendedNodeId, 
nodeId, host, port, uniqueId)

> [Kubernetes] Extend NodeId in tez to support unique worker identity
> ---
>
> Key: TEZ-4179
> URL: https://issues.apache.org/jira/browse/TEZ-4179
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Prasanth Jayachandran
>Assignee: Attila Magyar
>Priority: Major
> Attachments: TEZ-4179.1.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In a Kubernetes environment where pods can have the same hostname and port, 
> node trackers can end up retaining an old instance of a pod in their cache. 
> In Hive LLAP, where the LLAP Tez task scheduler maintains node membership 
> based on ZooKeeper registry events, a NODE_ADDED event followed by a 
> NODE_REMOVED event can remove the node/host from the node trackers because 
> the hostname and service port are stable. The NODE_REMOVED event in this 
> case is a stale event for the already dead pod, which ZK sends only after 
> the session timeout (in case of a non-graceful shutdown). If this sequence 
> of events happens, a node/host is completely lost from the scheduler's 
> perspective. 
> To support this scenario, Tez can extend YARN's NodeId to include a 
> uniqueIdentifier. The LLAP task scheduler can construct the container object 
> with this new NodeId, so that stale events like the one above only remove 
> the host/node that matches the old uniqueIdentifier. 
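
To illustrate the idea in the description, here is a minimal, self-contained 
Java sketch. It does not use YARN's NodeId or the ExtendedNodeId class from 
the patch: WorkerId, NodeTracker and StaleEventDemo are hypothetical names 
made up for this example, as are the host, port and pod identifiers. It only 
shows why including a unique identifier in equals/hashCode keeps a stale 
NODE_REMOVED event from evicting a replacement pod that reuses the same 
hostname and port.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Hypothetical stand-in for a NodeId extended with a unique identifier.
// This is NOT the ExtendedNodeId class from the TEZ-4179 patch.
final class WorkerId {
  private final String host;
  private final int port;
  private final String uniqueId;

  WorkerId(String host, int port, String uniqueId) {
    this.host = Objects.requireNonNull(host, "host");
    this.port = port;
    this.uniqueId = Objects.requireNonNull(uniqueId, "uniqueId");
  }

  @Override
  public boolean equals(Object o) {
    if (this == o) {
      return true;
    }
    if (!(o instanceof WorkerId)) {
      return false;
    }
    WorkerId other = (WorkerId) o;
    // Two pods with the same host and port are still different workers
    // when their unique identifiers differ.
    return port == other.port
        && host.equals(other.host)
        && uniqueId.equals(other.uniqueId);
  }

  @Override
  public int hashCode() {
    return Objects.hash(host, port, uniqueId);
  }
}

// Hypothetical node tracker that keys its membership on WorkerId.
final class NodeTracker {
  private final Set<WorkerId> nodes = new HashSet<>();

  void nodeAdded(WorkerId id) {
    nodes.add(id);
  }

  // Returns true only if exactly this worker instance was being tracked.
  boolean nodeRemoved(WorkerId id) {
    return nodes.remove(id);
  }

  boolean contains(WorkerId id) {
    return nodes.contains(id);
  }
}

class StaleEventDemo {
  public static void main(String[] args) {
    NodeTracker tracker = new NodeTracker();
    WorkerId oldPod = new WorkerId("llap-0", 15001, "pod-a");
    WorkerId newPod = new WorkerId("llap-0", 15001, "pod-b");

    tracker.nodeAdded(oldPod);   // original pod registers
    tracker.nodeAdded(newPod);   // replacement pod, same host and port
    tracker.nodeRemoved(oldPod); // stale removal arrives after session timeout

    System.out.println("new pod still tracked: " + tracker.contains(newPod));
  }
}
```

With a key made of host and port only, the stale removal in this sketch would 
have dropped the replacement pod as well.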



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (TEZ-4183) Time- and threshold-batched FetchFailure event propagation to AM

2020-05-18 Thread Prasanth Jayachandran (Jira)


[ 
https://issues.apache.org/jira/browse/TEZ-4183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17110709#comment-17110709
 ] 

Prasanth Jayachandran commented on TEZ-4183:


The patch looks good in general. +1 (non-binding). It would be good to have 
someone from Tez review the change.

> Time- and threshold-batched FetchFailure event propagation to AM
> 
>
> Key: TEZ-4183
> URL: https://issues.apache.org/jira/browse/TEZ-4183
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Panagiotis Garefalakis
>Assignee: Panagiotis Garefalakis
>Priority: Major
> Attachments: TEZ-4183.01.patch, TEZ-4183.02.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Time-based batching can put a lot of pressure on the AM's memory, as the 
> failedEvents hashmap can grow pretty fast: 
> https://github.com/apache/tez/blob/354c2a4177fe8c3cf6b8a4c6009d4068a19d81f1/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle/impl/ShuffleManager.java#L951
> To reduce AM pressure we can: 1) batch fetch failure events to be sent 
> periodically (every BATCH_WAIT), and 2) if we see more disk errors than a 
> threshold, send the message to the AM immediately (instead of waiting).





[jira] [Commented] (TEZ-4179) [Kubernetes] Extend NodeId in tez to support unique worker identity

2020-05-18 Thread Rajesh Balamohan (Jira)


[ 
https://issues.apache.org/jira/browse/TEZ-4179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17110697#comment-17110697
 ] 

Rajesh Balamohan commented on TEZ-4179:
---

This would increase the memory footprint on the AM side, but would be 
specific to K8s deployments.

[~prasanth_j], [~amagyar]: Can you link the Hive ticket that would make use of 
the ExtendedNodeId change, for later reference?

LGTM, +1.

> [Kubernetes] Extend NodeId in tez to support unique worker identity
> ---
>
> Key: TEZ-4179
> URL: https://issues.apache.org/jira/browse/TEZ-4179
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Prasanth Jayachandran
>Assignee: Attila Magyar
>Priority: Major
> Attachments: TEZ-4179.1.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In a Kubernetes environment where pods can have the same hostname and port, 
> node trackers can end up retaining an old instance of a pod in their cache. 
> In Hive LLAP, where the LLAP Tez task scheduler maintains node membership 
> based on ZooKeeper registry events, a NODE_ADDED event followed by a 
> NODE_REMOVED event can remove the node/host from the node trackers because 
> the hostname and service port are stable. The NODE_REMOVED event in this 
> case is a stale event for the already dead pod, which ZK sends only after 
> the session timeout (in case of a non-graceful shutdown). If this sequence 
> of events happens, a node/host is completely lost from the scheduler's 
> perspective. 
> To support this scenario, Tez can extend YARN's NodeId to include a 
> uniqueIdentifier. The LLAP task scheduler can construct the container object 
> with this new NodeId, so that stale events like the one above only remove 
> the host/node that matches the old uniqueIdentifier. 





[jira] [Commented] (TEZ-4178) Create separate method for repeated logic of ensuring spill file permission

2020-05-18 Thread TezQA (Jira)


[ 
https://issues.apache.org/jira/browse/TEZ-4178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17110687#comment-17110687
 ] 

TezQA commented on TEZ-4178:


| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 32s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  0s{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m  0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 34s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 24s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 28s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 25s{color} | {color:green} master passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  0m 55s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 53s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 16s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 15s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 15s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 16s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m  0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 16s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 51s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  4m 22s{color} | {color:green} tez-runtime-library in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m  7s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 14m 25s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.8 Server=19.03.8 base: https://builds.apache.org/job/PreCommit-TEZ-Build/440/artifact/out/Dockerfile |
| JIRA Issue | TEZ-4178 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/13003327/TEZ-4178.04.patch |
| Optional Tests | dupname asflicense javac javadoc unit spotbugs findbugs checkstyle compile |
| uname | Linux e896e38f4689 4.15.0-74-generic #84-Ubuntu SMP Thu Dec 19 08:06:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | personality/tez.sh |
| git revision | master / 07c807b |
| Default Java | 1.8.0_252 |
| Test Results | https://builds.apache.org/job/PreCommit-TEZ-Build/440/testReport/ |
| Max. process+thread count | 126 (vs. ulimit of 5500) |
| modules | C: tez-runtime-library U: tez-runtime-library |
| Console output | https://builds.apache.org/job/PreCommit-TEZ-Build/440/console |
| versions | git=2.7.4 maven=3.3.9 findbugs=3.0.1 |
| Powered by | Apache Yetus 0.11.1 https://yetus.apache.org |


This message was automatically generated.



> Create separate method for repeated logic of ensuring spill file permission
> ---
>
> Key: TEZ-4178
> URL: https://issues.apache.org/jira/browse/TEZ-4178
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: László Bodor
>Assignee: László Bodor
>Priority: Major
> Attachments: TEZ-4178.01.patch, TEZ-4178.02.patch, TEZ-4178.03.patch, 
> TEZ-4178.04.patch
>
>





[jira] [Commented] (TEZ-4178) Create separate method for repeated logic of ensuring spill file permission

2020-05-18 Thread Jira


[ 
https://issues.apache.org/jira/browse/TEZ-4178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17110654#comment-17110654
 ] 

László Bodor commented on TEZ-4178:
---

Thanks for the comments, [~jeagles], [~gopalv].
I've uploaded [^TEZ-4178.04.patch] to address _"it will be better to pass 
in FsPermissions instead of the conf"_.
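
As a rough illustration of that review point, the sketch below shows a single 
helper that takes the desired permissions as a parameter instead of looking 
them up from a configuration object. It is an analogy only: it uses the JDK's 
java.nio.file API rather than Hadoop's FileSystem and FsPermission, and 
SpillFiles, ensurePermissions, SpillPermissionDemo and the "rw-r-----" 
permission string are hypothetical, not names or values from the patch. It 
assumes a POSIX file system.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.Set;

// Hypothetical helper; not the method introduced by TEZ-4178.
final class SpillFiles {
  private SpillFiles() {
  }

  // Sets the permissions on the file only if they differ from the expected
  // ones. Returns true if the permissions were changed.
  static boolean ensurePermissions(Path file, Set<PosixFilePermission> expected)
      throws IOException {
    Set<PosixFilePermission> actual = Files.getPosixFilePermissions(file);
    if (actual.equals(expected)) {
      return false;
    }
    Files.setPosixFilePermissions(file, expected);
    return true;
  }
}

class SpillPermissionDemo {
  public static void main(String[] args) throws IOException {
    // The caller resolves the permissions once and passes them in.
    Set<PosixFilePermission> perms = PosixFilePermissions.fromString("rw-r-----");
    Path spill = Files.createTempFile("spill", ".out");
    try {
      SpillFiles.ensurePermissions(spill, perms);
      System.out.println(
          PosixFilePermissions.toString(Files.getPosixFilePermissions(spill)));
    } finally {
      Files.deleteIfExists(spill);
    }
  }
}
```

Passing the permissions in keeps the helper free of configuration lookups, so 
every call site can share it and it can be tested without a configuration 
object.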

> Create separate method for repeated logic of ensuring spill file permission
> ---
>
> Key: TEZ-4178
> URL: https://issues.apache.org/jira/browse/TEZ-4178
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: László Bodor
>Assignee: László Bodor
>Priority: Major
> Attachments: TEZ-4178.01.patch, TEZ-4178.02.patch, TEZ-4178.03.patch, 
> TEZ-4178.04.patch
>
>






[jira] [Updated] (TEZ-4178) Create separate method for repeated logic of ensuring spill file permission

2020-05-18 Thread Jira


 [ 
https://issues.apache.org/jira/browse/TEZ-4178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

László Bodor updated TEZ-4178:
--
Attachment: TEZ-4178.04.patch

> Create separate method for repeated logic of ensuring spill file permission
> ---
>
> Key: TEZ-4178
> URL: https://issues.apache.org/jira/browse/TEZ-4178
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: László Bodor
>Assignee: László Bodor
>Priority: Major
> Attachments: TEZ-4178.01.patch, TEZ-4178.02.patch, TEZ-4178.03.patch, 
> TEZ-4178.04.patch
>
>






[jira] [Updated] (TEZ-4178) Create separate method for repeated logic of ensuring spill file permission

2020-05-18 Thread Jira


 [ 
https://issues.apache.org/jira/browse/TEZ-4178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

László Bodor updated TEZ-4178:
--
Attachment: (was: TEZ-4178.03.patch)

> Create separate method for repeated logic of ensuring spill file permission
> ---
>
> Key: TEZ-4178
> URL: https://issues.apache.org/jira/browse/TEZ-4178
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: László Bodor
>Assignee: László Bodor
>Priority: Major
> Attachments: TEZ-4178.01.patch, TEZ-4178.02.patch, TEZ-4178.03.patch, 
> TEZ-4178.04.patch
>
>






[jira] [Updated] (TEZ-4178) Create separate method for repeated logic of ensuring spill file permission

2020-05-18 Thread Jira


 [ 
https://issues.apache.org/jira/browse/TEZ-4178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

László Bodor updated TEZ-4178:
--
Attachment: TEZ-4178.03.patch

> Create separate method for repeated logic of ensuring spill file permission
> ---
>
> Key: TEZ-4178
> URL: https://issues.apache.org/jira/browse/TEZ-4178
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: László Bodor
>Assignee: László Bodor
>Priority: Major
> Attachments: TEZ-4178.01.patch, TEZ-4178.02.patch, TEZ-4178.03.patch, 
> TEZ-4178.04.patch
>
>






[jira] [Updated] (TEZ-4183) Time- and threshold-batched FetchFailure event propagation to AM

2020-05-18 Thread Panagiotis Garefalakis (Jira)


 [ 
https://issues.apache.org/jira/browse/TEZ-4183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Panagiotis Garefalakis updated TEZ-4183:

Description: 
Time-based batching can put a lot of pressure on the AM's memory, as the 
failedEvents hashmap can grow pretty fast: 
https://github.com/apache/tez/blob/354c2a4177fe8c3cf6b8a4c6009d4068a19d81f1/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle/impl/ShuffleManager.java#L951

To reduce AM pressure we can: 1) batch fetch failure events to be sent 
periodically (every BATCH_WAIT), and 2) if we see more disk errors than a 
threshold, send the message to the AM immediately (instead of waiting).

  was:
Time-based batching can put a lot of pressure on the AM's memory, as the 
failedEvents hashmap can grow fast: 
https://github.com/apache/tez/blob/354c2a4177fe8c3cf6b8a4c6009d4068a19d81f1/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle/impl/ShuffleManager.java#L951

To reduce AM pressure we can: 1) batch fetch failure events to be sent 
periodically (every BATCH_WAIT), and 2) if we see more disk errors than a 
threshold, send the message to the AM immediately (instead of waiting).


> Time- and threshold-batched FetchFailure event propagation to AM
> 
>
> Key: TEZ-4183
> URL: https://issues.apache.org/jira/browse/TEZ-4183
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Panagiotis Garefalakis
>Assignee: Panagiotis Garefalakis
>Priority: Major
> Attachments: TEZ-4183.01.patch, TEZ-4183.02.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Time-based batching can put a lot of pressure on the AM's memory, as the 
> failedEvents hashmap can grow pretty fast: 
> https://github.com/apache/tez/blob/354c2a4177fe8c3cf6b8a4c6009d4068a19d81f1/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle/impl/ShuffleManager.java#L951
> To reduce AM pressure we can: 1) batch fetch failure events to be sent 
> periodically (every BATCH_WAIT), and 2) if we see more disk errors than a 
> threshold, send the message to the AM immediately (instead of waiting).





[jira] [Updated] (TEZ-4183) Time- and threshold-batched FetchFailure event propagation to AM

2020-05-18 Thread Panagiotis Garefalakis (Jira)


 [ 
https://issues.apache.org/jira/browse/TEZ-4183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Panagiotis Garefalakis updated TEZ-4183:

Description: 
Time-based batching can put a lot of pressure on the AM's memory, as the 
failedEvents hashmap can grow fast: 
https://github.com/apache/tez/blob/354c2a4177fe8c3cf6b8a4c6009d4068a19d81f1/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle/impl/ShuffleManager.java#L951

To reduce AM pressure we can: 1) batch fetch failure events to be sent 
periodically (every BATCH_WAIT), and 2) if we see more disk errors than a 
threshold, send the message to the AM immediately (instead of waiting).

  was:
Fetcher currently sends failure events to AM as soon as they are discovered:
https://github.com/apache/tez/blob/354c2a4177fe8c3cf6b8a4c6009d4068a19d81f1/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle/impl/ShuffleManager.java#L930

To reduce AM pressure we can: 1) batch fetch failure events to be sent 
periodically (every BATCH_WAIT), and 2) if we see more disk errors than a 
threshold, send the message to the AM immediately (instead of waiting).


> Time- and threshold-batched FetchFailure event propagation to AM
> 
>
> Key: TEZ-4183
> URL: https://issues.apache.org/jira/browse/TEZ-4183
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Panagiotis Garefalakis
>Assignee: Panagiotis Garefalakis
>Priority: Major
> Attachments: TEZ-4183.01.patch, TEZ-4183.02.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Time-based batching can put a lot of pressure on the AM's memory, as the 
> failedEvents hashmap can grow fast: 
> https://github.com/apache/tez/blob/354c2a4177fe8c3cf6b8a4c6009d4068a19d81f1/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle/impl/ShuffleManager.java#L951
> To reduce AM pressure we can: 1) batch fetch failure events to be sent 
> periodically (every BATCH_WAIT), and 2) if we see more disk errors than a 
> threshold, send the message to the AM immediately (instead of waiting).





[jira] [Commented] (TEZ-4183) Time- and threshold-batched FetchFailure event propagation to AM

2020-05-18 Thread TezQA (Jira)


[ 
https://issues.apache.org/jira/browse/TEZ-4183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17110492#comment-17110492
 ] 

TezQA commented on TEZ-4183:


| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 34s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  0s{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m  0s{color} | {color:green} The patch appears to include 2 new or modified test files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 22s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m  4s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 56s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 54s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  1s{color} | {color:green} master passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  0m 55s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 50s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m  9s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 33s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 32s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 32s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  0m 12s{color} | {color:orange} tez-api: The patch generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  0m 14s{color} | {color:orange} tez-runtime-library: The patch generated 3 new + 79 unchanged - 9 fixed = 82 total (was 88) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m  0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 37s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 42s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  1m 48s{color} | {color:green} tez-api in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  4m 46s{color} | {color:green} tez-runtime-library in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 17s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 21m 39s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.8 Server=19.03.8 base: https://builds.apache.org/job/PreCommit-TEZ-Build/439/artifact/out/Dockerfile |
| JIRA Issue | TEZ-4183 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/13003304/TEZ-4183.02.patch |
| Optional Tests | dupname asflicense javac javadoc unit spotbugs findbugs checkstyle compile |
| uname | Linux aecbe8cdaf22 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | personality/tez.sh |
| git revision | master / 07c807b |
| Default Java | 1.8.0_252 |
| checkstyle | https://builds.apache.org/job/PreCommit-TEZ-Build/439/artifact/out/diff-checkstyle-tez-api.txt |
| checkstyle | https://builds.apache.org/job/PreCommit-TEZ-Build/439/artifact/out/diff-checkstyle-tez-runtime-library.txt |
|  Test Results | 

[jira] [Updated] (TEZ-4183) Time- and threshold-batched FetchFailure event propagation to AM

2020-05-18 Thread Panagiotis Garefalakis (Jira)


 [ 
https://issues.apache.org/jira/browse/TEZ-4183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Panagiotis Garefalakis updated TEZ-4183:

Attachment: TEZ-4183.02.patch

> Time- and threshold-batched FetchFailure event propagation to AM
> 
>
> Key: TEZ-4183
> URL: https://issues.apache.org/jira/browse/TEZ-4183
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Panagiotis Garefalakis
>Assignee: Panagiotis Garefalakis
>Priority: Major
> Attachments: TEZ-4183.01.patch, TEZ-4183.02.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Fetcher currently sends failure events to AM as soon as they are discovered:
> https://github.com/apache/tez/blob/354c2a4177fe8c3cf6b8a4c6009d4068a19d81f1/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle/impl/ShuffleManager.java#L930
> To reduce AM pressure we can: 1) batch fetch failure events to be sent 
> periodically (every BATCH_WAIT), and 2) if we see more disk errors than a 
> threshold, send the message to the AM immediately (instead of waiting).





[jira] [Commented] (TEZ-4183) Time- and threshold-batched FetchFailure event propagation to AM

2020-05-18 Thread TezQA (Jira)


[ 
https://issues.apache.org/jira/browse/TEZ-4183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17110459#comment-17110459
 ] 

TezQA commented on TEZ-4183:


| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 34s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  0s{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m  0s{color} | {color:green} The patch appears to include 2 new or modified test files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 21s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  3m 58s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 55s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 51s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  1s{color} | {color:green} master passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  0m 55s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 51s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m  8s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 33s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 31s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 31s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  0m 12s{color} | {color:orange} tez-api: The patch generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  0m 15s{color} | {color:orange} tez-runtime-library: The patch generated 3 new + 79 unchanged - 9 fixed = 82 total (was 88) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m  0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 37s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 39s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  1m 43s{color} | {color:green} tez-api in the patch passed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red}  4m 48s{color} | {color:red} tez-runtime-library in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 16s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 21m 21s{color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | tez.runtime.library.common.shuffle.TestFetcher |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.8 Server=19.03.8 base: https://builds.apache.org/job/PreCommit-TEZ-Build/438/artifact/out/Dockerfile |
| JIRA Issue | TEZ-4183 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/13003297/TEZ-4183.01.patch |
| Optional Tests | dupname asflicense javac javadoc unit spotbugs findbugs checkstyle compile |
| uname | Linux bf0530081cef 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | personality/tez.sh |
| git revision | master / 07c807b |
| Default Java | 1.8.0_252 |
| checkstyle | https://builds.apache.org/job/PreCommit-TEZ-Build/438/artifact/out/diff-checkstyle-tez-api.txt |
| checkstyle | 

[jira] [Updated] (TEZ-4183) Time- and threshold-batched FetchFailure event propagation to AM

2020-05-18 Thread Panagiotis Garefalakis (Jira)


 [ 
https://issues.apache.org/jira/browse/TEZ-4183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Panagiotis Garefalakis updated TEZ-4183:

Attachment: TEZ-4183.01.patch

> Time- and threshold-batched FetchFailure event propagation to AM
> 
>
> Key: TEZ-4183
> URL: https://issues.apache.org/jira/browse/TEZ-4183
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Panagiotis Garefalakis
>Assignee: Panagiotis Garefalakis
>Priority: Major
> Attachments: TEZ-4183.01.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Fetcher currently sends failure events to AM as soon as they are discovered:
> https://github.com/apache/tez/blob/354c2a4177fe8c3cf6b8a4c6009d4068a19d81f1/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle/impl/ShuffleManager.java#L930
> To reduce AM pressure we can: 1) batch fetch failure events to be sent 
> periodically (every BATCH_WAIT), and 2) if we see more disk errors than a 
> threshold, send the message to the AM immediately (instead of waiting).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (TEZ-4183) Time- and threshold-batched FetchFailure event propagation to AM

2020-05-18 Thread Ashutosh Chauhan (Jira)


 [ 
https://issues.apache.org/jira/browse/TEZ-4183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan reassigned TEZ-4183:
-

Assignee: Panagiotis Garefalakis

> Time- and threshold-batched FetchFailure event propagation to AM
> 
>
> Key: TEZ-4183
> URL: https://issues.apache.org/jira/browse/TEZ-4183
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Panagiotis Garefalakis
>Assignee: Panagiotis Garefalakis
>Priority: Major
> Attachments: TEZ-4183.01.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Fetcher currently sends failure events to AM as soon as they are discovered:
> https://github.com/apache/tez/blob/354c2a4177fe8c3cf6b8a4c6009d4068a19d81f1/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle/impl/ShuffleManager.java#L930
> To reduce AM pressure we can: 1) batch fetch failure events to be sent 
> periodically (every BATCH_WAIT), and 2) if we see more disk errors than a 
> threshold, send the message to the AM immediately (instead of waiting).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (TEZ-4183) Time- and threshold-batched FetchFailure event propagation to AM

2020-05-18 Thread Panagiotis Garefalakis (Jira)


[ 
https://issues.apache.org/jira/browse/TEZ-4183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17110290#comment-17110290
 ] 

Panagiotis Garefalakis edited comment on TEZ-4183 at 5/18/20, 3:46 PM:
---

Hey [~jeagles], thanks for the extra details, I found them pretty useful.
I created a patch for the unordered Fetcher that can now keep track of diskRead 
errors (similar to the unordered one) and makes use of both time- and 
threshold-based batching.

In more detail, the AM is informed by that Fetcher:
* Immediately, if maxTimeToWaitForReportMillis is 0 (similar to 
reportReadErrorImmediately in unordered Fetcher)
* When the elapsed time exceeds SHUFFLE_BATCH_WAIT ms (batch events)
* When more than THRESHOLD readErrors have occurred for a particular 
task_attempt -- maxFetchFailuresBeforeReporting, 5 by default (batch events)

Thoughts here? cc: [~abstractdog] [~prasanth_j]


was (Author: pgaref):
Hey [~jeagles], thanks for the extra details, I found them pretty useful.
I created a patch for the unordered Fetcher that can now keep track of diskRead 
errors (similar to the unordered one) and makes use of both time- and 
threshold-based batching.

In more detail, the AM is informed by that Fetcher:
* Immediately, if maxTimeToWaitForReportMillis is 0 (similar to 
reportReadErrorImmediately in unordered Fetcher)
* When the elapsed time exceeds SHUFFLE_BATCH_WAIT ms (batch events)
* When more than THRESHOLD readErrors have occurred for a particular 
task_attempt -- maxFetchFailuresBeforeReporting, 5 by default (batch events)

Thoughts here? cc: [~abstractdog]

> Time- and threshold-batched FetchFailure event propagation to AM
> 
>
> Key: TEZ-4183
> URL: https://issues.apache.org/jira/browse/TEZ-4183
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Panagiotis Garefalakis
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Fetcher currently sends failure events to AM as soon as they are discovered:
> https://github.com/apache/tez/blob/354c2a4177fe8c3cf6b8a4c6009d4068a19d81f1/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle/impl/ShuffleManager.java#L930
> To reduce AM pressure we can: 1) batch fetch failure events and send them 
> periodically (every BATCH_WAIT), and 2) if the number of disk errors exceeds 
> a threshold, send the message to the AM immediately (instead of waiting).





[jira] [Commented] (TEZ-4183) Time- and threshold-batched FetchFailure event propagation to AM

2020-05-18 Thread Panagiotis Garefalakis (Jira)


[ 
https://issues.apache.org/jira/browse/TEZ-4183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17110290#comment-17110290
 ] 

Panagiotis Garefalakis commented on TEZ-4183:
-

Hey [~jeagles], thanks for the extra details, I found them pretty useful.
I created a patch for the unordered Fetcher that can now keep track of diskRead 
errors (similar to the unordered one) and makes use of both time- and 
threshold-based batching.

In more detail, the AM is informed by that Fetcher:
* Immediately, if maxTimeToWaitForReportMillis is 0 (similar to 
reportReadErrorImmediately in unordered Fetcher)
* When the elapsed time exceeds SHUFFLE_BATCH_WAIT ms (batch events)
* When more than THRESHOLD readErrors have occurred for a particular 
task_attempt -- maxFetchFailuresBeforeReporting, 5 by default (batch events)

Thoughts here? cc: [~abstractdog]
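
For readers following the thread, the three reporting rules in the comment above can be sketched roughly as below. This is an illustration under stated assumptions, not the code in the attached patch: the class FetchFailureBatcher, its methods, and the use of plain strings for task attempts are hypothetical. Only the names maxTimeToWaitForReportMillis and maxFetchFailuresBeforeReporting come from the comment.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; this is not a Tez class.
final class FetchFailureBatcher {
  private final long maxTimeToWaitForReportMillis;   // 0 means: report immediately
  private final int maxFetchFailuresBeforeReporting; // threshold per task attempt
  private final List<String> pending = new ArrayList<>();
  private final Map<String, Integer> errorsPerAttempt = new HashMap<>();
  private long lastReportMillis;

  FetchFailureBatcher(long maxTimeToWaitForReportMillis,
                      int maxFetchFailuresBeforeReporting,
                      long nowMillis) {
    this.maxTimeToWaitForReportMillis = maxTimeToWaitForReportMillis;
    this.maxFetchFailuresBeforeReporting = maxFetchFailuresBeforeReporting;
    this.lastReportMillis = nowMillis;
  }

  // Buffers one read error. Returns the events to send to the AM now,
  // or an empty list if the batch should keep waiting.
  synchronized List<String> onReadError(String taskAttemptId, long nowMillis) {
    pending.add(taskAttemptId);
    int errors = errorsPerAttempt.merge(taskAttemptId, 1, Integer::sum);

    boolean immediate = maxTimeToWaitForReportMillis == 0;
    boolean waitedLongEnough =
        nowMillis - lastReportMillis > maxTimeToWaitForReportMillis;
    boolean overThreshold = errors > maxFetchFailuresBeforeReporting;

    if (immediate || waitedLongEnough || overThreshold) {
      return drain(nowMillis);
    }
    return Collections.emptyList();
  }

  // Hands over everything buffered so far and restarts the wait window.
  private List<String> drain(long nowMillis) {
    List<String> batch = new ArrayList<>(pending);
    pending.clear();
    errorsPerAttempt.clear();
    lastReportMillis = nowMillis;
    return batch;
  }
}
```

The sketch only evaluates the rules when a new error arrives. A real implementation would presumably also need a timer so that a batch is flushed even when no further errors arrive.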

> Time- and threshold-batched FetchFailure event propagation to AM
> 
>
> Key: TEZ-4183
> URL: https://issues.apache.org/jira/browse/TEZ-4183
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Panagiotis Garefalakis
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Fetcher currently sends failure events to AM as soon as they are discovered:
> https://github.com/apache/tez/blob/354c2a4177fe8c3cf6b8a4c6009d4068a19d81f1/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle/impl/ShuffleManager.java#L930
> To reduce AM pressure we can: 1) batch fetch failure events and send them 
> periodically (every BATCH_WAIT), and 2) if the number of disk errors exceeds 
> a threshold, send the message to the AM immediately (instead of waiting).





[jira] [Commented] (TEZ-2103) Implement a Partial completion VertexManagerPlugin

2020-05-18 Thread Syed Shameerur Rahman (Jira)


[ 
https://issues.apache.org/jira/browse/TEZ-2103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17110149#comment-17110149
 ] 

Syed Shameerur Rahman commented on TEZ-2103:


 [~gopalv] [~jeagles] Could you please review the design decisions in 
*TEZ-2103.WIP.patch*? I am not sure this is the right approach to the 
problem, but here is a summary.

*Tez Side:*

1) On a *V_TASK_COMPLETED* event, check whether the task succeeded; if so, 
compute the number of records it produced from the task counters 
(*groupName*: HIVE, *counterName*: RECORDS_OUT_0).

2) Check whether the maxRow limit has been exceeded; if so, send a 
TaskEventTermination event to all remaining tasks. (Marking the state of those 
tasks/task attempts as killed due to the short circuit makes more sense than 
marking them as SUCCEEDED.)

3) Finally, mark the vertex as SUCCEEDED.


*Hive Side:*

1) Hive has a class, *GlobalLimitOptimizer*, which can detect simple 
select queries with a where clause and a limit, and extract the defined limit 
from such queries.

2) Pass the extracted limit from Hive as dagConf, which can be used to set the 
limit's value in Tez.

I am not sure about Pig's use case.
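
The Tez-side steps above can be sketched in plain Java as follows. This is a rough illustration under assumptions, not the code in TEZ-2103.WIP.patch: the class RowLimitShortCircuit and its methods are hypothetical, no Tez APIs are used, and recordsOut stands in for the RECORDS_OUT_0 counter value. The sketch also treats the limit as satisfied once it is reached (>=), which is a choice made here rather than something the comment specifies.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical bookkeeping for one vertex; plain Java, no Tez APIs.
final class RowLimitShortCircuit {
  private final long maxRows;
  private final Set<Integer> pendingTasks = new LinkedHashSet<>();
  private long rowsSoFar;
  private boolean vertexSucceeded;

  RowLimitShortCircuit(int numTasks, long maxRows) {
    this.maxRows = maxRows;
    for (int i = 0; i < numTasks; i++) {
      pendingTasks.add(i);
    }
  }

  // Called once per completed task. Returns the indexes of the tasks that
  // should be terminated because the row limit is already satisfied.
  List<Integer> onTaskCompleted(int taskIndex, boolean succeeded, long recordsOut) {
    // Step 1: only successful, still-pending tasks contribute rows.
    if (vertexSucceeded || !succeeded
        || !pendingTasks.remove(Integer.valueOf(taskIndex))) {
      return Collections.emptyList();
    }
    rowsSoFar += recordsOut;

    // Step 2: once the limit is reached, every remaining task is terminated.
    List<Integer> toTerminate = new ArrayList<>();
    if (rowsSoFar >= maxRows) {
      toTerminate.addAll(pendingTasks);
      pendingTasks.clear();
    }

    // Step 3: with nothing left to run, the vertex counts as succeeded.
    if (pendingTasks.isEmpty()) {
      vertexSucceeded = true;
    }
    return toTerminate;
  }

  boolean isVertexSucceeded() {
    return vertexSucceeded;
  }
}
```

Failed tasks are simply left pending in this sketch, on the assumption that they would be retried; how retries interact with the short circuit is not covered here.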

> Implement a Partial completion VertexManagerPlugin
> --
>
> Key: TEZ-2103
> URL: https://issues.apache.org/jira/browse/TEZ-2103
> Project: Apache Tez
>  Issue Type: New Feature
>Reporter: Gopal Vijayaraghavan
>Priority: Major
>  Labels: gsoc, gsoc2015, hadoop, java, tez
> Attachments: TEZ-2103.WIP.patch
>
>
> Currently, there is no sibling communication between tasks - this implies 
> that a task can be completed by the first vertex in a wave of tasks, but the 
> entire wave of tasks has to complete before success can be reported.
> This occurs in limit + filter query patterns common between the data access 
> engines.
> {code}
> select * from data where x > 1 limit 10;
> {code}
> will run through a full-table scan worth of tasks to generate 10 rows per 
> task, to aggregate it to produce the final 10 row result.
> The VertexManager receives counters/events early enough to short-circuit the 
> rest of the vertex tasks, to prevent the remainder of tasks from getting 
> scheduled when the limit condition has been satisfied by an initial sub-set 
> of the tasks.
> This is a specialization of the VertexManagerPlugin for this common case 
> scheduling pattern.





[jira] [Updated] (TEZ-2103) Implement a Partial completion VertexManagerPlugin

2020-05-18 Thread Syed Shameerur Rahman (Jira)


 [ 
https://issues.apache.org/jira/browse/TEZ-2103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Syed Shameerur Rahman updated TEZ-2103:
---
Attachment: TEZ-2103.WIP.patch

> Implement a Partial completion VertexManagerPlugin
> --
>
> Key: TEZ-2103
> URL: https://issues.apache.org/jira/browse/TEZ-2103
> Project: Apache Tez
>  Issue Type: New Feature
>Reporter: Gopal Vijayaraghavan
>Priority: Major
>  Labels: gsoc, gsoc2015, hadoop, java, tez
> Attachments: TEZ-2103.WIP.patch
>
>
> Currently, there is no sibling communication between tasks - this implies 
> that a task can be completed by the first vertex in a wave of tasks, but the 
> entire wave of tasks has to complete before success can be reported.
> This occurs in limit + filter query patterns common between the data access 
> engines.
> {code}
> select * from data where x > 1 limit 10;
> {code}
> will run through a full-table scan worth of tasks to generate 10 rows per 
> task, to aggregate it to produce the final 10 row result.
> The VertexManager receives counters/events early enough to short-circuit the 
> rest of the vertex tasks, to prevent the remainder of tasks from getting 
> scheduled when the limit condition has been satisfied by an initial sub-set 
> of the tasks.
> This is a specialization of the VertexManagerPlugin for this common case 
> scheduling pattern.





[jira] [Commented] (TEZ-4179) [Kubernetes] Extend NodeId in tez to support unique worker identity

2020-05-18 Thread Attila Magyar (Jira)


[ 
https://issues.apache.org/jira/browse/TEZ-4179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17110032#comment-17110032
 ] 

Attila Magyar commented on TEZ-4179:


[~rajesh.balamohan], the tests passed; could you please review it?

cc: [~prasanth_j]

> [Kubernetes] Extend NodeId in tez to support unique worker identity
> ---
>
> Key: TEZ-4179
> URL: https://issues.apache.org/jira/browse/TEZ-4179
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Prasanth Jayachandran
>Assignee: Attila Magyar
>Priority: Major
> Attachments: TEZ-4179.1.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In a Kubernetes environment, where pods can have the same host name and port, 
> node trackers can end up retaining an old instance of a pod in their cache. 
> In the case of Hive LLAP, where the LLAP Tez task scheduler maintains node 
> membership based on ZooKeeper registry events, a NODE_ADDED event followed by 
> a NODE_REMOVED event can end up removing the node/host from the node trackers 
> because the hostname and service port are stable. The NODE_REMOVED event in 
> this case is a stale event for the already dead pod, which ZK sends only 
> after the session timeout (in case of a non-graceful shutdown). If this 
> sequence of events happens, a node/host is completely lost from the 
> scheduler's perspective. 
> To support this scenario, Tez can extend YARN's NodeId to include a 
> uniqueIdentifier. The LLAP task scheduler can then construct the container 
> object with this new NodeId, so that stale events like the one above only 
> remove the host/node that matches the old uniqueIdentifier. 
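
The idea in the description above can be illustrated with a small, self-contained sketch. It is not the attached patch and does not extend YARN's NodeId: WorkerIdentity and NodeTracker are hypothetical names, and the tracker is reduced to a single map. The point it shows is that once the unique identifier takes part in equality, a stale removal for a dead pod no longer matches the live pod that reuses the same host and port.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical identity: host and port plus a per-instance unique id.
final class WorkerIdentity {
  private final String host;
  private final int port;
  private final String uniqueId;

  WorkerIdentity(String host, int port, String uniqueId) {
    this.host = host;
    this.port = port;
    this.uniqueId = uniqueId;
  }

  String address() {
    return host + ":" + port;
  }

  @Override
  public boolean equals(Object o) {
    if (this == o) return true;
    if (!(o instanceof WorkerIdentity)) return false;
    WorkerIdentity other = (WorkerIdentity) o;
    // The unique id is part of equality, so two pods sharing a
    // host and port are still different workers.
    return port == other.port
        && host.equals(other.host)
        && uniqueId.equals(other.uniqueId);
  }

  @Override
  public int hashCode() {
    return Objects.hash(host, port, uniqueId);
  }
}

// Hypothetical tracker keyed by address; not the Tez node tracker.
final class NodeTracker {
  private final Map<String, WorkerIdentity> liveByAddress = new HashMap<>();

  void nodeAdded(WorkerIdentity id) {
    liveByAddress.put(id.address(), id);
  }

  // Removes the entry only if it still belongs to the same worker instance.
  // Returns false for a stale event about an already replaced pod.
  boolean nodeRemoved(WorkerIdentity id) {
    return liveByAddress.remove(id.address(), id);
  }

  boolean isLive(String host, int port) {
    return liveByAddress.containsKey(host + ":" + port);
  }
}
```

With a key made of host and port alone, the late NODE_REMOVED for the old pod would have removed the new pod's entry; here it is ignored because the identifiers differ.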





[jira] [Updated] (TEZ-4179) [Kubernetes] Extend NodeId in tez to support unique worker identity

2020-05-18 Thread Attila Magyar (Jira)


 [ 
https://issues.apache.org/jira/browse/TEZ-4179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Magyar updated TEZ-4179:
---
Attachment: (was: TEZ-4179.1.patch)

> [Kubernetes] Extend NodeId in tez to support unique worker identity
> ---
>
> Key: TEZ-4179
> URL: https://issues.apache.org/jira/browse/TEZ-4179
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Prasanth Jayachandran
>Assignee: Attila Magyar
>Priority: Major
> Attachments: TEZ-4179.1.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In a Kubernetes environment, where pods can have the same host name and port, 
> node trackers can end up retaining an old instance of a pod in their cache. 
> In the case of Hive LLAP, where the LLAP Tez task scheduler maintains node 
> membership based on ZooKeeper registry events, a NODE_ADDED event followed by 
> a NODE_REMOVED event can end up removing the node/host from the node trackers 
> because the hostname and service port are stable. The NODE_REMOVED event in 
> this case is a stale event for the already dead pod, which ZK sends only 
> after the session timeout (in case of a non-graceful shutdown). If this 
> sequence of events happens, a node/host is completely lost from the 
> scheduler's perspective. 
> To support this scenario, Tez can extend YARN's NodeId to include a 
> uniqueIdentifier. The LLAP task scheduler can then construct the container 
> object with this new NodeId, so that stale events like the one above only 
> remove the host/node that matches the old uniqueIdentifier. 





[jira] [Commented] (TEZ-4179) [Kubernetes] Extend NodeId in tez to support unique worker identity

2020-05-18 Thread TezQA (Jira)


[ 
https://issues.apache.org/jira/browse/TEZ-4179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17110024#comment-17110024
 ] 

TezQA commented on TEZ-4179:


| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  9m 20s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  0s{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m  0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 23s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 28s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 26s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 29s{color} | {color:green} master passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  1m 11s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 10s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 21s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 20s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 20s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  0m 17s{color} | {color:orange} tez-dag: The patch generated 2 new + 11 unchanged - 1 fixed = 13 total (was 12) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m  0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 20s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m  7s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  3m 48s{color} | {color:green} tez-dag in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m  8s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 23m 23s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.8 Server=19.03.8 base: https://builds.apache.org/job/PreCommit-TEZ-Build/437/artifact/out/Dockerfile |
| JIRA Issue | TEZ-4179 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/13003250/TEZ-4179.1.patch |
| Optional Tests | dupname asflicense javac javadoc unit spotbugs findbugs checkstyle compile |
| uname | Linux 367ab303b05b 4.15.0-74-generic #84-Ubuntu SMP Thu Dec 19 08:06:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | personality/tez.sh |
| git revision | master / 07c807b |
| Default Java | 1.8.0_252 |
| checkstyle | https://builds.apache.org/job/PreCommit-TEZ-Build/437/artifact/out/diff-checkstyle-tez-dag.txt |
| Test Results | https://builds.apache.org/job/PreCommit-TEZ-Build/437/testReport/ |
| Max. process+thread count | 190 (vs. ulimit of 5500) |
| modules | C: tez-dag U: tez-dag |
| Console output | https://builds.apache.org/job/PreCommit-TEZ-Build/437/console |
| versions | git=2.7.4 maven=3.3.9 findbugs=3.0.1 |
| Powered by | Apache Yetus 0.11.1 https://yetus.apache.org |


This message was automatically generated.



> [Kubernetes] Extend NodeId in tez to support unique worker identity
> ---
>
> Key: TEZ-4179
> URL: https://issues.apache.org/jira/browse/TEZ-4179
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Prasanth Jayachandran
>Assignee: Attila Magyar
>Priority: Major
>   

[jira] [Updated] (TEZ-4179) [Kubernetes] Extend NodeId in tez to support unique worker identity

2020-05-18 Thread Attila Magyar (Jira)


 [ 
https://issues.apache.org/jira/browse/TEZ-4179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Magyar updated TEZ-4179:
---
Attachment: TEZ-4179.1.patch

> [Kubernetes] Extend NodeId in tez to support unique worker identity
> ---
>
> Key: TEZ-4179
> URL: https://issues.apache.org/jira/browse/TEZ-4179
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Prasanth Jayachandran
>Assignee: Attila Magyar
>Priority: Major
> Attachments: TEZ-4179.1.patch, TEZ-4179.1.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In a Kubernetes environment, where pods can have the same host name and port, 
> node trackers can end up retaining an old instance of a pod in their cache. 
> In the case of Hive LLAP, where the LLAP Tez task scheduler maintains node 
> membership based on ZooKeeper registry events, a NODE_ADDED event followed by 
> a NODE_REMOVED event can end up removing the node/host from the node trackers 
> because the hostname and service port are stable. The NODE_REMOVED event in 
> this case is a stale event for the already dead pod, which ZK sends only 
> after the session timeout (in case of a non-graceful shutdown). If this 
> sequence of events happens, a node/host is completely lost from the 
> scheduler's perspective. 
> To support this scenario, Tez can extend YARN's NodeId to include a 
> uniqueIdentifier. The LLAP task scheduler can then construct the container 
> object with this new NodeId, so that stale events like the one above only 
> remove the host/node that matches the old uniqueIdentifier. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (TEZ-4179) [Kubernetes] Extend NodeId in tez to support unique worker identity

2020-05-18 Thread Attila Magyar (Jira)


 [ 
https://issues.apache.org/jira/browse/TEZ-4179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Magyar reassigned TEZ-4179:
--

Assignee: Attila Magyar  (was: Prasanth Jayachandran)

> [Kubernetes] Extend NodeId in tez to support unique worker identity
> ---
>
> Key: TEZ-4179
> URL: https://issues.apache.org/jira/browse/TEZ-4179
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Prasanth Jayachandran
>Assignee: Attila Magyar
>Priority: Major
> Attachments: TEZ-4179.1.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In a Kubernetes environment, where pods can have the same host name and port, 
> node trackers can end up retaining an old instance of a pod in their cache. 
> In the case of Hive LLAP, where the LLAP Tez task scheduler maintains node 
> membership based on ZooKeeper registry events, a NODE_ADDED event followed by 
> a NODE_REMOVED event can end up removing the node/host from the node trackers 
> because the hostname and service port are stable. The NODE_REMOVED event in 
> this case is a stale event for the already dead pod, which ZK sends only 
> after the session timeout (in case of a non-graceful shutdown). If this 
> sequence of events happens, a node/host is completely lost from the 
> scheduler's perspective. 
> To support this scenario, Tez can extend YARN's NodeId to include a 
> uniqueIdentifier. The LLAP task scheduler can then construct the container 
> object with this new NodeId, so that stale events like the one above only 
> remove the host/node that matches the old uniqueIdentifier. 


