[jira] [Commented] (TEZ-1961) Remove misleading exception No running dag from AM logs
[ https://issues.apache.org/jira/browse/TEZ-1961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14532138#comment-14532138 ]

TezQA commented on TEZ-1961:

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12731085/TEZ-1961-3.patch
against master revision 02870f0.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/649//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/649//console

This message is automatically generated.
Remove misleading exception No running dag from AM logs

Key: TEZ-1961
URL: https://issues.apache.org/jira/browse/TEZ-1961
Project: Apache Tez
Issue Type: Improvement
Reporter: Siddharth Seth
Assignee: Jeff Zhang
Priority: Critical
Attachments: TEZ-1961-1.patch, TEZ-1961-2.patch, TEZ-1961-3.patch

{code}
15/01/14 16:45:06 INFO ipc.Server: IPC Server handler 0 on 51000, call org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPB.getDAGStatus from Call#0 Retry#0
org.apache.tez.dag.api.TezException: No running dag at present
    at org.apache.tez.dag.api.client.DAGClientHandler.getDAG(DAGClientHandler.java:84)
    at org.apache.tez.dag.api.client.DAGClientHandler.getACLManager(DAGClientHandler.java:151)
    at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPBServerImpl.getDAGStatus(DAGClientAMProtocolBlockingPBServerImpl.java:94)
    at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolRPC$DAGClientAMProtocol$2.callBlockingMethod(DAGClientAMProtocolRPC.java:7375)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2041)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2037)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2035)
15/01/14 16:45:06 INFO client.DAGClientImpl: DAG initialized: CurrentState=Running
{code}

This exception shows up fairly often and isn't very relevant: it is caused by status queries that arrive before a DAG is submitted to the AM. This is very misleading, especially for folks new to Tez, and should be removed.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
Success: TEZ-1961 PreCommit Build #649
Jira: https://issues.apache.org/jira/browse/TEZ-1961
Build: https://builds.apache.org/job/PreCommit-TEZ-Build/649/

All tests passed
[jira] [Commented] (TEZ-2426) Task input not complete before sending Task completed event
[ https://issues.apache.org/jira/browse/TEZ-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14533094#comment-14533094 ]

Siddharth Seth commented on TEZ-2426:

[~bikassaha] - do you have additional logs, specifically the entire AM log? There also seems to be a discrepancy between the AM and task log times. I'm assuming the clocks on the nodes are out of sync.

I can see how the exception happens during execution of the next task, since we don't join on the eventRouter thread. However, I'm not sure how the FAILED message would go through for the previous attempt as a result of this. It should have gone through for the currently running task. If it had gone through for the previous task, the AM should have thrown an error about an invalid taskAttemptId. That leads me to believe something else is broken at the same time.

Task input not complete before sending Task completed event

Key: TEZ-2426
URL: https://issues.apache.org/jira/browse/TEZ-2426
Project: Apache Tez
Issue Type: Bug
Reporter: Bikas Saha
Priority: Critical
Attachments: am.log, container.log

Sequence of events
1) Task A starts in a container
2) Task A complete event comes to AM
3) Task B starts in the same container
4) Task A's input calls some method on its context. Crashes with NPE
5) The crash sends an input failed event for Task A to the AM
6) Task A state machine crashes saying cannot handle failed after success

In some cases, it could be that a status update event is also sent after completion, though I'm not sure if it's related to the failed event being sent.
[jira] [Commented] (TEZ-2426) Task input not complete before sending Task completed event
[ https://issues.apache.org/jira/browse/TEZ-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14533096#comment-14533096 ]

Siddharth Seth commented on TEZ-2426:

The status update event after the task failed is also strange. Will look into that. The thread for the last running task may not be exiting properly.
[jira] [Updated] (TEZ-2418) TASK_ATTEMPT_FAILED_EVENT and TASK_COMPLETED_EVENT should move back to direct routing to attempt
[ https://issues.apache.org/jira/browse/TEZ-2418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-2418: Description: Due to recovery code path, they are currently double routed to the vertex first and then the attempt. TASK_ATTEMPT_FAILED_EVENT and TASK_COMPLETED_EVENT should move back to direct routing to attempt Key: TEZ-2418 URL: https://issues.apache.org/jira/browse/TEZ-2418 Project: Apache Tez Issue Type: Bug Reporter: Bikas Saha Assignee: Bikas Saha Attachments: TEZ-2418.1.patch Due to recovery code path, they are currently double routed to the vertex first and then the attempt. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-776) Reduce AM mem usage caused by storing TezEvents
[ https://issues.apache.org/jira/browse/TEZ-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14533177#comment-14533177 ]

Hitesh Shah commented on TEZ-776:

+1 on the patch.

Though, as [~sseth] pointed out, there are potential concerns around BroadcastEdgeManager thread safety. From a practical point of view, the problem is unlikely to be hit, as the prepare function is invoked long before the edge is used and the rpc threads will likely not have looked up the event route metadata object before that point. Theoretically, there is a possibility of visibility issues, given that there is no lock on any function inside BroadcastEdgeManager (so the happens-before semantics would not kick in).

Reduce AM mem usage caused by storing TezEvents

Key: TEZ-776
URL: https://issues.apache.org/jira/browse/TEZ-776
Project: Apache Tez
Issue Type: Sub-task
Reporter: Siddharth Seth
Assignee: Bikas Saha
Priority: Blocker
Attachments: TEZ-776.1.patch, TEZ-776.10.patch, TEZ-776.11.patch, TEZ-776.12.patch, TEZ-776.13.patch, TEZ-776.14.patch, TEZ-776.2.patch, TEZ-776.3.patch, TEZ-776.4.patch, TEZ-776.5.patch, TEZ-776.6.A.patch, TEZ-776.6.B.patch, TEZ-776.7.patch, TEZ-776.8.patch, TEZ-776.9.patch, TEZ-776.ondemand.1.patch, TEZ-776.ondemand.2.patch, TEZ-776.ondemand.3.patch, TEZ-776.ondemand.4.patch, TEZ-776.ondemand.5.patch, TEZ-776.ondemand.6.patch, TEZ-776.ondemand.7.patch, TEZ-776.ondemand.patch, With_Patch_AM_hotspots.png, With_Patch_AM_profile.png, Without_patch_AM_CPU_Usage.png, events-problem-solutions.txt, with_patch_jmc_output_of_AM.png, without_patch_jmc_output_of_AM.png

This is open ended at the moment.
A fair chunk of the AM heap is taken up by TezEvents (specifically DataMovementEvents - 64 bytes per event).
Depending on the connection pattern, this puts limits on the number of tasks that can be processed.
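For readers following the visibility concern in the comment above, here is a minimal sketch, assuming a simplified holder class. RouteMetadataHolder, RouteMetadata, prepare() and lookupNumTargets() are hypothetical names made up for illustration and are not the actual BroadcastEdgeManager code. It shows one common way to get a happens-before edge without a lock: publishing an immutable object through a volatile field.

```java
// Hypothetical stand-in for an edge manager; not the real BroadcastEdgeManager.
final class RouteMetadataHolder {

    // Immutable payload: its final field is safely visible once the reference is seen.
    static final class RouteMetadata {
        final int numTargets;

        RouteMetadata(int numTargets) {
            this.numTargets = numTargets;
        }
    }

    // volatile: the write in prepare() happens-before any later read in lookupNumTargets().
    // Without volatile (or a lock), a reader thread is not guaranteed to ever see the write.
    private volatile RouteMetadata metadata;

    void prepare(int numDestinationTasks) {
        metadata = new RouteMetadata(numDestinationTasks);
    }

    // Returns -1 while prepare() has not been observed yet.
    int lookupNumTargets() {
        RouteMetadata m = metadata;
        return m == null ? -1 : m.numTargets;
    }

    // Starts a reader (playing the role of an rpc thread) before prepare() is called
    // and returns the value the reader eventually observes.
    static int demo() {
        RouteMetadataHolder holder = new RouteMetadataHolder();
        int[] seen = new int[1];
        Thread reader = new Thread(() -> {
            int v;
            while ((v = holder.lookupNumTargets()) == -1) {
                Thread.yield();
            }
            seen[0] = v;
        });
        reader.start();
        holder.prepare(8);
        try {
            reader.join(); // join() makes the write to seen[0] visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];
    }
}
```

Whether the real class needs this depends on how and when its fields are written, which the comment above only describes informally.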
[jira] [Commented] (TEZ-1961) Remove misleading exception No running dag from AM logs
[ https://issues.apache.org/jira/browse/TEZ-1961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14533202#comment-14533202 ]

Bikas Saha commented on TEZ-1961:

bq. Previously DAGClientAMProtocol#getAMStatus is not supported for non-session mode

That was because it returns session status, which makes no sense in non-session mode. Please don't change that.
[jira] [Commented] (TEZ-776) Reduce AM mem usage caused by storing TezEvents
[ https://issues.apache.org/jira/browse/TEZ-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14533129#comment-14533129 ]

Bikas Saha commented on TEZ-776:

[~hitesh] [~rajesh.balamohan] Any further comments?
[jira] [Commented] (TEZ-2410) VertexGroupCommitFinishedEvent VertexCommitStartedEvent is not logged correctly
[ https://issues.apache.org/jira/browse/TEZ-2410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14533146#comment-14533146 ]

Bikas Saha commented on TEZ-2410:

Shouldn't vertexGroup.isCommitted=true be replaced by vertexGroup.commitStarted=true? This is when the commit process starts, right? Without this, vertexGroup.isInCommitting() will return false. Not sure how tests are passing with this.
{code}
for (final VertexGroupInfo groupInfo : vertexGroups.values()) {
  if (!groupInfo.outputs.isEmpty()) {
-   groupInfo.committed = true;
    final Vertex v = getVertex(groupInfo.groupMembers.iterator().next());
    for (final String outputName : groupInfo.outputs) {
      final OutputKey outputKey = new OutputKey(outputName, groupInfo.groupName, true);
@@ -1920,7 +1931,6 @@ public class DAGImpl implements org.apache.tez.dag.app.dag.DAG,
        + data, groupName= + groupInfo.groupName);
        continue;
      }
-     groupInfo.committed = true;
{code}

This is probably not going to work with the above code
{code}
// partial output may already have been in committing or committed. fail if so
List<VertexGroupInfo> groupList = vertexGroupInfo.get(vertex.getName());
if (groupList != null) {
  for (VertexGroupInfo groupInfo : groupList) {
    if (groupInfo.isInCommitting()) {
      String msg = "Aborting job as committing vertex: "
          + vertex.getLogIdentifier() + " is re-running";
      LOG.info(msg);
      addDiagnostic(msg);
      enactKill(DAGTerminationCause.VERTEX_RERUN_IN_COMMITTING,
          VertexTerminationCause.VERTEX_RERUN_IN_COMMITTING);
      return true;
    } else if (groupInfo.isCommitted()) {
{code}

1) succeededCommits looks unused - we could remove it
2) Why is vertexGroup.commitStarted=true here? This is where the commit finishes, right?
3) The if condition can be replaced by vertexGroup.isCommitted()
4) Unnecessary space before ++
5) Missing { after the if statement
{code}
+ OutputKey outputKey = commitCompletedEvent.getOutputKey();
+ succeededCommits.add(outputKey);    <-- unused
+ if (outputKey.isVertexGroupOutput){
+   VertexGroupInfo vertexGroup = vertexGroups.get(outputKey.getEntityName());
+   vertexGroup.commitStarted = true;    <-- why here at finish time?
+   vertexGroup.successfulCommits ++;    <-- space
+   if (vertexGroup.successfulCommits == vertexGroup.outputs.size()) {    <-- replace with isCommitted()
+     if (!commitAllOutputsOnSuccess)    <-- missing {
+       try {
{code}

Which test case covers the VertexImpl change? testVertexCommit_OnVertexSuccess()?
Which test/check covers that a vertexgroupcommit event is not written for a non-group vertex when all commits happen on dag success?
Rename testVertexSucceed_OnDAGSuccess() to testVertexCommit_OnDAGSuccess()?

VertexGroupCommitFinishedEvent VertexCommitStartedEvent is not logged correctly

Key: TEZ-2410
URL: https://issues.apache.org/jira/browse/TEZ-2410
Project: Apache Tez
Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Jeff Zhang
Assignee: Jeff Zhang
Priority: Blocker
Attachments: TEZ-2410-1.patch, TEZ-2410-1.patch, TEZ-2410-2.patch

VertexGroupCommitFinishedEvent may be logged for non-vertex group commits.
VertexGroupCommitFinishedEvent may be logged for each member vertex of the group instead of once per group.
VertexCommitStartedEvent may be logged for each output of a vertex.
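To make the state questions in the review above concrete, here is a minimal sketch, assuming that isInCommitting() means "commit started but not all outputs committed" and that isCommitted() means "every output has committed". GroupCommitState and its method names are hypothetical and are not the VertexGroupInfo class from the patch; only the field names commitStarted and successfulCommits are taken from the quoted diff.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the per-group commit bookkeeping discussed in the review.
final class GroupCommitState {
    private final List<String> outputs;
    private boolean commitStarted;   // set once, when the group commit begins
    private int successfulCommits;   // incremented as each output finishes committing

    GroupCommitState(List<String> outputs) {
        this.outputs = new ArrayList<>(outputs);
    }

    // Called where the commit is kicked off, not where it completes.
    void startCommit() {
        commitStarted = true;
    }

    // Called once per output when its commit completes successfully.
    void onOutputCommitted() {
        successfulCommits++;
    }

    // Assumed meaning: all outputs of a started commit have finished.
    boolean isCommitted() {
        return commitStarted && successfulCommits == outputs.size();
    }

    // Assumed meaning: commit has started and is not yet complete.
    boolean isInCommitting() {
        return commitStarted && !isCommitted();
    }
}
```

Under these assumptions, setting commitStarted only at completion time would make isInCommitting() false for the whole commit window, which appears to be the concern raised in points 2 and 3.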
[jira] [Updated] (TEZ-2418) TASK_ATTEMPT_FAILED_EVENT and TASK_COMPLETED_EVENT should move back to direct routing to attempt
[ https://issues.apache.org/jira/browse/TEZ-2418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-2418: Summary: TASK_ATTEMPT_FAILED_EVENT and TASK_COMPLETED_EVENT should move back to direct routing to attempt (was: TASK_ATTEMPT_FAILED_EVENT missed in TEZ-2325) TASK_ATTEMPT_FAILED_EVENT and TASK_COMPLETED_EVENT should move back to direct routing to attempt Key: TEZ-2418 URL: https://issues.apache.org/jira/browse/TEZ-2418 Project: Apache Tez Issue Type: Bug Reporter: Bikas Saha Assignee: Bikas Saha Attachments: TEZ-2418.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2430) Add test for RecoveryEvent Spec
[ https://issues.apache.org/jira/browse/TEZ-2430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated TEZ-2430:
Issue Type: Sub-task (was: Improvement)
Parent: TEZ-15

Add test for RecoveryEvent Spec

Key: TEZ-2430
URL: https://issues.apache.org/jira/browse/TEZ-2430
Project: Apache Tez
Issue Type: Sub-task
Reporter: Jeff Zhang
Assignee: Jeff Zhang

* Ordering of RecoveryEvents.
** DataMovementEvent must be logged before TaskAttemptFinishedEvent
** InputDataInfoEvent must be logged before VertexInitializedEvent (already covered in the existing test)
* Frequency of RecoveryEvents, e.g. TaskAttemptStartedEvent can only be logged once, but TaskAttemptFinishedEvent can be logged twice (when the TaskAttempt transitions from SUCCEEDED to FAILED).
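The ordering and frequency rules listed above could be checked along these lines. This is only a sketch under assumptions: RecoveryLogCheck, its EventType enum and violations() are hypothetical names, the log is modelled as a flat list of event types for a single task attempt, and "DataMovementEvent before TaskAttemptFinishedEvent" is read as "no DataMovementEvent after any finished event".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical checker; event names mirror the issue text, the class is not Tez code.
final class RecoveryLogCheck {
    enum EventType { TASK_ATTEMPT_STARTED, DATA_MOVEMENT, TASK_ATTEMPT_FINISHED }

    // Returns a description of each rule violation found in the log of one task attempt.
    static List<String> violations(List<EventType> log) {
        List<String> problems = new ArrayList<>();
        int started = 0;
        int finished = 0;
        for (EventType e : log) {
            switch (e) {
                case TASK_ATTEMPT_STARTED:
                    // Frequency rule: started may be logged only once (reported once).
                    if (++started == 2) {
                        problems.add("TaskAttemptStartedEvent logged more than once");
                    }
                    break;
                case DATA_MOVEMENT:
                    // Ordering rule: data movement must precede the finished event.
                    if (finished > 0) {
                        problems.add("DataMovementEvent logged after TaskAttemptFinishedEvent");
                    }
                    break;
                case TASK_ATTEMPT_FINISHED:
                    // Frequency rule: finished may be logged at most twice (SUCCEEDED, then FAILED).
                    if (++finished == 3) {
                        problems.add("TaskAttemptFinishedEvent logged more than twice");
                    }
                    break;
            }
        }
        return problems;
    }
}
```

A real test would presumably read the events back from the recovery log rather than from an in-memory list.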
[jira] [Created] (TEZ-2431) Recovery of task events (eg. datamovement events) should not depend on ordering of task attempt events
Bikas Saha created TEZ-2431: --- Summary: Recovery of task events (eg. datamovement events) should not depend on ordering of task attempt events Key: TEZ-2431 URL: https://issues.apache.org/jira/browse/TEZ-2431 Project: Apache Tez Issue Type: Sub-task Reporter: Bikas Saha Today, task attempt events need to go through verteximpl before reaching the task in order to maintain ordering guarantees for recovery. This causes these events to be routed twice through the dispatcher. This can cause overhead delays in large jobs. Also, this makes assumptions about event ordering which make the system fragile. Recovery should work independently of other system interactions so that evolution of other components is not affected by recovery unless it affects recovery logically. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2418) TASK_ATTEMPT_FAILED_EVENT missed in TEZ-2325
[ https://issues.apache.org/jira/browse/TEZ-2418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-2418: Priority: Major (was: Blocker) TASK_ATTEMPT_FAILED_EVENT missed in TEZ-2325 Key: TEZ-2418 URL: https://issues.apache.org/jira/browse/TEZ-2418 Project: Apache Tez Issue Type: Bug Reporter: Bikas Saha Assignee: Bikas Saha Attachments: TEZ-2418.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2076) Tez framework to extract/analyze data stored in ATS for specific dag
[ https://issues.apache.org/jira/browse/TEZ-2076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14533354#comment-14533354 ]

Jonathan Eagles commented on TEZ-2076:

[~rajesh.balamohan], regarding the TezTaskAttemptID.fromString slowness: I have put up a patch on TEZ-1526 that may help in understanding some of the parameters of the slowness above. Unfortunately, I haven't been able to fully codify a solution, but perhaps you could take a look to see if it addresses some of the issues you see above?

Tez framework to extract/analyze data stored in ATS for specific dag

Key: TEZ-2076
URL: https://issues.apache.org/jira/browse/TEZ-2076
Project: Apache Tez
Issue Type: Improvement
Reporter: Rajesh Balamohan
Assignee: Rajesh Balamohan
Attachments: TEZ-2076.1.patch, TEZ-2076.10.patch, TEZ-2076.2.patch, TEZ-2076.3.patch, TEZ-2076.4.patch, TEZ-2076.5.patch, TEZ-2076.6.patch, TEZ-2076.7.patch, TEZ-2076.8.patch, TEZ-2076.9.patch, TEZ-2076.WIP.2.patch, TEZ-2076.WIP.3.patch, TEZ-2076.WIP.patch

- Users should be able to download ATS data pertaining to a DAG from Tez-UI (more like a zip file containing DAG/Vertex/Task/TaskAttempt info).
- This can be plugged into an analyzer which parses the data, adds semantics and provides an in-memory representation for further analysis.
- This will make it possible to write different analyzer rules, which can be run on top of this in-memory representation to come up with analysis of the DAG.
- Results of these analyzer rules can be rendered in a UI (standalone webapp) at a later point in time.
[jira] [Updated] (TEZ-1526) LoadingCache for TezTaskID slow for large jobs
[ https://issues.apache.org/jira/browse/TEZ-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Eagles updated TEZ-1526:
Attachment: TEZ-1526.4.patch

LoadingCache for TezTaskID slow for large jobs

Key: TEZ-1526
URL: https://issues.apache.org/jira/browse/TEZ-1526
Project: Apache Tez
Issue Type: Improvement
Reporter: Jonathan Eagles
Assignee: Jonathan Eagles
Labels: performance
Attachments: 10-TezTaskIDs.patch, TEZ-1526-v1.patch, TEZ-1526-v2.patch, TEZ-1526.3.patch, TEZ-1526.4.patch

Using the LoadingCache with default builder settings, 100,000 TezTaskIDs are created in 10 seconds on my setup. With a LoadingCache initialCapacity of 10,000 they are created in 300 ms. With no LoadingCache, they are created in 10 ms. A test case is attached to illustrate the condition I would like to see sped up.
[jira] [Updated] (TEZ-2421) Deadlock in AM because attempt and vertex locking each other out
[ https://issues.apache.org/jira/browse/TEZ-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-2421: Attachment: TEZ-2421.1.patch Deadlock in AM because attempt and vertex locking each other out Key: TEZ-2421 URL: https://issues.apache.org/jira/browse/TEZ-2421 Project: Apache Tez Issue Type: Bug Reporter: Bikas Saha Assignee: Bikas Saha Attachments: TEZ-2421.1.patch Ideally locks should be taken one way - either going down or up. Preferably not going up because most such data can be passed in during object construction. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
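The locking guideline in the description above can be illustrated with a minimal sketch, assuming a parent/child pair of objects that each guard their state with their own monitor. VertexNode and AttemptNode are hypothetical names, not the Tez VertexImpl or TaskAttemptImpl classes. The child receives the data it needs from the parent at construction time, so it never has to call back up into the parent while holding its own lock.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical parent/child pair; locks are only ever taken in the order vertex -> attempt.
final class LockOrderingSketch {

    static final class AttemptNode {
        private final String vertexName; // copied in at construction, so no upward call is needed
        private final int id;
        private String state = "NEW";

        AttemptNode(String vertexName, int id) {
            this.vertexName = vertexName;
            this.id = id;
        }

        synchronized void setState(String newState) {
            state = newState;
        }

        // Holds only the attempt lock; it does not touch the vertex.
        synchronized String describe() {
            return vertexName + "/attempt_" + id + ":" + state;
        }
    }

    static final class VertexNode {
        private final String name;
        private final List<AttemptNode> attempts = new ArrayList<>();

        VertexNode(String name) {
            this.name = name;
        }

        synchronized AttemptNode newAttempt() {
            AttemptNode attempt = new AttemptNode(name, attempts.size());
            attempts.add(attempt);
            return attempt;
        }

        // Going down: the vertex lock is held first, then each attempt lock in turn.
        synchronized List<String> describeAttempts() {
            List<String> out = new ArrayList<>();
            for (AttemptNode attempt : attempts) {
                out.add(attempt.describe());
            }
            return out;
        }
    }
}
```

If describe() instead called a synchronized getter on the vertex, one thread could hold the attempt lock while waiting for the vertex lock and another the reverse, which is the shape of deadlock the issue title describes.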
[jira] [Commented] (TEZ-1961) Remove misleading exception No running dag from AM logs
[ https://issues.apache.org/jira/browse/TEZ-1961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14533607#comment-14533607 ]

Jeff Zhang commented on TEZ-1961:

bq. That was because it returns session status, which makes no sense in non-session mode.

I checked DAGClientHandler#getTezAppMasterStatus; it just returns the DAGAppMaster's state. That state should also be valid in non-session mode, right?
[jira] [Commented] (TEZ-2221) VertexGroup name should be unqiue
[ https://issues.apache.org/jira/browse/TEZ-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14533628#comment-14533628 ]

Jeff Zhang commented on TEZ-2221:

[~rohini] It's been committed, so I suppose it won't affect Pig any more.

VertexGroup name should be unqiue

Key: TEZ-2221
URL: https://issues.apache.org/jira/browse/TEZ-2221
Project: Apache Tez
Issue Type: Bug
Reporter: Jeff Zhang
Assignee: Jeff Zhang
Fix For: 0.7.0, 0.5.4, 0.6.1
Attachments: TEZ-2221-1.patch, TEZ-2221-2.patch, TEZ-2221-3.patch, TEZ-2221-4.patch, TEZ-2221-5-revert.patch

VertexGroupCommitStartedEvent and VertexGroupCommitFinishedEvent use the vertex group name to identify the vertex group commit, so vertex groups with the same name will conflict. The current equals and hashCode of VertexGroup, by contrast, use both the vertex group name and the member names.
[jira] [Comment Edited] (TEZ-1961) Remove misleading exception No running dag from AM logs
[ https://issues.apache.org/jira/browse/TEZ-1961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14533653#comment-14533653 ]

Jeff Zhang edited comment on TEZ-1961 at 5/8/15 12:17 AM:

I also renamed DAGClientHandler#getSessionState to DAGClientHandler#getAMState and made it support non-session mode in this patch, to avoid confusion. If that is valid, I can do it in this patch, because it is a very simple change.

was (Author: zjffdu):
I also renamed DAGClientHandler#getSessionState to DAGClientHandler#getAMState in this patch, to avoid confusion. If that is valid, I can do it in this patch, because it is a very simple change.
[jira] [Commented] (TEZ-1961) Remove misleading exception No running dag from AM logs
[ https://issues.apache.org/jira/browse/TEZ-1961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14533653#comment-14533653 ]

Jeff Zhang commented on TEZ-1961:

I also renamed DAGClientHandler#getSessionState to DAGClientHandler#getAMState in this patch, to avoid confusion. If that is valid, I can do it in this patch, because it is a very simple change.
[jira] [Updated] (TEZ-2426) Task input not complete before sending Task completed event
[ https://issues.apache.org/jira/browse/TEZ-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated TEZ-2426: Attachment: TEZ-2426.1.txt This should fix it. Main changes in the patch:
- Wait for the eventRouter thread to complete before considering a task as done and accepting the next one.
- Fixed visibility concerns in *Context.
- Moved some of the cleanup into LogicalIOProcessorRuntimeTask, since progress() etc. can happen often and shouldn't hit a volatile.
Task input not complete before sending Task completed event --- Key: TEZ-2426 URL: https://issues.apache.org/jira/browse/TEZ-2426 Project: Apache Tez Issue Type: Bug Reporter: Bikas Saha Priority: Critical Attachments: TEZ-2426.1.txt, am.log, container.log
Sequence of events:
1) Task A starts in a container
2) Task A complete event comes to AM
3) Task B starts in the same container
4) Task A's input calls some method on its context. Crashes with NPE
5) The crash sends an input failed event for Task A to the AM
6) Task A state machine crashes saying cannot handle failed after success
In some cases, it could be that a status update event is also sent after completion, though it is not clear whether that is related to the failed event being sent. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
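The first change listed in the TEZ-2426.1 patch notes above (waiting for the event router thread to finish before a task counts as done) can be illustrated with a minimal Java sketch. This is not the Tez code: the class `TaskCompletionSketch`, its `route` and `finish` methods, and the `STOP` sentinel are hypothetical names chosen for illustration. It only shows the general idea that completion is reported after `Thread.join()` returns, so no late event can touch state that the next task in the same container will reuse.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch: a task is only reported as finished after its event router thread exits. */
public class TaskCompletionSketch {
    // Sentinel object that tells the router thread to stop (illustrative, not a Tez type).
    private static final Object STOP = new Object();

    private final BlockingQueue<Object> events = new LinkedBlockingQueue<>();
    private final AtomicInteger handled = new AtomicInteger();
    private final Thread eventRouter = new Thread(this::drainEvents, "event-router-sketch");

    public TaskCompletionSketch() {
        eventRouter.start();
    }

    // Runs on the router thread: handle events until the sentinel arrives.
    private void drainEvents() {
        try {
            while (true) {
                Object event = events.take();
                if (event == STOP) {
                    return;
                }
                handled.incrementAndGet();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Queues an event for the router thread. */
    public void route(String event) {
        events.add(event);
    }

    /**
     * Asks the router to stop and blocks until it has exited.
     * Only after join() returns is it safe to treat the task as done.
     */
    public int finish() {
        events.add(STOP);
        try {
            eventRouter.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for event router", e);
        }
        return handled.get();
    }
}
```

Because the queue is FIFO, every event routed before `finish()` is handled before the sentinel is seen, so the count returned by `finish()` covers all of them.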
[jira] [Updated] (TEZ-2410) VertexGroupCommitFinishedEvent VertexCommitStartedEvent is not logged correctly
[ https://issues.apache.org/jira/browse/TEZ-2410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated TEZ-2410: Attachment: TEZ-2410-3.patch VertexGroupCommitFinishedEvent VertexCommitStartedEvent is not logged correctly - Key: TEZ-2410 URL: https://issues.apache.org/jira/browse/TEZ-2410 Project: Apache Tez Issue Type: Bug Affects Versions: 0.7.0 Reporter: Jeff Zhang Assignee: Jeff Zhang Priority: Blocker Attachments: TEZ-2410-1.patch, TEZ-2410-1.patch, TEZ-2410-2.patch, TEZ-2410-3.patch VertexGroupCommitFinishedEvent may be logged for non-vertex group commits. VertexGroupCommitFinishedEvent may be logged for each member vertex of the group instead of once per group. VertexCommitStartedEvent may be logged for each output of vertex -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2421) Deadlock in AM because attempt and vertex locking each other out
[ https://issues.apache.org/jira/browse/TEZ-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-2421: Priority: Blocker (was: Major) Deadlock in AM because attempt and vertex locking each other out Key: TEZ-2421 URL: https://issues.apache.org/jira/browse/TEZ-2421 Project: Apache Tez Issue Type: Bug Reporter: Bikas Saha Assignee: Bikas Saha Priority: Blocker Attachments: TEZ-2421.1.patch Ideally locks should be taken one way - either going down or up. Preferably not going up because most such data can be passed in during object construction. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2421) Deadlock in AM because attempt and vertex locking each other out
[ https://issues.apache.org/jira/browse/TEZ-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14533734#comment-14533734 ] Bikas Saha commented on TEZ-2421: - The main issue is that the attempt takes a lock upwards into the vertex while vertex takes locks downwards into the attempt. One way has to be broken to prevent deadlock. The key culprits are getting the remoteTaskSpec and getting the taskLocation. Instead of the attempt up-calling into the vertex to get these after getting scheduled, the vertex is now sending these to the task when it schedules the task. [~zjffdu] [~sseth] [~hitesh] Please review. Deadlock in AM because attempt and vertex locking each other out Key: TEZ-2421 URL: https://issues.apache.org/jira/browse/TEZ-2421 Project: Apache Tez Issue Type: Bug Reporter: Bikas Saha Assignee: Bikas Saha Attachments: TEZ-2421.1.patch Ideally locks should be taken one way - either going down or up. Preferably not going up because most such data can be passed in during object construction. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
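The one-way locking idea from the TEZ-2421 comment above can be sketched in Java as follows. This is an illustration under stated assumptions, not the actual patch: `LockOrderSketch`, `Vertex`, `Attempt`, and the string-valued task spec and location are hypothetical stand-ins. The point it shows is that the vertex pushes the data down when it schedules the attempt, so the attempt never needs to take the vertex lock.

```java
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical sketch of one-way (downward only) lock acquisition. */
public class LockOrderSketch {

    /** Stand-in for a task attempt. It never calls back up into the vertex. */
    public static final class Attempt {
        private final ReentrantLock lock = new ReentrantLock();
        private String taskSpec;
        private String location;

        // Called by the vertex while the vertex lock is held: vertex lock, then attempt lock.
        public void schedule(String taskSpec, String location) {
            lock.lock();
            try {
                this.taskSpec = taskSpec;
                this.location = location;
            } finally {
                lock.unlock();
            }
        }

        // Uses only data handed over at schedule time, so no vertex lock is needed here.
        public String describe() {
            lock.lock();
            try {
                return taskSpec + "@" + location;
            } finally {
                lock.unlock();
            }
        }
    }

    /** Stand-in for a vertex. It is the only side that takes both locks. */
    public static final class Vertex {
        private final ReentrantLock lock = new ReentrantLock();
        private final String name;

        public Vertex(String name) {
            this.name = name;
        }

        public void scheduleAttempt(Attempt attempt, int taskIndex) {
            lock.lock();
            try {
                // Pass the spec and location down instead of letting the attempt ask for them.
                attempt.schedule(name + "-task" + taskIndex, "host" + taskIndex);
            } finally {
                lock.unlock();
            }
        }
    }
}
```

With this shape, every code path that holds both locks acquires them in the same order (vertex first, attempt second), which removes the cycle that a deadlock requires.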
[jira] [Commented] (TEZ-2410) VertexGroupCommitFinishedEvent VertexCommitStartedEvent is not logged correctly
[ https://issues.apache.org/jira/browse/TEZ-2410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14533733#comment-14533733 ] Jeff Zhang commented on TEZ-2410: - bq. Did not find any testDAGCommitSucceeded_OnDAGSuccess. Somewhere there should be this check for vertex v1 commit on dag success, right? historyEventHandler.verifyVertexGroupCommitFinishedEvent(v1, 0); What do you mean ? historyEventHandler.verifyVertexGroupCommitFinishedEvent(v1, 0); is in TestCommit#testDAGCommitSucceeded_OnDAGSuccess VertexGroupCommitFinishedEvent VertexCommitStartedEvent is not logged correctly - Key: TEZ-2410 URL: https://issues.apache.org/jira/browse/TEZ-2410 Project: Apache Tez Issue Type: Bug Affects Versions: 0.7.0 Reporter: Jeff Zhang Assignee: Jeff Zhang Priority: Blocker Attachments: TEZ-2410-1.patch, TEZ-2410-1.patch, TEZ-2410-2.patch, TEZ-2410-3.patch VertexGroupCommitFinishedEvent may be logged for non-vertex group commits. VertexGroupCommitFinishedEvent may be logged for each member vertex of the group instead of once per group. VertexCommitStartedEvent may be logged for each output of vertex -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1961) Remove misleading exception No running dag from AM logs
[ https://issues.apache.org/jira/browse/TEZ-1961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14533647#comment-14533647 ] Bikas Saha commented on TEZ-1961: - It's actually session state. Perhaps we should do that in a separate jira.
Remove misleading exception No running dag from AM logs - Key: TEZ-1961 URL: https://issues.apache.org/jira/browse/TEZ-1961 Project: Apache Tez Issue Type: Improvement Reporter: Siddharth Seth Assignee: Jeff Zhang Priority: Critical Attachments: TEZ-1961-1.patch, TEZ-1961-2.patch, TEZ-1961-3.patch
{code}
15/01/14 16:45:06 INFO ipc.Server: IPC Server handler 0 on 51000, call org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPB.getDAGStatus from Call#0 Retry#0
org.apache.tez.dag.api.TezException: No running dag at present
 at org.apache.tez.dag.api.client.DAGClientHandler.getDAG(DAGClientHandler.java:84)
 at org.apache.tez.dag.api.client.DAGClientHandler.getACLManager(DAGClientHandler.java:151)
 at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPBServerImpl.getDAGStatus(DAGClientAMProtocolBlockingPBServerImpl.java:94)
 at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolRPC$DAGClientAMProtocol$2.callBlockingMethod(DAGClientAMProtocolRPC.java:7375)
 at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2041)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2037)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2035)
15/01/14 16:45:06 INFO client.DAGClientImpl: DAG initialized: CurrentState=Running
{code}
This exception shows up fairly often and isn't very relevant - queries before a DAG is submitted to the AM. This is very misleading, especially for folks new to Tez, and should be removed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2410) VertexGroupCommitFinishedEvent VertexCommitStartedEvent is not logged correctly
[ https://issues.apache.org/jira/browse/TEZ-2410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14533677#comment-14533677 ] Jeff Zhang commented on TEZ-2410: - [~bikassaha] Sorry for the ugly mistake in my last patch. I have uploaded a new patch to address the issues raised in the comments.
bq. Which test case is covered the VertexImpl change? testVertexCommit_OnVertexSuccess()?
All the verifications of VertexCommitStartedEvent cover this (especially testVertexCommit_OnDAGSuccess and testVertexCommit_OnVertexSuccess). With the change in VertexImpl, VertexCommitStartedEvent may be logged multiple times (one time for each output).
bq. Which test/check is covering that vertexgroupcommit event is not written for a non-group vertex when all commits happen on dag success?
testDAGCommitSucceeded_OnDAGSuccess
VertexGroupCommitFinishedEvent VertexCommitStartedEvent is not logged correctly - Key: TEZ-2410 URL: https://issues.apache.org/jira/browse/TEZ-2410 Project: Apache Tez Issue Type: Bug Affects Versions: 0.7.0 Reporter: Jeff Zhang Assignee: Jeff Zhang Priority: Blocker Attachments: TEZ-2410-1.patch, TEZ-2410-1.patch, TEZ-2410-2.patch, TEZ-2410-3.patch VertexGroupCommitFinishedEvent may be logged for non-vertex group commits. VertexGroupCommitFinishedEvent may be logged for each member vertex of the group instead of once per group. VertexCommitStartedEvent may be logged for each output of vertex -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TEZ-1961) Remove misleading exception No running dag from AM logs
[ https://issues.apache.org/jira/browse/TEZ-1961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14533653#comment-14533653 ] Jeff Zhang edited comment on TEZ-1961 at 5/8/15 12:48 AM: -- I also rename DAGClientHandler#getSessionStatus to DAGClientHandler#getTezAppMasterStatus and make it support non-session mode in this patch to avoid a misleading name. If that is acceptable, I can do it in this patch, because it is a very simple change.
{noformat}
- public synchronized TezAppMasterStatus getSessionStatus() throws TezException {
-if (!dagAppMaster.isSession()) {
- throw new TezException(Unsupported operation as AM not running in
- + session mode);
-}
+ public synchronized TezAppMasterStatus getTezAppMasterStatus() throws TezException {
 switch (dagAppMaster.getState()) {
 case NEW:
 case INITED:
{noformat}
was (Author: zjffdu): I also rename DAGClientHandler#getSessionState to DAGClientHandler#getAMState and make it support non-session mode in this patch to avoid a misleading name. If that is acceptable, I can do it in this patch, because it is a very simple change.
{noformat}
- public synchronized TezAppMasterStatus getSessionStatus() throws TezException {
-if (!dagAppMaster.isSession()) {
- throw new TezException(Unsupported operation as AM not running in
- + session mode);
-}
+ public synchronized TezAppMasterStatus getTezAppMasterStatus() throws TezException {
 switch (dagAppMaster.getState()) {
 case NEW:
 case INITED:
{noformat}
Remove misleading exception No running dag from AM logs - Key: TEZ-1961 URL: https://issues.apache.org/jira/browse/TEZ-1961 Project: Apache Tez Issue Type: Improvement Reporter: Siddharth Seth Assignee: Jeff Zhang Priority: Critical Attachments: TEZ-1961-1.patch, TEZ-1961-2.patch, TEZ-1961-3.patch
{code}
15/01/14 16:45:06 INFO ipc.Server: IPC Server handler 0 on 51000, call org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPB.getDAGStatus from Call#0 Retry#0
org.apache.tez.dag.api.TezException: No running dag at present
 at org.apache.tez.dag.api.client.DAGClientHandler.getDAG(DAGClientHandler.java:84)
 at org.apache.tez.dag.api.client.DAGClientHandler.getACLManager(DAGClientHandler.java:151)
 at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPBServerImpl.getDAGStatus(DAGClientAMProtocolBlockingPBServerImpl.java:94)
 at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolRPC$DAGClientAMProtocol$2.callBlockingMethod(DAGClientAMProtocolRPC.java:7375)
 at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2041)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2037)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2035)
15/01/14 16:45:06 INFO client.DAGClientImpl: DAG initialized: CurrentState=Running
{code}
This exception shows up fairly often and isn't very relevant - queries before a DAG is submitted to the AM. This is very misleading, especially for folks new to Tez, and should be removed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TEZ-1961) Remove misleading exception No running dag from AM logs
[ https://issues.apache.org/jira/browse/TEZ-1961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14533653#comment-14533653 ] Jeff Zhang edited comment on TEZ-1961 at 5/8/15 12:47 AM: -- I also rename DAGClientHandler#getSessionState to DAGClientHandler#getAMState and make it support non-session mode in this patch to avoid a misleading name. If that is acceptable, I can do it in this patch, because it is a very simple change.
{noformat}
- public synchronized TezAppMasterStatus getSessionStatus() throws TezException {
-if (!dagAppMaster.isSession()) {
- throw new TezException(Unsupported operation as AM not running in
- + session mode);
-}
+ public synchronized TezAppMasterStatus getTezAppMasterStatus() throws TezException {
 switch (dagAppMaster.getState()) {
 case NEW:
 case INITED:
{noformat}
was (Author: zjffdu): I also rename DAGClientHandler#getSessionState to DAGClientHandler#getAMState and make it support non-session mode in this patch to avoid a misleading name. If that is acceptable, I can do it in this patch, because it is a very simple change.
Remove misleading exception No running dag from AM logs - Key: TEZ-1961 URL: https://issues.apache.org/jira/browse/TEZ-1961 Project: Apache Tez Issue Type: Improvement Reporter: Siddharth Seth Assignee: Jeff Zhang Priority: Critical Attachments: TEZ-1961-1.patch, TEZ-1961-2.patch, TEZ-1961-3.patch
{code}
15/01/14 16:45:06 INFO ipc.Server: IPC Server handler 0 on 51000, call org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPB.getDAGStatus from Call#0 Retry#0
org.apache.tez.dag.api.TezException: No running dag at present
 at org.apache.tez.dag.api.client.DAGClientHandler.getDAG(DAGClientHandler.java:84)
 at org.apache.tez.dag.api.client.DAGClientHandler.getACLManager(DAGClientHandler.java:151)
 at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPBServerImpl.getDAGStatus(DAGClientAMProtocolBlockingPBServerImpl.java:94)
 at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolRPC$DAGClientAMProtocol$2.callBlockingMethod(DAGClientAMProtocolRPC.java:7375)
 at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2041)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2037)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2035)
15/01/14 16:45:06 INFO client.DAGClientImpl: DAG initialized: CurrentState=Running
{code}
This exception shows up fairly often and isn't very relevant - queries before a DAG is submitted to the AM. This is very misleading, especially for folks new to Tez, and should be removed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2410) VertexGroupCommitFinishedEvent VertexCommitStartedEvent is not logged correctly
[ https://issues.apache.org/jira/browse/TEZ-2410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14533726#comment-14533726 ] Bikas Saha commented on TEZ-2410: - +1 lgtm. Did not find any testDAGCommitSucceeded_OnDAGSuccess. Somewhere there should be this check for vertex v1 commit on dag success, right? historyEventHandler.verifyVertexGroupCommitFinishedEvent(v1, 0); VertexGroupCommitFinishedEvent VertexCommitStartedEvent is not logged correctly - Key: TEZ-2410 URL: https://issues.apache.org/jira/browse/TEZ-2410 Project: Apache Tez Issue Type: Bug Affects Versions: 0.7.0 Reporter: Jeff Zhang Assignee: Jeff Zhang Priority: Blocker Attachments: TEZ-2410-1.patch, TEZ-2410-1.patch, TEZ-2410-2.patch, TEZ-2410-3.patch VertexGroupCommitFinishedEvent may be logged for non-vertex group commits. VertexGroupCommitFinishedEvent may be logged for each member vertex of the group instead of once per group. VertexCommitStartedEvent may be logged for each output of vertex -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-776) Reduce AM mem usage caused by storing TezEvents
[ https://issues.apache.org/jira/browse/TEZ-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14533428#comment-14533428 ] Bikas Saha commented on TEZ-776: Thanks for the reviews. Uploading rebased patch for a jenkins run before committing. Reduce AM mem usage caused by storing TezEvents --- Key: TEZ-776 URL: https://issues.apache.org/jira/browse/TEZ-776 Project: Apache Tez Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Bikas Saha Priority: Blocker Attachments: TEZ-776.1.patch, TEZ-776.10.patch, TEZ-776.11.patch, TEZ-776.12.patch, TEZ-776.13.patch, TEZ-776.14.patch, TEZ-776.15.patch, TEZ-776.2.patch, TEZ-776.3.patch, TEZ-776.4.patch, TEZ-776.5.patch, TEZ-776.6.A.patch, TEZ-776.6.B.patch, TEZ-776.7.patch, TEZ-776.8.patch, TEZ-776.9.patch, TEZ-776.ondemand.1.patch, TEZ-776.ondemand.2.patch, TEZ-776.ondemand.3.patch, TEZ-776.ondemand.4.patch, TEZ-776.ondemand.5.patch, TEZ-776.ondemand.6.patch, TEZ-776.ondemand.7.patch, TEZ-776.ondemand.patch, With_Patch_AM_hotspots.png, With_Patch_AM_profile.png, Without_patch_AM_CPU_Usage.png, events-problem-solutions.txt, with_patch_jmc_output_of_AM.png, without_patch_jmc_output_of_AM.png This is open ended at the moment. A fair chunk of the AM heap is taken up by TezEvents (specifically DataMovementEvents - 64 bytes per event). Depending on the connection pattern - this puts limits on the number of tasks that can be processed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
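To get a feel for the 64-bytes-per-event figure in the TEZ-776 description above, here is a back-of-the-envelope sketch in Java. It assumes, purely for illustration, that one event is retained per producer/consumer pair on a fully connected edge. That connection pattern, the class name `EventHeapEstimate`, and the method `estimateBytes` are assumptions and are not taken from the issue.

```java
/** Hypothetical estimate of AM heap held by stored events on one edge. */
public class EventHeapEstimate {
    /** Bytes per stored event, as quoted in the issue description. */
    public static final long BYTES_PER_EVENT = 64L;

    /**
     * Assumes one retained event per (producer, consumer) pair.
     * multiplyExact throws ArithmeticException on overflow instead of wrapping silently.
     */
    public static long estimateBytes(long producers, long consumers) {
        long events = Math.multiplyExact(producers, consumers);
        return Math.multiplyExact(events, BYTES_PER_EVENT);
    }
}
```

Under that assumption, 10,000 producer tasks and 10,000 consumer tasks would retain 100,000,000 events, which comes to 6,400,000,000 bytes (6.4 GB) at 64 bytes each. This is why the connection pattern limits the number of tasks that can be processed.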
[jira] [Updated] (TEZ-776) Reduce AM mem usage caused by storing TezEvents
[ https://issues.apache.org/jira/browse/TEZ-776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-776: --- Attachment: TEZ-776.15.patch Reduce AM mem usage caused by storing TezEvents --- Key: TEZ-776 URL: https://issues.apache.org/jira/browse/TEZ-776 Project: Apache Tez Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Bikas Saha Priority: Blocker Attachments: TEZ-776.1.patch, TEZ-776.10.patch, TEZ-776.11.patch, TEZ-776.12.patch, TEZ-776.13.patch, TEZ-776.14.patch, TEZ-776.15.patch, TEZ-776.2.patch, TEZ-776.3.patch, TEZ-776.4.patch, TEZ-776.5.patch, TEZ-776.6.A.patch, TEZ-776.6.B.patch, TEZ-776.7.patch, TEZ-776.8.patch, TEZ-776.9.patch, TEZ-776.ondemand.1.patch, TEZ-776.ondemand.2.patch, TEZ-776.ondemand.3.patch, TEZ-776.ondemand.4.patch, TEZ-776.ondemand.5.patch, TEZ-776.ondemand.6.patch, TEZ-776.ondemand.7.patch, TEZ-776.ondemand.patch, With_Patch_AM_hotspots.png, With_Patch_AM_profile.png, Without_patch_AM_CPU_Usage.png, events-problem-solutions.txt, with_patch_jmc_output_of_AM.png, without_patch_jmc_output_of_AM.png This is open ended at the moment. A fair chunk of the AM heap is taken up by TezEvents (specifically DataMovementEvents - 64 bytes per event). Depending on the connection pattern - this puts limits on the number of tasks that can be processed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1961) Remove misleading exception No running dag from AM logs
[ https://issues.apache.org/jira/browse/TEZ-1961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14532244#comment-14532244 ] Jeff Zhang commented on TEZ-1961: - [~sseth] Please help review it.
* Wait for DAGAppMaster to go to RUNNING/SHUTDOWN, then return the DAGClient in non-session mode. This ensures that the DAG has started to run. [~bikassaha] Previously DAGClientAMProtocol#getAMStatus was not supported in non-session mode; was there any particular reason for that? I have made it supported in non-session mode in this patch.
* The patch causes a small difference in the tracking URL of the application. This is a YARN bug that has been fixed in YARN-2246 (fixed in hadoop-2.7). The bug is that there may be a suffix at the end of the trackingURL when the app moves from SUBMITTED to RUNNING. So after this patch, the trackingURL will change from http://localhost:53419/proxy/application_1430963524753_0005 to http://localhost:53419/proxy/application_1430963524753_0005/ui/
* Still keep the null currentDAG check in DAGClientHandler as a sanity check.
Remove misleading exception No running dag from AM logs - Key: TEZ-1961 URL: https://issues.apache.org/jira/browse/TEZ-1961 Project: Apache Tez Issue Type: Improvement Reporter: Siddharth Seth Assignee: Jeff Zhang Priority: Critical Attachments: TEZ-1961-1.patch, TEZ-1961-2.patch, TEZ-1961-3.patch
{code}
15/01/14 16:45:06 INFO ipc.Server: IPC Server handler 0 on 51000, call org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPB.getDAGStatus from Call#0 Retry#0
org.apache.tez.dag.api.TezException: No running dag at present
 at org.apache.tez.dag.api.client.DAGClientHandler.getDAG(DAGClientHandler.java:84)
 at org.apache.tez.dag.api.client.DAGClientHandler.getACLManager(DAGClientHandler.java:151)
 at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPBServerImpl.getDAGStatus(DAGClientAMProtocolBlockingPBServerImpl.java:94)
 at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolRPC$DAGClientAMProtocol$2.callBlockingMethod(DAGClientAMProtocolRPC.java:7375)
 at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2041)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2037)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2035)
15/01/14 16:45:06 INFO client.DAGClientImpl: DAG initialized: CurrentState=Running
{code}
This exception shows up fairly often and isn't very relevant - queries before a DAG is submitted to the AM. This is very misleading, especially for folks new to Tez, and should be removed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
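The first point in the TEZ-1961 review request above (only hand back a client once the AM has reached RUNNING or SHUTDOWN) amounts to a bounded polling loop. The Java sketch below is a generic illustration and not the patch itself: `WaitForAmStateSketch`, its `AmState` enum, and `awaitState` are hypothetical names, and the state supplier stands in for whatever status call the real client would make.

```java
import java.util.Set;
import java.util.function.Supplier;

/** Hypothetical sketch: poll a status source until it reports one of the target states. */
public class WaitForAmStateSketch {
    /** Illustrative states only; not the actual Tez enum. */
    public enum AmState { NEW, INITED, RUNNING, SHUTDOWN }

    /**
     * Polls up to maxPolls times, sleeping sleepMillis between attempts.
     * Returns the first target state seen, or throws if none is reached in time.
     */
    public static AmState awaitState(Supplier<AmState> poll, Set<AmState> targets,
                                     long sleepMillis, int maxPolls) {
        for (int i = 0; i < maxPolls; i++) {
            AmState current = poll.get();
            if (targets.contains(current)) {
                return current;
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for AM state", e);
            }
        }
        throw new IllegalStateException("AM did not reach " + targets + " within " + maxPolls + " polls");
    }
}
```

Bounding the number of polls keeps a caller from blocking forever if the AM never leaves its initial states. A caller that returns its client only after `awaitState` succeeds would not issue status queries before a DAG is running.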
[jira] [Updated] (TEZ-2426) Task input not complete before sending Task completed event
[ https://issues.apache.org/jira/browse/TEZ-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated TEZ-2426: Attachment: TEZ-2426.2.txt Updated patch to remove some unnecessary synchronization that caused the findbugs issues.
Task input not complete before sending Task completed event --- Key: TEZ-2426 URL: https://issues.apache.org/jira/browse/TEZ-2426 Project: Apache Tez Issue Type: Bug Affects Versions: 0.7.0 Reporter: Bikas Saha Assignee: Siddharth Seth Priority: Critical Attachments: TEZ-2426.1.txt, TEZ-2426.2.txt, am.log, container.log
Sequence of events:
1) Task A starts in a container
2) Task A complete event comes to AM
3) Task B starts in the same container
4) Task A's input calls some method on its context. Crashes with NPE
5) The crash sends an input failed event for Task A to the AM
6) Task A state machine crashes saying cannot handle failed after success
In some cases, it could be that a status update event is also sent after completion, though it is not clear whether that is related to the failed event being sent. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2426) Task input not complete before sending Task completed event
[ https://issues.apache.org/jira/browse/TEZ-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14533861#comment-14533861 ] Siddharth Seth commented on TEZ-2426: - [~rajesh.balamohan], [~bikassaha], [~zjffdu] - please review.
Task input not complete before sending Task completed event --- Key: TEZ-2426 URL: https://issues.apache.org/jira/browse/TEZ-2426 Project: Apache Tez Issue Type: Bug Affects Versions: 0.7.0 Reporter: Bikas Saha Assignee: Siddharth Seth Priority: Critical Attachments: TEZ-2426.1.txt, TEZ-2426.2.txt, am.log, container.log
Sequence of events:
1) Task A starts in a container
2) Task A complete event comes to AM
3) Task B starts in the same container
4) Task A's input calls some method on its context. Crashes with NPE
5) The crash sends an input failed event for Task A to the AM
6) Task A state machine crashes saying cannot handle failed after success
In some cases, it could be that a status update event is also sent after completion, though it is not clear whether that is related to the failed event being sent. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2426) Task input not complete before sending Task completed event
[ https://issues.apache.org/jira/browse/TEZ-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14533862#comment-14533862 ] Siddharth Seth commented on TEZ-2426: - Tested on a large noop job - ran through without any issues.
Task input not complete before sending Task completed event --- Key: TEZ-2426 URL: https://issues.apache.org/jira/browse/TEZ-2426 Project: Apache Tez Issue Type: Bug Affects Versions: 0.7.0 Reporter: Bikas Saha Assignee: Siddharth Seth Priority: Critical Attachments: TEZ-2426.1.txt, TEZ-2426.2.txt, am.log, container.log
Sequence of events:
1) Task A starts in a container
2) Task A complete event comes to AM
3) Task B starts in the same container
4) Task A's input calls some method on its context. Crashes with NPE
5) The crash sends an input failed event for Task A to the AM
6) Task A state machine crashes saying cannot handle failed after success
In some cases, it could be that a status update event is also sent after completion, though it is not clear whether that is related to the failed event being sent. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2412) Should kill vertex in DAGImpl#VertexRerunWhileCommitting
[ https://issues.apache.org/jira/browse/TEZ-2412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated TEZ-2412: Attachment: TEZ-2412-2.patch Rebased the patch and added some comments in the code. Should kill vertex in DAGImpl#VertexRerunWhileCommitting Key: TEZ-2412 URL: https://issues.apache.org/jira/browse/TEZ-2412 Project: Apache Tez Issue Type: Bug Affects Versions: 0.7.0 Reporter: Jeff Zhang Assignee: Jeff Zhang Priority: Blocker Attachments: TEZ-2412-1.patch, TEZ-2412-2.patch * When a vertex reruns, it moves to the RUNNING state, so it should be killed in DAGImpl#VertexRerunWhileCommitting -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2426) Task input not complete before sending Task completed event
[ https://issues.apache.org/jira/browse/TEZ-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14533764#comment-14533764 ] TezQA commented on TEZ-2426: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12731316/TEZ-2426.1.txt against master revision 05f77fe. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/651//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-TEZ-Build/651//artifact/patchprocess/newPatchFindbugsWarningstez-runtime-internals.html Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/651//console This message is automatically generated.
Task input not complete before sending Task completed event --- Key: TEZ-2426 URL: https://issues.apache.org/jira/browse/TEZ-2426 Project: Apache Tez Issue Type: Bug Affects Versions: 0.7.0 Reporter: Bikas Saha Assignee: Siddharth Seth Priority: Critical Attachments: TEZ-2426.1.txt, am.log, container.log
Sequence of events:
1) Task A starts in a container
2) Task A complete event comes to AM
3) Task B starts in the same container
4) Task A's input calls some method on its context. Crashes with NPE
5) The crash sends an input failed event for Task A to the AM
6) Task A state machine crashes saying cannot handle failed after success
In some cases, it could be that a status update event is also sent after completion, though it is not clear whether that is related to the failed event being sent. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TEZ-2429) Tez AM does not die after hitting internal error
[ https://issues.apache.org/jira/browse/TEZ-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14533776#comment-14533776 ] Jeff Zhang edited comment on TEZ-2429 at 5/8/15 2:47 AM: - Can reproduce the InvalidTransition in TestFaultTolerance, looking at the cause {code} 2015-05-06 23:55:54,421 ERROR [Dispatcher thread: Central] impl.DAGImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: DAG_VERTEX_RERUNNING at SUCCEEDED at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.tez.state.StateMachineTez.doTransition(StateMachineTez.java:57) at org.apache.tez.dag.app.dag.impl.DAGImpl.handle(DAGImpl.java:1079) at org.apache.tez.dag.app.dag.impl.DAGImpl.handle(DAGImpl.java:143) at org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher.handle(DAGAppMaster.java:1871) at org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher.handle(DAGAppMaster.java:1862) at org.apache.tez.common.AsyncDispatcher.dispatch(AsyncDispatcher.java:183) at org.apache.tez.common.AsyncDispatcher$1.run(AsyncDispatcher.java:114) at java.lang.Thread.run(Thread.java:662) {code} was (Author: zjffdu): Can produce the InvalidTransition in TestFaultTolerance, looking at the cause {code} 2015-05-06 23:55:54,421 ERROR [Dispatcher thread: Central] impl.DAGImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: DAG_VERTEX_RERUNNING at SUCCEEDED at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.tez.state.StateMachineTez.doTransition(StateMachineTez.java:57) at org.apache.tez.dag.app.dag.impl.DAGImpl.handle(DAGImpl.java:1079) at org.apache.tez.dag.app.dag.impl.DAGImpl.handle(DAGImpl.java:143) at org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher.handle(DAGAppMaster.java:1871) at org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher.handle(DAGAppMaster.java:1862) at org.apache.tez.common.AsyncDispatcher.dispatch(AsyncDispatcher.java:183) at org.apache.tez.common.AsyncDispatcher$1.run(AsyncDispatcher.java:114) at java.lang.Thread.run(Thread.java:662) {code} Tez AM does not die after hitting internal error - Key: TEZ-2429 URL: https://issues.apache.org/jira/browse/TEZ-2429 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah Priority: Blocker Attachments: syslog_dag_1430956448478_0001_16_post, syslog_dag_1430956448478_0001_17 From https://builds.apache.org/job/Tez-Build/1055/: 2015-05-06 23:55:54,421 ERROR [Dispatcher thread: Central] impl.DAGImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: DAG_VERTEX_RERUNNING at SUCCEEDED at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.tez.state.StateMachineTez.doTransition(StateMachineTez.java:57) at org.apache.tez.dag.app.dag.impl.DAGImpl.handle(DAGImpl.java:1079) at org.apache.tez.dag.app.dag.impl.DAGImpl.handle(DAGImpl.java:143) at org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher.handle(DAGAppMaster.java:1871) at org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher.handle(DAGAppMaster.java:1862) at 
org.apache.tez.common.AsyncDispatcher.dispatch(AsyncDispatcher.java:183) at org.apache.tez.common.AsyncDispatcher$1.run(AsyncDispatcher.java:114) at java.lang.Thread.run(Thread.java:662) 2015-05-06 23:55:54,423 INFO [Dispatcher thread: Central] app.DAGAppMaster: Cleaning up DAG: name=testRandomFailingInputs, with id=dag_1430956448478_0001_16 2015-05-06 23:55:54,423 INFO [Dispatcher thread: Central] app.DAGAppMaster: Completed cleanup for DAG: name=testRandomFailingInputs, with id=dag_1430956448478_0001_16 2015-05-06 23:55:54,424 INFO [Dispatcher thread: Central] impl.DAGImpl: dag_1430956448478_0001_16 terminating due to internal error 2015-05-06 23:55:54,433 INFO [IPC Server handler 0 on 47432] app.DAGAppMaster: Starting DAG submitted via
[jira] [Commented] (TEZ-2426) Task input not complete before sending Task completed event
[ https://issues.apache.org/jira/browse/TEZ-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14533824#comment-14533824 ] TezQA commented on TEZ-2426: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12731348/TEZ-2426.2.txt against master revision 05f77fe. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/654//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/654//console This message is automatically generated. Task input not complete before sending Task completed event --- Key: TEZ-2426 URL: https://issues.apache.org/jira/browse/TEZ-2426 Project: Apache Tez Issue Type: Bug Affects Versions: 0.7.0 Reporter: Bikas Saha Assignee: Siddharth Seth Priority: Critical Attachments: TEZ-2426.1.txt, TEZ-2426.2.txt, am.log, container.log Sequence of events 1) Task A starts in a container 2) Task A complete event comes to AM 3) Task B starts in the same container 4) Task A's input calls some method on its context. 
Crashes with NPE 5) The crash sends an input failed event for Task A to the AM 6) Task A's state machine crashes, saying it cannot handle failed after success. In some cases, a status update event may also be sent after completion, though it is not clear whether that is related to the failed event being sent. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
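The failure in step 6 can be illustrated with a minimal sketch of a table-driven state machine that has no transition registered out of a terminal state. Every name here (SketchState, SketchEvent, SketchTaskStateMachine) is hypothetical and invented for illustration; this is not the Tez or YARN state machine code, and it throws IllegalStateException rather than YARN's InvalidStateTransitonException.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical states and events, loosely modelled on the sequence described above.
enum SketchState { RUNNING, SUCCEEDED, FAILED }

enum SketchEvent { TASK_COMPLETED, INPUT_FAILED }

final class SketchTaskStateMachine {
    // Transition table: current state -> (event -> next state).
    private static final Map<SketchState, Map<SketchEvent, SketchState>> TRANSITIONS =
        new EnumMap<>(SketchState.class);

    static {
        Map<SketchEvent, SketchState> fromRunning = new EnumMap<>(SketchEvent.class);
        fromRunning.put(SketchEvent.TASK_COMPLETED, SketchState.SUCCEEDED);
        fromRunning.put(SketchEvent.INPUT_FAILED, SketchState.FAILED);
        TRANSITIONS.put(SketchState.RUNNING, fromRunning);
        // SUCCEEDED and FAILED are terminal in this sketch: nothing is registered for them.
    }

    private SketchState state = SketchState.RUNNING;

    SketchState getState() {
        return state;
    }

    // Applies the event, or rejects it if the current state has no matching transition.
    void handle(SketchEvent event) {
        Map<SketchEvent, SketchState> row = TRANSITIONS.get(state);
        SketchState next = (row == null) ? null : row.get(event);
        if (next == null) {
            throw new IllegalStateException("Invalid event: " + event + " at " + state);
        }
        state = next;
    }
}
```

In this sketch, a late INPUT_FAILED arriving after TASK_COMPLETED is rejected because SUCCEEDED has no outgoing transitions, which mirrors the "cannot handle failed after success" symptom in the report.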
Success: TEZ-2426 PreCommit Build #654
Jira: https://issues.apache.org/jira/browse/TEZ-2426 Build: https://builds.apache.org/job/PreCommit-TEZ-Build/654/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 2843 lines...] [INFO] Final Memory: 76M/933M [INFO] {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12731348/TEZ-2426.2.txt against master revision 05f77fe. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/654//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/654//console This message is automatically generated. == == Adding comment to Jira. == == Comment added. be199b38066bd334d1edd0003b0bd729e1106855 logged out == == Finished build. == == Archiving artifacts Sending artifact delta relative to PreCommit-TEZ-Build #652 Archived 44 artifacts Archive block size is 32768 Received 22 blocks and 2056276 bytes Compression is 26.0% Took 2 sec Description set: TEZ-2426 Recording test results Email was triggered for: Success Sending email for trigger: Success ### ## FAILED TESTS (if any) ## All tests passed
Failed: TEZ-2421 PreCommit Build #655
Jira: https://issues.apache.org/jira/browse/TEZ-2421 Build: https://builds.apache.org/job/PreCommit-TEZ-Build/655/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 2665 lines...] {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12731352/TEZ-2421.3.patch against master revision 05f77fe. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in : org.apache.tez.test.TestAMRecovery org.apache.tez.test.TestDAGRecovery Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/655//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/655//console This message is automatically generated. == == Adding comment to Jira. == == Comment added. 822086f2d482afc631ba47942dee53d56cefd63e logged out == == Finished build. == == Build step 'Execute shell' marked build as failure Archiving artifacts Sending artifact delta relative to PreCommit-TEZ-Build #654 Archived 44 artifacts Archive block size is 32768 Received 18 blocks and 2176850 bytes Compression is 21.3% Took 1.3 sec [description-setter] Could not determine description. Recording test results Email was triggered for: Failure Sending email for trigger: Failure ### ## FAILED TESTS (if any) ## 8 tests failed. 
REGRESSION: org.apache.tez.test.TestAMRecovery.testVertexPartiallyFinished_Broadcast Error Message: expected:SUCCEEDED but was:FAILED Stack Trace: java.lang.AssertionError: expected:SUCCEEDED but was:FAILED at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:144) at org.apache.tez.test.TestAMRecovery.runDAGAndVerify(TestAMRecovery.java:412) at org.apache.tez.test.TestAMRecovery.testVertexPartiallyFinished_Broadcast(TestAMRecovery.java:206) REGRESSION: org.apache.tez.test.TestAMRecovery.testVertexPartialFinished_One2One Error Message: expected:SUCCEEDED but was:FAILED Stack Trace: java.lang.AssertionError: expected:SUCCEEDED but was:FAILED at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:144) at org.apache.tez.test.TestAMRecovery.runDAGAndVerify(TestAMRecovery.java:412) at org.apache.tez.test.TestAMRecovery.testVertexPartialFinished_One2One(TestAMRecovery.java:268) REGRESSION: org.apache.tez.test.TestAMRecovery.testVertexPartiallyFinished_ScatterGather Error Message: expected:SUCCEEDED but was:FAILED Stack Trace: java.lang.AssertionError: expected:SUCCEEDED but was:FAILED at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:144) at org.apache.tez.test.TestAMRecovery.runDAGAndVerify(TestAMRecovery.java:412) at org.apache.tez.test.TestAMRecovery.testVertexPartiallyFinished_ScatterGather(TestAMRecovery.java:332) REGRESSION: org.apache.tez.test.TestAMRecovery.testVertexCompletelyFinished_ScatterGather Error Message: expected:SUCCEEDED but was:FAILED Stack Trace: java.lang.AssertionError: expected:SUCCEEDED but was:FAILED at org.junit.Assert.fail(Assert.java:88) at
[jira] [Commented] (TEZ-2421) Deadlock in AM because attempt and vertex locking each other out
[ https://issues.apache.org/jira/browse/TEZ-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14533857#comment-14533857 ] TezQA commented on TEZ-2421: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12731352/TEZ-2421.3.patch against master revision 05f77fe. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in : org.apache.tez.test.TestAMRecovery org.apache.tez.test.TestDAGRecovery Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/655//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/655//console This message is automatically generated. Deadlock in AM because attempt and vertex locking each other out Key: TEZ-2421 URL: https://issues.apache.org/jira/browse/TEZ-2421 Project: Apache Tez Issue Type: Bug Reporter: Bikas Saha Assignee: Bikas Saha Priority: Blocker Attachments: TEZ-2421.1.patch, TEZ-2421.2.patch, TEZ-2421.3.patch Ideally locks should be taken one way - either going down or up. Preferably not going up because most such data can be passed in during object construction. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
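The locking guideline in the TEZ-2421 description (take locks in one direction only, and hand the child what it needs at construction) can be sketched roughly as follows. SketchVertex and SketchAttempt are hypothetical classes made up for this illustration; they are not the Tez VertexImpl or TaskAttemptImpl, and the attached patches may solve the deadlock differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical child object. The parent data it needs is copied in at
// construction, so it never calls "up" into the vertex while holding its lock.
final class SketchAttempt {
    private final ReentrantLock lock = new ReentrantLock();
    private final String vertexName; // immutable copy of parent data
    private int progress;

    SketchAttempt(String vertexName) {
        this.vertexName = vertexName;
    }

    void advance() {
        lock.lock();
        try {
            progress++;
        } finally {
            lock.unlock();
        }
    }

    String describe() {
        lock.lock();
        try {
            return vertexName + ":" + progress;
        } finally {
            lock.unlock();
        }
    }
}

// Hypothetical parent object. Locks are only ever taken "down":
// the vertex lock first, then an attempt lock.
final class SketchVertex {
    private final ReentrantLock lock = new ReentrantLock();
    private final String name;
    private final List<SketchAttempt> attempts = new ArrayList<>();

    SketchVertex(String name) {
        this.name = name;
    }

    SketchAttempt newAttempt() {
        lock.lock();
        try {
            SketchAttempt attempt = new SketchAttempt(name);
            attempts.add(attempt);
            return attempt;
        } finally {
            lock.unlock();
        }
    }

    List<String> report() {
        lock.lock();
        try {
            List<String> out = new ArrayList<>();
            for (SketchAttempt attempt : attempts) {
                out.add(attempt.describe()); // vertex lock held, attempt lock taken second
            }
            return out;
        } finally {
            lock.unlock();
        }
    }
}
```

Because no code path in this sketch acquires the vertex lock while holding an attempt lock, the two locks cannot be taken in opposite orders by different threads.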
Failed: TEZ-2426 PreCommit Build #651
Jira: https://issues.apache.org/jira/browse/TEZ-2426 Build: https://builds.apache.org/job/PreCommit-TEZ-Build/651/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 2851 lines...] {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12731316/TEZ-2426.1.txt against master revision 05f77fe. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/651//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-TEZ-Build/651//artifact/patchprocess/newPatchFindbugsWarningstez-runtime-internals.html Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/651//console This message is automatically generated. == == Adding comment to Jira. == == Comment added. 49244363923f36d5d16c2f408d9b04fc45947e43 logged out == == Finished build. == == Build step 'Execute shell' marked build as failure Archiving artifacts Sending artifact delta relative to PreCommit-TEZ-Build #650 Archived 44 artifacts Archive block size is 32768 Received 22 blocks and 2066358 bytes Compression is 25.9% Took 1.6 sec [description-setter] Could not determine description. Recording test results Email was triggered for: Failure Sending email for trigger: Failure ### ## FAILED TESTS (if any) ## All tests passed
[jira] [Commented] (TEZ-2410) VertexGroupCommitFinishedEvent VertexCommitStartedEvent is not logged correctly
[ https://issues.apache.org/jira/browse/TEZ-2410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14533775#comment-14533775 ] TezQA commented on TEZ-2410: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12731315/TEZ-2410-3.patch against master revision 05f77fe. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/652//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/652//console This message is automatically generated. VertexGroupCommitFinishedEvent VertexCommitStartedEvent is not logged correctly - Key: TEZ-2410 URL: https://issues.apache.org/jira/browse/TEZ-2410 Project: Apache Tez Issue Type: Bug Affects Versions: 0.7.0 Reporter: Jeff Zhang Assignee: Jeff Zhang Priority: Blocker Attachments: TEZ-2410-1.patch, TEZ-2410-1.patch, TEZ-2410-2.patch, TEZ-2410-3.patch VertexGroupCommitFinishedEvent may be logged for non-vertex group commits. VertexGroupCommitFinishedEvent may be logged for each member vertex of the group instead of once per group. VertexCommitStartedEvent may be logged for each output of vertex -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Success: TEZ-2410 PreCommit Build #652
Jira: https://issues.apache.org/jira/browse/TEZ-2410 Build: https://builds.apache.org/job/PreCommit-TEZ-Build/652/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 2842 lines...] [INFO] Final Memory: 71M/945M [INFO] {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12731315/TEZ-2410-3.patch against master revision 05f77fe. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/652//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/652//console This message is automatically generated. == == Adding comment to Jira. == == Comment added. 47b641bb1b1931096578ad216acbafa6b125f8bc logged out == == Finished build. == == Archiving artifacts Sending artifact delta relative to PreCommit-TEZ-Build #650 Archived 44 artifacts Archive block size is 32768 Received 4 blocks and 2647421 bytes Compression is 4.7% Took 1.1 sec Description set: TEZ-2410 Recording test results Email was triggered for: Success Sending email for trigger: Success ### ## FAILED TESTS (if any) ## All tests passed
Failed: TEZ-2421 PreCommit Build #653
Jira: https://issues.apache.org/jira/browse/TEZ-2421 Build: https://builds.apache.org/job/PreCommit-TEZ-Build/653/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 2452 lines...] {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12731341/TEZ-2421.2.patch against master revision 05f77fe. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in : org.apache.tez.dag.app.dag.impl.TestDAGImpl Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/653//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/653//console This message is automatically generated. == == Adding comment to Jira. == == Comment added. 6dfefc7cbf88aa396327d10f8c9ae72fe629af3f logged out == == Finished build. == == Build step 'Execute shell' marked build as failure Archiving artifacts Sending artifact delta relative to PreCommit-TEZ-Build #650 Archived 44 artifacts Archive block size is 32768 Received 19 blocks and 2123407 bytes Compression is 22.7% Took 1.3 sec [description-setter] Could not determine description. Recording test results Publish JUnit test result report is waiting for a checkpoint on PreCommit-TEZ-Build #652 Email was triggered for: Failure Sending email for trigger: Failure ### ## FAILED TESTS (if any) ## 2 tests failed. 
REGRESSION: org.apache.tez.dag.app.dag.impl.TestDAGImpl.testEdgeManager_GetNumSourceTaskPhysicalOutputs Error Message: null Stack Trace: java.lang.NullPointerException: null at org.apache.tez.dag.app.dag.impl.TestDAGImpl.testEdgeManager_GetNumSourceTaskPhysicalOutputs(TestDAGImpl.java:1004) REGRESSION: org.apache.tez.dag.app.dag.impl.TestDAGImpl.testEdgeManager_GetNumDestinationTaskPhysicalInputs Error Message: null Stack Trace: java.lang.NullPointerException: null at org.apache.tez.dag.app.dag.impl.TestDAGImpl.testEdgeManager_GetNumDestinationTaskPhysicalInputs(TestDAGImpl.java:982)
[jira] [Commented] (TEZ-2429) Tez AM does not die after hitting internal error
[ https://issues.apache.org/jira/browse/TEZ-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14533776#comment-14533776 ] Jeff Zhang commented on TEZ-2429: - Can produce the InvalidTransition in TestFaultTolerance, looking at the cause {code} 2015-05-06 23:55:54,421 ERROR [Dispatcher thread: Central] impl.DAGImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: DAG_VERTEX_RERUNNING at SUCCEEDED at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.tez.state.StateMachineTez.doTransition(StateMachineTez.java:57) at org.apache.tez.dag.app.dag.impl.DAGImpl.handle(DAGImpl.java:1079) at org.apache.tez.dag.app.dag.impl.DAGImpl.handle(DAGImpl.java:143) at org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher.handle(DAGAppMaster.java:1871) at org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher.handle(DAGAppMaster.java:1862) at org.apache.tez.common.AsyncDispatcher.dispatch(AsyncDispatcher.java:183) at org.apache.tez.common.AsyncDispatcher$1.run(AsyncDispatcher.java:114) at java.lang.Thread.run(Thread.java:662) {code} Tez AM does not die after hitting internal error - Key: TEZ-2429 URL: https://issues.apache.org/jira/browse/TEZ-2429 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah Priority: Blocker Attachments: syslog_dag_1430956448478_0001_16_post, syslog_dag_1430956448478_0001_17 From https://builds.apache.org/job/Tez-Build/1055/: 2015-05-06 23:55:54,421 ERROR [Dispatcher thread: Central] impl.DAGImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: DAG_VERTEX_RERUNNING at SUCCEEDED at 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.tez.state.StateMachineTez.doTransition(StateMachineTez.java:57) at org.apache.tez.dag.app.dag.impl.DAGImpl.handle(DAGImpl.java:1079) at org.apache.tez.dag.app.dag.impl.DAGImpl.handle(DAGImpl.java:143) at org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher.handle(DAGAppMaster.java:1871) at org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher.handle(DAGAppMaster.java:1862) at org.apache.tez.common.AsyncDispatcher.dispatch(AsyncDispatcher.java:183) at org.apache.tez.common.AsyncDispatcher$1.run(AsyncDispatcher.java:114) at java.lang.Thread.run(Thread.java:662) 2015-05-06 23:55:54,423 INFO [Dispatcher thread: Central] app.DAGAppMaster: Cleaning up DAG: name=testRandomFailingInputs, with id=dag_1430956448478_0001_16 2015-05-06 23:55:54,423 INFO [Dispatcher thread: Central] app.DAGAppMaster: Completed cleanup for DAG: name=testRandomFailingInputs, with id=dag_1430956448478_0001_16 2015-05-06 23:55:54,424 INFO [Dispatcher thread: Central] impl.DAGImpl: dag_1430956448478_0001_16 terminating due to internal error 2015-05-06 23:55:54,433 INFO [IPC Server handler 0 on 47432] app.DAGAppMaster: Starting DAG submitted via RPC: testBasicInputFailureWithExit 2015-05-06 23:55:54,455 ERROR [Dispatcher thread: Central] common.AsyncDispatcher: Error in dispatcher thread java.lang.NullPointerException at org.apache.tez.dag.history.recovery.RecoveryService.doFlush(RecoveryService.java:458) at org.apache.tez.dag.history.recovery.RecoveryService.handle(RecoveryService.java:289) at org.apache.tez.dag.history.HistoryEventHandler.handleCriticalEvent(HistoryEventHandler.java:102) at org.apache.tez.dag.app.dag.impl.DAGImpl.logJobHistoryUnsuccesfulEvent(DAGImpl.java:1161) at 
org.apache.tez.dag.app.dag.impl.DAGImpl.finished(DAGImpl.java:1275) at org.apache.tez.dag.app.dag.impl.DAGImpl.access$2600(DAGImpl.java:144) at org.apache.tez.dag.app.dag.impl.DAGImpl$InternalErrorTransition.transition(DAGImpl.java:2151) at org.apache.tez.dag.app.dag.impl.DAGImpl$InternalErrorTransition.transition(DAGImpl.java:2140) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
[jira] [Commented] (TEZ-2410) VertexGroupCommitFinishedEvent VertexCommitStartedEvent is not logged correctly
[ https://issues.apache.org/jira/browse/TEZ-2410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14533851#comment-14533851 ] Jeff Zhang commented on TEZ-2410: - Committed to branch-0.7 and master. VertexGroupCommitFinishedEvent VertexCommitStartedEvent is not logged correctly - Key: TEZ-2410 URL: https://issues.apache.org/jira/browse/TEZ-2410 Project: Apache Tez Issue Type: Bug Affects Versions: 0.7.0 Reporter: Jeff Zhang Assignee: Jeff Zhang Priority: Blocker Fix For: 0.7.0 Attachments: TEZ-2410-1.patch, TEZ-2410-1.patch, TEZ-2410-2.patch, TEZ-2410-3.patch VertexGroupCommitFinishedEvent may be logged for non-vertex group commits. VertexGroupCommitFinishedEvent may be logged for each member vertex of the group instead of once per group. VertexCommitStartedEvent may be logged for each output of vertex -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2421) Deadlock in AM because attempt and vertex locking each other out
[ https://issues.apache.org/jira/browse/TEZ-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-2421: Attachment: TEZ-2421.2.patch Patch with a few more tests. Deadlock in AM because attempt and vertex locking each other out Key: TEZ-2421 URL: https://issues.apache.org/jira/browse/TEZ-2421 Project: Apache Tez Issue Type: Bug Reporter: Bikas Saha Assignee: Bikas Saha Priority: Blocker Attachments: TEZ-2421.1.patch, TEZ-2421.2.patch Ideally locks should be taken one way - either going down or up. Preferably not going up because most such data can be passed in during object construction. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2430) Add test for RecoveryEvent Spec
[ https://issues.apache.org/jira/browse/TEZ-2430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated TEZ-2430: Target Version/s: 0.7.1 Add test for RecoveryEvent Spec --- Key: TEZ-2430 URL: https://issues.apache.org/jira/browse/TEZ-2430 Project: Apache Tez Issue Type: Sub-task Reporter: Jeff Zhang Assignee: Jeff Zhang * Ordering of RecoveryEvents. ** DataMovementEvent must be logged before TaskAttemptFinishedEvent ** InputDataInfoEvent must be logged before VertexInitializedEvent (already covered in the existing test) * Frequency of RecoveryEvents, e.g. TaskAttemptStartedEvent can only be logged once, but TaskAttemptFinishedEvent can be logged twice (when a TaskAttempt transitions from SUCCEEDED to FAILED). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
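As a rough illustration of the ordering and frequency checks listed in TEZ-2430, the sketch below works on a plain list of event type names in the order they were logged. RecoveryEventSpecSketch and its two helper methods are hypothetical; a real test would presumably read the events from the recovery log through Tez's own classes, which is not shown here.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical helper; event types are represented as plain strings for illustration.
final class RecoveryEventSpecSketch {
    private RecoveryEventSpecSketch() {
    }

    // Returns true if `second` was never logged, or if `first` was logged
    // at some point before the earliest occurrence of `second`.
    static boolean loggedBefore(List<String> events, String first, String second) {
        int secondIndex = events.indexOf(second);
        if (secondIndex < 0) {
            return true;
        }
        int firstIndex = events.indexOf(first);
        return firstIndex >= 0 && firstIndex < secondIndex;
    }

    // Returns how many times the given event type appears in the log.
    static int count(List<String> events, String type) {
        return Collections.frequency(events, type);
    }
}
```

A test built on these helpers could assert that loggedBefore holds for DataMovementEvent and TaskAttemptFinishedEvent, that TaskAttemptStartedEvent is counted exactly once, and that TaskAttemptFinishedEvent is counted at most twice.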
[jira] [Commented] (TEZ-2421) Deadlock in AM because attempt and vertex locking each other out
[ https://issues.apache.org/jira/browse/TEZ-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14533771#comment-14533771 ] TezQA commented on TEZ-2421: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12731341/TEZ-2421.2.patch against master revision 05f77fe. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in : org.apache.tez.dag.app.dag.impl.TestDAGImpl Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/653//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/653//console This message is automatically generated. Deadlock in AM because attempt and vertex locking each other out Key: TEZ-2421 URL: https://issues.apache.org/jira/browse/TEZ-2421 Project: Apache Tez Issue Type: Bug Reporter: Bikas Saha Assignee: Bikas Saha Priority: Blocker Attachments: TEZ-2421.1.patch, TEZ-2421.2.patch Ideally locks should be taken one way - either going down or up. Preferably not going up because most such data can be passed in during object construction. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2429) Tez AM does not die after hitting internal error
[ https://issues.apache.org/jira/browse/TEZ-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14533815#comment-14533815 ] Bikas Saha commented on TEZ-2429: - The main issue, though, is whether the AM fails to shut down after the InternalError. If it shuts down, then this should not be a blocker for 0.7.0. Tez AM does not die after hitting internal error - Key: TEZ-2429 URL: https://issues.apache.org/jira/browse/TEZ-2429 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah Priority: Blocker Attachments: syslog_dag_1430956448478_0001_16_post, syslog_dag_1430956448478_0001_17 From https://builds.apache.org/job/Tez-Build/1055/: 2015-05-06 23:55:54,421 ERROR [Dispatcher thread: Central] impl.DAGImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: DAG_VERTEX_RERUNNING at SUCCEEDED at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.tez.state.StateMachineTez.doTransition(StateMachineTez.java:57) at org.apache.tez.dag.app.dag.impl.DAGImpl.handle(DAGImpl.java:1079) at org.apache.tez.dag.app.dag.impl.DAGImpl.handle(DAGImpl.java:143) at org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher.handle(DAGAppMaster.java:1871) at org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher.handle(DAGAppMaster.java:1862) at org.apache.tez.common.AsyncDispatcher.dispatch(AsyncDispatcher.java:183) at org.apache.tez.common.AsyncDispatcher$1.run(AsyncDispatcher.java:114) at java.lang.Thread.run(Thread.java:662) 2015-05-06 23:55:54,423 INFO [Dispatcher thread: Central] app.DAGAppMaster: Cleaning up DAG: name=testRandomFailingInputs, with id=dag_1430956448478_0001_16 2015-05-06 23:55:54,423 INFO [Dispatcher thread: Central]
app.DAGAppMaster: Completed cleanup for DAG: name=testRandomFailingInputs, with id=dag_1430956448478_0001_16 2015-05-06 23:55:54,424 INFO [Dispatcher thread: Central] impl.DAGImpl: dag_1430956448478_0001_16 terminating due to internal error 2015-05-06 23:55:54,433 INFO [IPC Server handler 0 on 47432] app.DAGAppMaster: Starting DAG submitted via RPC: testBasicInputFailureWithExit 2015-05-06 23:55:54,455 ERROR [Dispatcher thread: Central] common.AsyncDispatcher: Error in dispatcher thread java.lang.NullPointerException at org.apache.tez.dag.history.recovery.RecoveryService.doFlush(RecoveryService.java:458) at org.apache.tez.dag.history.recovery.RecoveryService.handle(RecoveryService.java:289) at org.apache.tez.dag.history.HistoryEventHandler.handleCriticalEvent(HistoryEventHandler.java:102) at org.apache.tez.dag.app.dag.impl.DAGImpl.logJobHistoryUnsuccesfulEvent(DAGImpl.java:1161) at org.apache.tez.dag.app.dag.impl.DAGImpl.finished(DAGImpl.java:1275) at org.apache.tez.dag.app.dag.impl.DAGImpl.access$2600(DAGImpl.java:144) at org.apache.tez.dag.app.dag.impl.DAGImpl$InternalErrorTransition.transition(DAGImpl.java:2151) at org.apache.tez.dag.app.dag.impl.DAGImpl$InternalErrorTransition.transition(DAGImpl.java:2140) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.tez.state.StateMachineTez.doTransition(StateMachineTez.java:57) at org.apache.tez.dag.app.dag.impl.DAGImpl.handle(DAGImpl.java:1079) at org.apache.tez.dag.app.dag.impl.DAGImpl.handle(DAGImpl.java:143) at org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher.handle(DAGAppMaster.java:1871) at 
org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher.handle(DAGAppMaster.java:1862) at org.apache.tez.common.AsyncDispatcher.dispatch(AsyncDispatcher.java:183) at org.apache.tez.common.AsyncDispatcher$1.run(AsyncDispatcher.java:114) at java.lang.Thread.run(Thread.java:662) 2015-05-06 23:55:54,456 INFO [Dispatcher thread: Central] impl.VertexImpl: Killing tasks in vertex: vertex_1430956448478_0001_16_10 [l4v1] due to trigger: INTERNAL_ERROR 2015-05-06 23:55:54,456 INFO [Dispatcher thread: Central] impl.VertexImpl: vertex_1430956448478_0001_16_10
[jira] [Comment Edited] (TEZ-2429) Tez AM does not die after hitting internal error
[ https://issues.apache.org/jira/browse/TEZ-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14533815#comment-14533815 ] Bikas Saha edited comment on TEZ-2429 at 5/8/15 3:58 AM: - The main issue though, is whether the AM does not shutdown after the InternalError. If it shuts down then this should not be a blocker for 0.7.0. Can be fixed after 0.7.0 unless the hang is a regression. was (Author: bikassaha): The main issue though, is whether the AM does not shutdown after the InternalError. If it shuts down then this should not be a blocker for 0.7.0. Tez AM does not die after hitting internal error - Key: TEZ-2429 URL: https://issues.apache.org/jira/browse/TEZ-2429 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah Priority: Blocker Attachments: syslog_dag_1430956448478_0001_16_post, syslog_dag_1430956448478_0001_17 From https://builds.apache.org/job/Tez-Build/1055/: 2015-05-06 23:55:54,421 ERROR [Dispatcher thread: Central] impl.DAGImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: DAG_VERTEX_RERUNNING at SUCCEEDED at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.tez.state.StateMachineTez.doTransition(StateMachineTez.java:57) at org.apache.tez.dag.app.dag.impl.DAGImpl.handle(DAGImpl.java:1079) at org.apache.tez.dag.app.dag.impl.DAGImpl.handle(DAGImpl.java:143) at org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher.handle(DAGAppMaster.java:1871) at org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher.handle(DAGAppMaster.java:1862) at org.apache.tez.common.AsyncDispatcher.dispatch(AsyncDispatcher.java:183) at org.apache.tez.common.AsyncDispatcher$1.run(AsyncDispatcher.java:114) at 
java.lang.Thread.run(Thread.java:662) 2015-05-06 23:55:54,423 INFO [Dispatcher thread: Central] app.DAGAppMaster: Cleaning up DAG: name=testRandomFailingInputs, with id=dag_1430956448478_0001_16 2015-05-06 23:55:54,423 INFO [Dispatcher thread: Central] app.DAGAppMaster: Completed cleanup for DAG: name=testRandomFailingInputs, with id=dag_1430956448478_0001_16 2015-05-06 23:55:54,424 INFO [Dispatcher thread: Central] impl.DAGImpl: dag_1430956448478_0001_16 terminating due to internal error 2015-05-06 23:55:54,433 INFO [IPC Server handler 0 on 47432] app.DAGAppMaster: Starting DAG submitted via RPC: testBasicInputFailureWithExit 2015-05-06 23:55:54,455 ERROR [Dispatcher thread: Central] common.AsyncDispatcher: Error in dispatcher thread java.lang.NullPointerException at org.apache.tez.dag.history.recovery.RecoveryService.doFlush(RecoveryService.java:458) at org.apache.tez.dag.history.recovery.RecoveryService.handle(RecoveryService.java:289) at org.apache.tez.dag.history.HistoryEventHandler.handleCriticalEvent(HistoryEventHandler.java:102) at org.apache.tez.dag.app.dag.impl.DAGImpl.logJobHistoryUnsuccesfulEvent(DAGImpl.java:1161) at org.apache.tez.dag.app.dag.impl.DAGImpl.finished(DAGImpl.java:1275) at org.apache.tez.dag.app.dag.impl.DAGImpl.access$2600(DAGImpl.java:144) at org.apache.tez.dag.app.dag.impl.DAGImpl$InternalErrorTransition.transition(DAGImpl.java:2151) at org.apache.tez.dag.app.dag.impl.DAGImpl$InternalErrorTransition.transition(DAGImpl.java:2140) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.tez.state.StateMachineTez.doTransition(StateMachineTez.java:57) at 
org.apache.tez.dag.app.dag.impl.DAGImpl.handle(DAGImpl.java:1079) at org.apache.tez.dag.app.dag.impl.DAGImpl.handle(DAGImpl.java:143) at org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher.handle(DAGAppMaster.java:1871) at org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher.handle(DAGAppMaster.java:1862) at org.apache.tez.common.AsyncDispatcher.dispatch(AsyncDispatcher.java:183) at org.apache.tez.common.AsyncDispatcher$1.run(AsyncDispatcher.java:114) at java.lang.Thread.run(Thread.java:662)
[jira] [Updated] (TEZ-2421) Deadlock in AM because attempt and vertex locking each other out
[ https://issues.apache.org/jira/browse/TEZ-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-2421: Attachment: TEZ-2421.3.patch Patch fixes the Jenkins test failure. Deadlock in AM because attempt and vertex locking each other out Key: TEZ-2421 URL: https://issues.apache.org/jira/browse/TEZ-2421 Project: Apache Tez Issue Type: Bug Reporter: Bikas Saha Assignee: Bikas Saha Priority: Blocker Attachments: TEZ-2421.1.patch, TEZ-2421.2.patch, TEZ-2421.3.patch Ideally locks should be taken one way - either going down or up. Preferably not going up because most such data can be passed in during object construction. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2421) Deadlock in AM because attempt and vertex locking each other out
[ https://issues.apache.org/jira/browse/TEZ-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14533904#comment-14533904 ] Jeff Zhang commented on TEZ-2421: - It causes TestAMRecovery to fail. {code} 2015-05-08 13:35:25,672 INFO [Dispatcher thread: Central] impl.VertexImpl: Source task attempt completed for vertex: vertex_1431063298340_0001_1_01 [v2] attempt: attempt_1431063298340_0001_1_00_00_0 with state: SUCCEEDED vertexState: RUNNING 2015-05-08 13:35:25,672 ERROR [Dispatcher thread: Central] common.AsyncDispatcher: Error in dispatcher thread java.lang.NullPointerException at org.apache.tez.dag.app.dag.impl.TaskAttemptImpl.createRemoteTaskSpec(TaskAttemptImpl.java:461) at org.apache.tez.dag.app.dag.impl.TaskAttemptImpl$ScheduleTaskattemptTransition.transition(TaskAttemptImpl.java:1012) at org.apache.tez.dag.app.dag.impl.TaskAttemptImpl$ScheduleTaskattemptTransition.transition(TaskAttemptImpl.java:1) at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.tez.dag.app.dag.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:673) at org.apache.tez.dag.app.dag.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:1) at org.apache.tez.dag.app.DAGAppMaster$TaskAttemptEventDispatcher.handle(DAGAppMaster.java:1920) at org.apache.tez.dag.app.DAGAppMaster$TaskAttemptEventDispatcher.handle(DAGAppMaster.java:1) at org.apache.tez.common.AsyncDispatcher.dispatch(AsyncDispatcher.java:183) at org.apache.tez.common.AsyncDispatcher$1.run(AsyncDispatcher.java:114) at java.lang.Thread.run(Thread.java:745) {code} Deadlock in AM because attempt and vertex locking each other out Key:
TEZ-2421 URL: https://issues.apache.org/jira/browse/TEZ-2421 Project: Apache Tez Issue Type: Bug Reporter: Bikas Saha Assignee: Bikas Saha Priority: Blocker Attachments: TEZ-2421.1.patch, TEZ-2421.2.patch, TEZ-2421.3.patch Ideally locks should be taken one way - either going down or up. Preferably not going up because most such data can be passed in during object construction. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
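The locking guideline in the description above (take locks in one direction only, and pass data in at construction rather than reaching upward) can be illustrated with a minimal sketch. The class names `VertexSketch` and `AttemptSketch` are hypothetical and are not the Tez `VertexImpl` or `TaskAttemptImpl` classes; the sketch only assumes that an attempt needs some read-only data from its vertex.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical child object: everything it needs from its parent is handed
// over at construction, so it never calls "up" into the vertex.
final class AttemptSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private final String vertexName; // immutable copy of parent data
    private String state = "NEW";

    AttemptSketch(String vertexName) {
        this.vertexName = vertexName;
    }

    String describe() {
        lock.lock();
        try {
            // Only the attempt lock is held here; no vertex lock is required.
            return vertexName + ":" + state;
        } finally {
            lock.unlock();
        }
    }

    void setState(String newState) {
        lock.lock();
        try {
            state = newState;
        } finally {
            lock.unlock();
        }
    }
}

// Hypothetical parent object: it takes its own lock and then the attempt's
// lock (downward). The reverse order never occurs in this sketch.
final class VertexSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private final AttemptSketch attempt;

    VertexSketch(String name) {
        this.attempt = new AttemptSketch(name);
    }

    AttemptSketch attempt() {
        return attempt;
    }

    String killAttempt() {
        lock.lock();
        try {
            attempt.setState("KILLED"); // vertex lock, then attempt lock
            return attempt.describe();
        } finally {
            lock.unlock();
        }
    }
}
```

Because `AttemptSketch` holds its own copy of the vertex name, no code path acquires the attempt lock first and the vertex lock second, which is the ordering that would allow the two objects to lock each other out.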
[jira] [Updated] (TEZ-1526) LoadingCache for TezTaskID slow for large jobs
[ https://issues.apache.org/jira/browse/TEZ-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Eagles updated TEZ-1526: - Attachment: TEZ-1526.3.patch LoadingCache for TezTaskID slow for large jobs -- Key: TEZ-1526 URL: https://issues.apache.org/jira/browse/TEZ-1526 Project: Apache Tez Issue Type: Improvement Reporter: Jonathan Eagles Assignee: Jonathan Eagles Labels: performance Attachments: 10-TezTaskIDs.patch, TEZ-1526-v1.patch, TEZ-1526-v2.patch, TEZ-1526.3.patch Using the LoadingCache with default builder settings, 100,000 TezTaskIDs are created in 10 seconds on my setup. With a LoadingCache initialCapacity of 10,000 they are created in 300 ms. With no LoadingCache, they are created in 10 ms. A test case is attached to illustrate the condition I would like to see sped up. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2426) Task input not complete before sending Task completed event
[ https://issues.apache.org/jira/browse/TEZ-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14533404#comment-14533404 ] Siddharth Seth commented on TEZ-2426: -
Alright. Have a theory on what's happening. Lots of threads involved. This ignores the LOG lines showing up in the wrong log files (assuming the logger doesn't guarantee ordering when logging from different threads).
- TaskEventRouter for 456 sees an error. (This can happen because of clean up / some fields not being volatile in inputContext).
- TaskEventRouter is swapped out.
- Task completes, sends out its success message (heartbeat).
- TaskEventRouter thread regains control - tries sending out the TaskFailed message. (This is all before the next task has started. It may or may not have got an interrupt by this point).
- Main thread falls off. Starts running another task. This thread can heartbeat since it doesn't synchronize with the previous task's heartbeats.
- The TaskEventRouter for 465 regains control. Goes into the IPC layer and tries sending the FAILED message (via a future). There's a context switch before the future.get(). The future runs. future.get() is interrupted, because the thread has seen its interrupt status by this point. Leads to the various errors in the logs.
This doesn't, however, explain a status_update after the failed message is sent. Don't really see what can cause that.
Couple of things which need fixing here:
1) Join on the TaskEventRouter
2) Join on the last task's heartbeat thread
3) Fixes to *Context to revert fields back to final, or volatile
4) Avoid sending any more messages once any one final message has been sent.
Task input not complete before sending Task completed event --- Key: TEZ-2426 URL: https://issues.apache.org/jira/browse/TEZ-2426 Project: Apache Tez Issue Type: Bug Reporter: Bikas Saha Priority: Critical Attachments: am.log, container.log
Sequence of events:
1) Task A starts in a container
2) Task A complete event comes to the AM
3) Task B starts in the same container
4) Task A's input calls some method on its context. Crashes with NPE
5) The crash sends an input failed event for Task A to the AM
6) Task A's state machine crashes, saying it cannot handle failed after success
In some cases, a status update event may also be sent after completion, though it is not clear whether that is related to the failed event being sent.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
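The fourth fix proposed in the comment above (avoid sending any more messages once one final message has been sent) could look roughly like the following. This is a minimal sketch under stated assumptions: `FinalMessageGuard` and its methods are hypothetical names, not the Tez task reporter API, and messages are recorded in a list rather than sent over IPC.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical reporter that refuses every message after the first final one.
final class FinalMessageGuard {
    private final List<String> sent = new ArrayList<>();
    private boolean finalSent = false;

    // Non-final messages (for example status updates) are dropped once a
    // final message has gone out. Returns true if the message was accepted.
    synchronized boolean sendStatusUpdate(String message) {
        if (finalSent) {
            return false;
        }
        sent.add(message);
        return true;
    }

    // Only the first final message (success or failure) is accepted.
    synchronized boolean sendFinal(String message) {
        if (finalSent) {
            return false;
        }
        finalSent = true;
        sent.add(message);
        return true;
    }

    // Snapshot of what was actually accepted, in order.
    synchronized List<String> sentMessages() {
        return new ArrayList<>(sent);
    }
}
```

Both methods share one monitor, so the check and the send happen atomically. With this guard, a late failure report from a router thread that regains control after the success message would be rejected instead of reaching the AM.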
[jira] [Commented] (TEZ-2426) Task input not complete before sending Task completed event
[ https://issues.apache.org/jira/browse/TEZ-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14533406#comment-14533406 ] Siddharth Seth commented on TEZ-2426: - Longer term (0.8), it may be worthwhile to rework some of this, along with protocol changes.
Task input not complete before sending Task completed event --- Key: TEZ-2426 URL: https://issues.apache.org/jira/browse/TEZ-2426 Project: Apache Tez Issue Type: Bug Reporter: Bikas Saha Priority: Critical Attachments: am.log, container.log
Sequence of events:
1) Task A starts in a container
2) Task A complete event comes to the AM
3) Task B starts in the same container
4) Task A's input calls some method on its context. Crashes with NPE
5) The crash sends an input failed event for Task A to the AM
6) Task A's state machine crashes, saying it cannot handle failed after success
In some cases, a status update event may also be sent after completion, though it is not clear whether that is related to the failed event being sent.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-776) Reduce AM mem usage caused by storing TezEvents
[ https://issues.apache.org/jira/browse/TEZ-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14533495#comment-14533495 ] TezQA commented on TEZ-776: --- {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12731275/TEZ-776.15.patch against master revision a382324. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 8 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/650//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/650//console This message is automatically generated. 
Reduce AM mem usage caused by storing TezEvents --- Key: TEZ-776 URL: https://issues.apache.org/jira/browse/TEZ-776 Project: Apache Tez Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Bikas Saha Priority: Blocker Attachments: TEZ-776.1.patch, TEZ-776.10.patch, TEZ-776.11.patch, TEZ-776.12.patch, TEZ-776.13.patch, TEZ-776.14.patch, TEZ-776.15.patch, TEZ-776.2.patch, TEZ-776.3.patch, TEZ-776.4.patch, TEZ-776.5.patch, TEZ-776.6.A.patch, TEZ-776.6.B.patch, TEZ-776.7.patch, TEZ-776.8.patch, TEZ-776.9.patch, TEZ-776.ondemand.1.patch, TEZ-776.ondemand.2.patch, TEZ-776.ondemand.3.patch, TEZ-776.ondemand.4.patch, TEZ-776.ondemand.5.patch, TEZ-776.ondemand.6.patch, TEZ-776.ondemand.7.patch, TEZ-776.ondemand.patch, With_Patch_AM_hotspots.png, With_Patch_AM_profile.png, Without_patch_AM_CPU_Usage.png, events-problem-solutions.txt, with_patch_jmc_output_of_AM.png, without_patch_jmc_output_of_AM.png This is open ended at the moment. A fair chunk of the AM heap is taken up by TezEvents (specifically DataMovementEvents - 64 bytes per event). Depending on the connection pattern - this puts limits on the number of tasks that can be processed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
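To make the scale in the description above concrete, here is a rough back-of-the-envelope sketch. The 64 bytes per event is the figure quoted in the issue; everything else is an assumption for illustration: the class name is hypothetical, and the sketch assumes an all-to-all connection pattern in which the AM holds one event per producer/consumer pair. The issue does not state the actual count, which depends on the edge type.

```java
// Hypothetical estimator, not part of Tez.
final class EventMemoryEstimate {
    // Per-event size quoted in the issue description.
    static final long BYTES_PER_EVENT = 64L;

    // Assumption: one stored event per (producer task, consumer task) pair.
    // multiplyExact throws ArithmeticException instead of silently overflowing.
    static long allToAllBytes(long producers, long consumers) {
        long events = Math.multiplyExact(producers, consumers);
        return Math.multiplyExact(events, BYTES_PER_EVENT);
    }
}
```

Under these assumptions, 1,000 producers and 1,000 consumers give 64,000,000 bytes (about 64 MB), while 10,000 by 10,000 gives 6,400,000,000 bytes (about 6.4 GB). That quadratic growth is how the connection pattern can limit the number of tasks.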
[jira] [Commented] (TEZ-2429) Tez AM does not die after hitting internal error
[ https://issues.apache.org/jira/browse/TEZ-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14533587#comment-14533587 ] Bikas Saha commented on TEZ-2429: - Not seeing this on an internal error (due to WIP code) in a cluster {noformat}2015-05-07 16:17:04,876 INFO [main] impl.DAGImpl: Using DAG Scheduler: org.apache.tez.dag.app.dag.impl.DAGSchedulerNaturalOrder 2015-05-07 16:17:04,878 INFO [main] history.HistoryEventHandler: [HISTORY][DAG:dag_1429683757595_0799_1][Event:DAG_INITIALIZED]: dagID=dag_1429683757595_0799_1, initTime=1431040624805 2015-05-07 16:17:04,878 INFO [main] impl.DAGImpl: dag_1429683757595_0799_1 transitioned from NEW to INITED 2015-05-07 16:17:04,884 INFO [Dispatcher thread: Central] history.HistoryEventHandler: [HISTORY][DAG:dag_1429683757595_0799_1][Event:DAG_STARTED]: dagID=dag_1429683757595_0799_1, startTime=1431040624883 2015-05-07 16:17:04,884 INFO [Dispatcher thread: Central] impl.DAGImpl: Added additional resources : [[]] to classpath 2015-05-07 16:17:04,885 INFO [Dispatcher thread: Central] impl.DAGImpl: dag_1429683757595_0799_1 transitioned from INITED to RUNNING 2015-05-07 16:17:04,886 INFO [Dispatcher thread: Central] impl.VertexImpl: Setting vertexManager to ImmediateStartVertexManager for vertex_1429683757595_0799_1_00 [map] 2015-05-07 16:17:04,894 INFO [Dispatcher thread: Central] impl.VertexImpl: Creating 1 tasks for vertex: vertex_1429683757595_0799_1_00 [map] 2015-05-07 16:17:04,907 ERROR [Dispatcher thread: Central] common.AsyncDispatcher: Error in dispatcher thread java.lang.NullPointerException: taskAttemptID is null at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:204) at org.apache.tez.runtime.api.impl.TaskSpec.init(TaskSpec.java:55) at org.apache.tez.dag.app.dag.impl.VertexImpl.createRemoteTaskSpec(VertexImpl.java:2178) at org.apache.tez.dag.app.dag.impl.VertexImpl.createTask(VertexImpl.java:2195) at 
org.apache.tez.dag.app.dag.impl.VertexImpl.createTasks(VertexImpl.java:2200) at org.apache.tez.dag.app.dag.impl.VertexImpl.access$4500(VertexImpl.java:196) at org.apache.tez.dag.app.dag.impl.VertexImpl$InitTransition.handleInitEvent(VertexImpl.java:3207) at org.apache.tez.dag.app.dag.impl.VertexImpl$InitTransition.transition(VertexImpl.java:3129) at org.apache.tez.dag.app.dag.impl.VertexImpl$InitTransition.transition(VertexImpl.java:3110) at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.tez.state.StateMachineTez.doTransition(StateMachineTez.java:57) at org.apache.tez.dag.app.dag.impl.VertexImpl.handle(VertexImpl.java:1748) at org.apache.tez.dag.app.dag.impl.VertexImpl.handle(VertexImpl.java:195) at org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher.handle(DAGAppMaster.java:1938) at org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher.handle(DAGAppMaster.java:1924) at org.apache.tez.common.AsyncDispatcher.dispatch(AsyncDispatcher.java:183) at org.apache.tez.common.AsyncDispatcher$1.run(AsyncDispatcher.java:114) at java.lang.Thread.run(Thread.java:745)
Success: TEZ-776 PreCommit Build #650
Jira: https://issues.apache.org/jira/browse/TEZ-776 Build: https://builds.apache.org/job/PreCommit-TEZ-Build/650/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 2867 lines...] [INFO] Final Memory: 68M/866M [INFO] {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12731275/TEZ-776.15.patch against master revision a382324. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 8 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/650//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/650//console This message is automatically generated. == == Adding comment to Jira. == == Comment added. 70902dc5884b5c972a00f134c1a87ddcaafec793 logged out == == Finished build. == == Archiving artifacts Sending artifact delta relative to PreCommit-TEZ-Build #649 Archived 44 artifacts Archive block size is 32768 Received 4 blocks and 2646352 bytes Compression is 4.7% Took 1.5 sec Description set: TEZ-776 Recording test results Email was triggered for: Success Sending email for trigger: Success ### ## FAILED TESTS (if any) ## All tests passed
[jira] [Commented] (TEZ-2428) Investigate ignored event: DAGAppMaster: ignore event when DAGAppMaster is in the state of STOPPED, eventType=NEW_DAG_SUBMITTED
[ https://issues.apache.org/jira/browse/TEZ-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14532910#comment-14532910 ] Siddharth Seth commented on TEZ-2428: - [~hitesh] - a little more context please. Couldn't figure out what this means from the linked build. Investigate ignored event: DAGAppMaster: ignore event when DAGAppMaster is in the state of STOPPED, eventType=NEW_DAG_SUBMITTED --- Key: TEZ-2428 URL: https://issues.apache.org/jira/browse/TEZ-2428 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah \cc [~sseth] From https://builds.apache.org/job/Tez-Build/1055/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2221) VertexGroup name should be unqiue
[ https://issues.apache.org/jira/browse/TEZ-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14532993#comment-14532993 ] Rohini Palaniswamy commented on TEZ-2221: - https://issues.apache.org/jira/secure/attachment/12730678/TEZ-2221-5-revert.patch looks good. +1 for that patch. VertexGroup name should be unqiue - Key: TEZ-2221 URL: https://issues.apache.org/jira/browse/TEZ-2221 Project: Apache Tez Issue Type: Bug Reporter: Jeff Zhang Assignee: Jeff Zhang Fix For: 0.7.0, 0.5.4, 0.6.1 Attachments: TEZ-2221-1.patch, TEZ-2221-2.patch, TEZ-2221-3.patch, TEZ-2221-4.patch, TEZ-2221-5-revert.patch VertexGroupCommitStartedEvent and VertexGroupCommitFinishedEvent use the vertex group name to identify the vertex group commit, so vertex groups with the same name will conflict. The current equals and hashCode of VertexGroup, by contrast, use both the vertex group name and the member names. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Failed: TEZ-2404 PreCommit Build #648
Jira: https://issues.apache.org/jira/browse/TEZ-2404 Build: https://builds.apache.org/job/PreCommit-TEZ-Build/648/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 2638 lines...] {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12731076/TEZ-2404-3.patch against master revision 02870f0. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in : org.apache.tez.test.TestFaultTolerance Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/648//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/648//console This message is automatically generated. == == Adding comment to Jira. == == Comment added. 0be151e4b9c7abb835891328d7cfd36825324a32 logged out == == Finished build. == == Build step 'Execute shell' marked build as failure Archiving artifacts Sending artifact delta relative to PreCommit-TEZ-Build #646 Archived 44 artifacts Archive block size is 32768 Received 4 blocks and 2628496 bytes Compression is 4.7% Took 1.6 sec [description-setter] Could not determine description. Recording test results Email was triggered for: Failure Sending email for trigger: Failure ### ## FAILED TESTS (if any) ## 6 tests failed. 
FAILED: org.apache.tez.test.TestFaultTolerance.testBasicInputFailureWithExit Error Message: expected:SUCCEEDED but was:FAILED Stack Trace: java.lang.AssertionError: expected:SUCCEEDED but was:FAILED at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:144) at org.apache.tez.test.TestFaultTolerance.runDAGAndVerify(TestFaultTolerance.java:135) at org.apache.tez.test.TestFaultTolerance.runDAGAndVerify(TestFaultTolerance.java:114) at org.apache.tez.test.TestFaultTolerance.testBasicInputFailureWithExit(TestFaultTolerance.java:248) FAILED: org.apache.tez.test.TestFaultTolerance.testInputFailureRerunCanSendOutputToTwoDownstreamVertices Error Message: TezSession has already shutdown. No cluster diagnostics found. Stack Trace: org.apache.tez.dag.api.SessionNotRunning: TezSession has already shutdown. No cluster diagnostics found. at org.apache.tez.client.TezClient.waitTillReady(TezClient.java:678) at org.apache.tez.test.TestFaultTolerance.runDAGAndVerify(TestFaultTolerance.java:118) at org.apache.tez.test.TestFaultTolerance.runDAGAndVerify(TestFaultTolerance.java:114) at org.apache.tez.test.TestFaultTolerance.testInputFailureRerunCanSendOutputToTwoDownstreamVertices(TestFaultTolerance.java:672) FAILED: org.apache.tez.test.TestFaultTolerance.testMultipleInputFailureWithoutExit Error Message: TezSession has already shutdown. No cluster diagnostics found. Stack Trace: org.apache.tez.dag.api.SessionNotRunning: TezSession has already shutdown. No cluster diagnostics found. 
at org.apache.tez.client.TezClient.waitTillReady(TezClient.java:678) at org.apache.tez.test.TestFaultTolerance.runDAGAndVerify(TestFaultTolerance.java:118) at org.apache.tez.test.TestFaultTolerance.runDAGAndVerify(TestFaultTolerance.java:114) at org.apache.tez.test.TestFaultTolerance.testMultipleInputFailureWithoutExit(TestFaultTolerance.java:297) FAILED: org.apache.tez.test.TestFaultTolerance.testCascadingInputFailureWithExitSuccess Error Message: TezSession has already
[jira] [Commented] (TEZ-2404) Handle DataMovementEvent before its TaskAttemptCompletedEvent
[ https://issues.apache.org/jira/browse/TEZ-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14532076#comment-14532076 ] TezQA commented on TEZ-2404: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12731076/TEZ-2404-3.patch against master revision 02870f0. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in : org.apache.tez.test.TestFaultTolerance Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/648//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/648//console This message is automatically generated. Handle DataMovementEvent before its TaskAttemptCompletedEvent - Key: TEZ-2404 URL: https://issues.apache.org/jira/browse/TEZ-2404 Project: Apache Tez Issue Type: Bug Reporter: Jeff Zhang Assignee: Jeff Zhang Priority: Critical Attachments: TEZ-2404-1.patch, TEZ-2404-2.patch, TEZ-2404-3.patch TEZ-2325 routes TASK_ATTEMPT_COMPLETED_EVENT directly to the attempt, but this causes a recovery issue. Recovery requires that the DataMovementEvent is handled before the TaskAttemptCompletedEvent; otherwise the DataMovementEvent may be lost during recovery and cause its dependent tasks to hang. There are two ways to fix this issue: 1. Still route TaskAttemptCompletedEvent in Vertex. 2. Route DataMovementEvent before TaskAttemptCompletedEvent in TezTaskAttemptListener. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
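The second option in the description above (routing data movement events ahead of the completion event) can be sketched as a stable reordering of one batch of events. This is only an illustration under stated assumptions: `EventOrderingSketch` and its `Kind` enum are hypothetical stand-ins, not the Tez event classes or the actual TezTaskAttemptListener logic.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that moves completion events to the end of a batch.
final class EventOrderingSketch {
    // Simplified stand-ins for the event types named in the issue.
    enum Kind { DATA_MOVEMENT, STATUS_UPDATE, TASK_ATTEMPT_COMPLETED }

    // Returns a new list in which every TASK_ATTEMPT_COMPLETED event comes
    // after all other events. Relative order within each group is preserved.
    static List<Kind> completionLast(List<Kind> batch) {
        List<Kind> ordered = new ArrayList<>(batch.size());
        List<Kind> completions = new ArrayList<>();
        for (Kind kind : batch) {
            if (kind == Kind.TASK_ATTEMPT_COMPLETED) {
                completions.add(kind);
            } else {
                ordered.add(kind);
            }
        }
        ordered.addAll(completions);
        return ordered;
    }
}
```

If the handler processes the returned list in order, the data movement event is always seen (and can be logged for recovery) before the completion event that depends on it.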
[jira] [Updated] (TEZ-2416) Tez UI: Make tooltips display faster.
[ https://issues.apache.org/jira/browse/TEZ-2416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prakash Ramachandran updated TEZ-2416: -- Summary: Tez UI: Make tooltips display faster. (was: TEZ-UI: Make tooltips display faster.) Tez UI: Make tooltips display faster. - Key: TEZ-2416 URL: https://issues.apache.org/jira/browse/TEZ-2416 Project: Apache Tez Issue Type: Bug Reporter: Sreenath Somarajapuram Assignee: Sreenath Somarajapuram Attachments: TEZ-2416.1.patch, TEZ-2416.2.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2423) Tez UI: Remove Attempt Index column from task-attempts page
[ https://issues.apache.org/jira/browse/TEZ-2423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14532532#comment-14532532 ] Prakash Ramachandran commented on TEZ-2423: --- +1, committing. Tez UI: Remove Attempt Index column from task-attempts page Key: TEZ-2423 URL: https://issues.apache.org/jira/browse/TEZ-2423 Project: Apache Tez Issue Type: Bug Reporter: Sreenath Somarajapuram Assignee: Sreenath Somarajapuram Fix For: 0.7.0, 0.8.0 Attachments: TEZ-2423.1.patch Attempt Index and Attempt No serve the same purpose. -- This message was sent by Atlassian JIRA (v6.3.4#6332)