[jira] [Commented] (TEZ-4066) Upgrade servlet-api from 2.5 to 3.1.0
[ https://issues.apache.org/jira/browse/TEZ-4066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834205#comment-16834205 ]

Kuhu Shukla commented on TEZ-4066:
----------------------------------

Looks good with a minor nit: can we change the LICENSE files that document servlet-api.jar to the new name of the jar? Let me know if you have any thoughts on this, [~jeagles]. Thanks a lot for the report and the patch.

> Upgrade servlet-api from 2.5 to 3.1.0
> -------------------------------------
>
> Key: TEZ-4066
> URL: https://issues.apache.org/jira/browse/TEZ-4066
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Jonathan Eagles
> Assignee: Jonathan Eagles
> Priority: Major
> Attachments: TEZ-4066.001.patch
>
>
> Oozie launcher jobs trying to launch Tez jobs now fail to render the Oozie
> Launcher Job AM because both servlet-api 2.5 (from tez) and 3.1.0 (from hadoop)
> are on the classpath. Tez should sync with the servlet-api version from the
> tez master branch, which only supports hadoop 3+.
> {code}
> 2019-04-30 14:53:02,747 WARN [qtp1213419524-119] org.eclipse.jetty.server.HttpChannel:
> java.lang.NoSuchMethodError: javax.servlet.http.HttpServletRequest.isAsyncStarted()Z
>     at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:688)
>     at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:224)
>     at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
>     at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
>     at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
>     at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
>     at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>     at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
>     at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
>     at org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493)
>     at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
>     at org.eclipse.jetty.server.Server.handle(Server.java:534)
>     at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:333)
>     at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
>     at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
>     at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)
>     at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
>     at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
>     at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
>     at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
>     at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
>     at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
>     at java.lang.Thread.run(Thread.java:748)
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
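A NoSuchMethodError like the one in TEZ-4066 usually means an older copy of a class was loaded ahead of the newer one. isAsyncStarted() was introduced with Servlet 3.0, so a 2.5 jar cannot provide it. The following is only a minimal diagnostic sketch, not part of the patch: the class name ServletApiCheck and its helper methods are hypothetical, and it assumes nothing beyond the JDK. It reports which jar a class came from and whether the method is visible.

```java
import java.security.CodeSource;

/** Hypothetical helper: reports where a class was loaded from and whether it exposes a no-arg method. */
final class ServletApiCheck {

    /** Returns the jar or directory a class was loaded from, or "unknown" if the JVM does not expose it. */
    static String locationOf(Class<?> cls) {
        CodeSource src = cls.getProtectionDomain().getCodeSource();
        return (src == null || src.getLocation() == null) ? "unknown" : src.getLocation().toString();
    }

    /** True if the class (or a supertype) exposes a public no-argument method with this name. */
    static boolean hasNoArgMethod(Class<?> cls, String name) {
        try {
            cls.getMethod(name);
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    /** Builds a one-line report for the given class and method names. */
    static String describe(String className, String methodName) {
        try {
            Class<?> cls = Class.forName(className);
            return className + " loaded from " + locationOf(cls)
                    + "; " + methodName + "() present: " + hasNoArgMethod(cls, methodName);
        } catch (ClassNotFoundException e) {
            return className + " is not on the classpath";
        }
    }

    public static void main(String[] args) {
        // Run with the same classpath as the failing process to see which servlet-api jar wins.
        System.out.println(describe("javax.servlet.http.HttpServletRequest", "isAsyncStarted"));
    }
}
```

If the reported location is a servlet-api 2.5 jar and the method is reported as absent, that would be consistent with the classpath conflict described in the issue.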
[jira] [Resolved] (TEZ-4058) Changes for 0.9.2 release
[ https://issues.apache.org/jira/browse/TEZ-4058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kuhu Shukla resolved TEZ-4058.
------------------------------
    Resolution: Fixed
    Fix Version/s: 0.10.1

> Changes for 0.9.2 release
> -------------------------
>
> Key: TEZ-4058
> URL: https://issues.apache.org/jira/browse/TEZ-4058
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Kuhu Shukla
> Assignee: Kuhu Shukla
> Priority: Major
> Fix For: 0.10.1
> Attachments: TEZ-4058.001.patch
>
>
> Update Tez_DOAP.rdf, index.md etc. for 0.9.2 release.
[jira] [Commented] (TEZ-4058) Changes for 0.9.2 release
[ https://issues.apache.org/jira/browse/TEZ-4058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16805569#comment-16805569 ]

Kuhu Shukla commented on TEZ-4058:
----------------------------------

Committing to master.

> Changes for 0.9.2 release
> -------------------------
>
> Key: TEZ-4058
> URL: https://issues.apache.org/jira/browse/TEZ-4058
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Kuhu Shukla
> Assignee: Kuhu Shukla
> Priority: Major
> Attachments: TEZ-4058.001.patch
>
>
> Update Tez_DOAP.rdf, index.md etc. for 0.9.2 release.
[jira] [Updated] (TEZ-4058) Changes for 0.9.2 release
[ https://issues.apache.org/jira/browse/TEZ-4058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kuhu Shukla updated TEZ-4058:
-----------------------------
    Attachment: TEZ-4058.001.patch

> Changes for 0.9.2 release
> -------------------------
>
> Key: TEZ-4058
> URL: https://issues.apache.org/jira/browse/TEZ-4058
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Kuhu Shukla
> Assignee: Kuhu Shukla
> Priority: Major
> Attachments: TEZ-4058.001.patch
>
>
> Update Tez_DOAP.rdf, index.md etc. for 0.9.2 release.
[jira] [Created] (TEZ-4058) Changes for 0.9.2 release
Kuhu Shukla created TEZ-4058:
--------------------------------

    Summary: Changes for 0.9.2 release
    Key: TEZ-4058
    URL: https://issues.apache.org/jira/browse/TEZ-4058
    Project: Apache Tez
    Issue Type: Bug
    Reporter: Kuhu Shukla
    Assignee: Kuhu Shukla

Update Tez_DOAP.rdf, index.md etc. for 0.9.2 release.
[jira] [Commented] (TEZ-4031) Support tez gitbox migration
[ https://issues.apache.org/jira/browse/TEZ-4031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16796471#comment-16796471 ]

Kuhu Shukla commented on TEZ-4031:
----------------------------------

I made a mistake while committing the patch for this. I will revert the older version of the patch that went in and commit v4 instead. My +1 is/was for v4. Sorry about this.

> Support tez gitbox migration
> ----------------------------
>
> Key: TEZ-4031
> URL: https://issues.apache.org/jira/browse/TEZ-4031
> Project: Apache Tez
> Issue Type: Task
> Reporter: Jonathan Eagles
> Assignee: Jonathan Eagles
> Priority: Major
> Fix For: 0.9.2, 0.10.0
> Attachments: TEZ-4031.001.patch, TEZ-4031.002.patch, TEZ-4031.003.patch, TEZ-4031.004.patch
>
>
> {code}
> $ git grep git-wip
> Tez_DOAP.rdf: rdf:resource="https://git-wip-us.apache.org/repos/asf/tez.git"/>
> Tez_DOAP.rdf: rdf:resource="https://git-wip-us.apache.org/repos/asf?p=tez.git"/>
> docs/src/site/site.xml: href="https://git-wip-us.apache.org/repos/asf/tez.git" alt="Use git clone https://git-wip-us.apache.org/repos/asf/tez.git" />
> pom.xml: scm:git:https://git-wip-us.apache.org/repos/asf/tez.git
> tez-api/src/test/java/org/apache/tez/common/TestVersionInfo.java: final String scmUrl = "scm:git:https://git-wip-us.apache.org/repos/asf/tez.git";
> tez-api/src/test/resources/test1-version-info.properties:scmurl=scm:git:https://git-wip-us.apache.org/repos/asf/tez.git
> tez-api/src/test/resources/test3-version-info.properties:scmurl=scm:git:https://git-wip-us.apache.org/repos/asf/tez.git
> tez-ui/src/main/webapp/package.json:"url": "https://git-wip-us.apache.org/repos/asf/tez.git"
> {code}
> In addition the cwiki needs to be updated
> https://cwiki.apache.org/confluence/display/TEZ/Committer+Guide
> https://cwiki.apache.org/confluence/display/TEZ/How+to+Contribute+to+Tez
> https://cwiki.apache.org/confluence/display/TEZ/Making+a+TEZ+Release
> https://cwiki.apache.org/confluence/display/TEZ/Updating+the+Tez+Website
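The `git grep git-wip` output in TEZ-4031 lists the places that still point at the old host. As a rough illustration only, and not the content of the attached patches, the rewrite can be pictured as a host substitution. The class name GitboxUrlRewrite is hypothetical, and the sketch assumes the repository path stays the same under gitbox.apache.org.

```java
/** Hypothetical sketch: swaps the old git-wip-us host for the gitbox host in a line of text. */
final class GitboxUrlRewrite {
    static final String OLD_HOST = "git-wip-us.apache.org";
    // Assumption: the repository keeps the same /repos/asf/... path on the new host.
    static final String NEW_HOST = "gitbox.apache.org";

    /** Returns the line with every occurrence of the old host replaced; other text is untouched. */
    static String rewrite(String line) {
        return line.replace(OLD_HOST, NEW_HOST);
    }

    public static void main(String[] args) {
        String before = "scmurl=scm:git:https://git-wip-us.apache.org/repos/asf/tez.git";
        System.out.println(rewrite(before));
    }
}
```

A plain string replacement is enough here because every match in the grep output uses the same host name; files that embed the URL in expected test values would need the same change on both sides.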
[jira] [Resolved] (TEZ-4056) Update version for 0.9.2 release
[ https://issues.apache.org/jira/browse/TEZ-4056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kuhu Shukla resolved TEZ-4056.
------------------------------
    Resolution: Fixed

Committed to 0.9.2
Thanks [~jeagles]

> Update version for 0.9.2 release
> --------------------------------
>
> Key: TEZ-4056
> URL: https://issues.apache.org/jira/browse/TEZ-4056
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Kuhu Shukla
> Assignee: Kuhu Shukla
> Priority: Major
> Fix For: 0.9.2
> Attachments: TEZ-4056.001.patch
>
>
> Tracks the release process sub section:
> {code:java}
> mvn versions:set -DnewVersion="x.y.z"
> {code}
> CC: [~jeagles]
[jira] [Created] (TEZ-4056) Update version for 0.9.2 release
Kuhu Shukla created TEZ-4056:
--------------------------------

    Summary: Update version for 0.9.2 release
    Key: TEZ-4056
    URL: https://issues.apache.org/jira/browse/TEZ-4056
    Project: Apache Tez
    Issue Type: Bug
    Reporter: Kuhu Shukla
    Assignee: Kuhu Shukla
    Fix For: 0.9.2

Tracks the release process sub section:
{code:java}
mvn versions:set -DnewVersion="x.y.z"
{code}
CC: [~jeagles]
[jira] [Updated] (TEZ-4056) Update version for 0.9.2 release
[ https://issues.apache.org/jira/browse/TEZ-4056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kuhu Shukla updated TEZ-4056:
-----------------------------
    Attachment: TEZ-4056.001.patch

> Update version for 0.9.2 release
> --------------------------------
>
> Key: TEZ-4056
> URL: https://issues.apache.org/jira/browse/TEZ-4056
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Kuhu Shukla
> Assignee: Kuhu Shukla
> Priority: Major
> Fix For: 0.9.2
> Attachments: TEZ-4056.001.patch
>
>
> Tracks the release process sub section:
> {code:java}
> mvn versions:set -DnewVersion="x.y.z"
> {code}
> CC: [~jeagles]
[jira] [Updated] (TEZ-3992) Update commons-codec from 1.4 to 1.11
[ https://issues.apache.org/jira/browse/TEZ-3992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kuhu Shukla updated TEZ-3992:
-----------------------------
    Fix Version/s: (was: 0.9.next)

> Update commons-codec from 1.4 to 1.11
> -------------------------------------
>
> Key: TEZ-3992
> URL: https://issues.apache.org/jira/browse/TEZ-3992
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Laszlo Bodor
> Priority: Major
> Attachments: TEZ-3992.01.patch
>
>
> Commons codec 1.4 is from 2009, maybe we should try an update.
[jira] [Updated] (TEZ-3994) Upgrade maven-surefire-plugin to 0.21.0 to support yetus
[ https://issues.apache.org/jira/browse/TEZ-3994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kuhu Shukla updated TEZ-3994:
-----------------------------
    Fix Version/s: 0.9.2

> Upgrade maven-surefire-plugin to 0.21.0 to support yetus
> --------------------------------------------------------
>
> Key: TEZ-3994
> URL: https://issues.apache.org/jira/browse/TEZ-3994
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Jonathan Eagles
> Assignee: Jonathan Eagles
> Priority: Major
> Fix For: 0.9.next, 0.9.2, 0.10.1
> Attachments: TEZ-3994.001.patch
>
>
[jira] [Updated] (TEZ-3992) Update commons-codec from 1.4 to 1.11
[ https://issues.apache.org/jira/browse/TEZ-3992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kuhu Shukla updated TEZ-3992:
-----------------------------
    Fix Version/s: 0.10.1

> Update commons-codec from 1.4 to 1.11
> -------------------------------------
>
> Key: TEZ-3992
> URL: https://issues.apache.org/jira/browse/TEZ-3992
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Laszlo Bodor
> Priority: Major
> Fix For: 0.10.1
> Attachments: TEZ-3992.01.patch
>
>
> Commons codec 1.4 is from 2009, maybe we should try an update.
[jira] [Updated] (TEZ-3960) Better error handling in proto history logger and add doAs support.
[ https://issues.apache.org/jira/browse/TEZ-3960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kuhu Shukla updated TEZ-3960:
-----------------------------
    Fix Version/s: 0.9.2

> Better error handling in proto history logger and add doAs support.
> -------------------------------------------------------------------
>
> Key: TEZ-3960
> URL: https://issues.apache.org/jira/browse/TEZ-3960
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Harish Jaiprakash
> Assignee: Harish Jaiprakash
> Priority: Major
> Fix For: 0.9.next, 0.9.2, 0.10.0
> Attachments: TEZ-3960.01.patch, TEZ-3960.02.patch
>
>
> DagManifestScanner gets stuck on a day's logs if there are errors in them.
> Fix it using a fixed number of retries.
> The scanner should be able to use doAs to ensure it can read files if run
> using a proxy admin user.
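The "fixed number of retries" idea in TEZ-3960 can be pictured with a small generic helper. This is only an illustrative sketch under stated assumptions: BoundedRetry and tryCall are hypothetical names, the code is not taken from DagManifestScanner or the attached patches, and it simply gives up after a set number of failed attempts so that a caller can move on instead of retrying forever.

```java
import java.util.Optional;
import java.util.concurrent.Callable;

/** Hypothetical helper: retries a task a fixed number of times, then gives up. */
final class BoundedRetry {

    /**
     * Runs the task until it succeeds or maxAttempts is reached.
     * Returns the result, or Optional.empty() if every attempt failed (or the task returned null).
     */
    static <T> Optional<T> tryCall(Callable<T> task, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return Optional.ofNullable(task.call());
            } catch (Exception e) {
                // Swallow the failure and try again until the attempt budget is used up.
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // A task that fails twice and succeeds on the third call.
        Optional<String> result = tryCall(() -> {
            if (++calls[0] < 3) {
                throw new java.io.IOException("transient read error");
            }
            return "ok";
        }, 5);
        System.out.println(result.orElse("gave up") + " after " + calls[0] + " calls");
    }
}
```

Returning an empty Optional rather than rethrowing lets the caller skip the problematic input and continue; whether the real fix skips, logs, or fails in that case is not stated in the issue.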
[jira] [Updated] (TEZ-3915) Create protobuf based history event logger.
[ https://issues.apache.org/jira/browse/TEZ-3915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kuhu Shukla updated TEZ-3915:
-----------------------------
    Fix Version/s: 0.9.2

> Create protobuf based history event logger.
> -------------------------------------------
>
> Key: TEZ-3915
> URL: https://issues.apache.org/jira/browse/TEZ-3915
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Harish Jaiprakash
> Assignee: Harish Jaiprakash
> Priority: Major
> Fix For: 0.9.next, 0.9.2
> Attachments: TEZ-3915.01.patch, TEZ-3915.02.patch, TEZ-3915.03.patch, TEZ-3915.04.patch, TEZ-3915.05.patch, TEZ-3915.06.patch, TEZ-3915.07.patch
>
>
> A protobuf based history event logger, to log directly into hdfs. Implement a
> reader api also, to get the events from the files.
[jira] [Updated] (TEZ-4002) CHANGES.txt for 0.9.2 Release
[ https://issues.apache.org/jira/browse/TEZ-4002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kuhu Shukla updated TEZ-4002:
-----------------------------
    Fix Version/s: 0.9.2

> CHANGES.txt for 0.9.2 Release
> -----------------------------
>
> Key: TEZ-4002
> URL: https://issues.apache.org/jira/browse/TEZ-4002
> Project: Apache Tez
> Issue Type: Task
> Reporter: Kuhu Shukla
> Assignee: Kuhu Shukla
> Priority: Major
> Fix For: 0.9.2
> Attachments: TEZ-4002.001.patch, TEZ-4002.002.patch, TEZ-4002.003.patch
>
>
> Add CHANGES.txt for 0.9.2 line.
[jira] [Commented] (TEZ-4002) CHANGES.txt for 0.9.2 Release
[ https://issues.apache.org/jira/browse/TEZ-4002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16795143#comment-16795143 ]

Kuhu Shukla commented on TEZ-4002:
----------------------------------

Updated CHANGES.txt.
Steps:
1.
{code}
git log rel/release-0.9.1..origin/branch-0.9 --oneline | cut -d ' ' -f2- > temp.CHANGES.txt
{code}
2. Take CHANGES.txt from the old release (in this case 0.9.1).
3. Add the contents of temp.CHANGES.txt to CHANGES.txt.

Will update the documentation if this sounds right. [~jeagles], request for your feedback. Thank you!

> CHANGES.txt for 0.9.2 Release
> -----------------------------
>
> Key: TEZ-4002
> URL: https://issues.apache.org/jira/browse/TEZ-4002
> Project: Apache Tez
> Issue Type: Task
> Reporter: Kuhu Shukla
> Assignee: Kuhu Shukla
> Priority: Major
> Attachments: TEZ-4002.001.patch, TEZ-4002.002.patch, TEZ-4002.003.patch
>
>
> Add CHANGES.txt for 0.9.2 line.
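For readers unfamiliar with the pipeline in step 1, `git log --oneline` prints an abbreviated commit hash followed by the subject, and `cut -d ' ' -f2-` drops that first space-delimited field. The same transformation can be sketched in Java as below. The class name ChangesLines and the sample commit lines are hypothetical and used only for illustration; this is not a replacement for the documented release steps.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical sketch of what `cut -d ' ' -f2-` does to `git log --oneline` output. */
final class ChangesLines {

    /** Drops the first space-delimited field; a line without a space is returned unchanged, as cut does. */
    static String dropFirstField(String line) {
        int firstSpace = line.indexOf(' ');
        return firstSpace < 0 ? line : line.substring(firstSpace + 1);
    }

    /** Applies dropFirstField to every line, preserving order. */
    static List<String> toChangesEntries(List<String> onelineLog) {
        return onelineLog.stream().map(ChangesLines::dropFirstField).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up sample lines in the shape of `git log --oneline` output.
        List<String> log = Arrays.asList(
                "abc1234 TEZ-0000. Example change one (Some Author)",
                "def5678 TEZ-0001. Example change two (Some Author)");
        toChangesEntries(log).forEach(System.out::println);
    }
}
```

Each resulting line is the commit subject alone, which is the form that gets added to CHANGES.txt in step 3.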
[jira] [Updated] (TEZ-4002) CHANGES.txt for 0.9.2 Release
[ https://issues.apache.org/jira/browse/TEZ-4002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kuhu Shukla updated TEZ-4002:
-----------------------------
    Attachment: TEZ-4002.003.patch

> CHANGES.txt for 0.9.2 Release
> -----------------------------
>
> Key: TEZ-4002
> URL: https://issues.apache.org/jira/browse/TEZ-4002
> Project: Apache Tez
> Issue Type: Task
> Reporter: Kuhu Shukla
> Assignee: Kuhu Shukla
> Priority: Major
> Attachments: TEZ-4002.001.patch, TEZ-4002.002.patch, TEZ-4002.003.patch
>
>
> Add CHANGES.txt for 0.9.2 line.
[jira] [Commented] (TEZ-4052) Fit dot files ASF License issues - part 2
[ https://issues.apache.org/jira/browse/TEZ-4052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16793007#comment-16793007 ]

Kuhu Shukla commented on TEZ-4052:
----------------------------------

+1. lgtm. Committing to master and branch-0.9.

> Fit dot files ASF License issues - part 2
> -----------------------------------------
>
> Key: TEZ-4052
> URL: https://issues.apache.org/jira/browse/TEZ-4052
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Jonathan Eagles
> Assignee: Jonathan Eagles
> Priority: Major
> Attachments: TEZ-4052.001.patch
>
>
> Continuing the effort in TEZ-3995.
> https://issues.apache.org/jira/browse/TEZ-3995?focusedCommentId=16784595=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16784595
> {code}
> 1) Please extend this to tez-ext-service-tests 2) Also, please consider
> directory tez.log.dir with path ${project.build.directory}/logs.
> {code}
> This jira is to make sure all dot files are correctly placed under the target
> directory, so that 1) files aren't created outside the build directory
> and 2) they are named as part of a broader test directory design.
[jira] [Commented] (TEZ-4052) Fit dot files ASF License issues - part 2
[ https://issues.apache.org/jira/browse/TEZ-4052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16792996#comment-16792996 ]

Kuhu Shukla commented on TEZ-4052:
----------------------------------

Sorry, I needed a rebase. Reviewing the patch now.

> Fit dot files ASF License issues - part 2
> -----------------------------------------
>
> Key: TEZ-4052
> URL: https://issues.apache.org/jira/browse/TEZ-4052
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Jonathan Eagles
> Assignee: Jonathan Eagles
> Priority: Major
> Attachments: TEZ-4052.001.patch
>
>
> Continuing the effort in TEZ-3995.
> https://issues.apache.org/jira/browse/TEZ-3995?focusedCommentId=16784595=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16784595
> {code}
> 1) Please extend this to tez-ext-service-tests 2) Also, please consider
> directory tez.log.dir with path ${project.build.directory}/logs.
> {code}
> This jira is to make sure all dot files are correctly placed under the target
> directory, so that 1) files aren't created outside the build directory
> and 2) they are named as part of a broader test directory design.
[jira] [Commented] (TEZ-4052) Fit dot files ASF License issues - part 2
[ https://issues.apache.org/jira/browse/TEZ-4052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16792106#comment-16792106 ]

Kuhu Shukla commented on TEZ-4052:
----------------------------------

I tried applying the patch to master, but it did not apply cleanly and failed. [~jeagles], could you take a look and let me know what I am missing?

> Fit dot files ASF License issues - part 2
> -----------------------------------------
>
> Key: TEZ-4052
> URL: https://issues.apache.org/jira/browse/TEZ-4052
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Jonathan Eagles
> Assignee: Jonathan Eagles
> Priority: Major
> Attachments: TEZ-4052.001.patch
>
>
> Continuing the effort in TEZ-3995.
> https://issues.apache.org/jira/browse/TEZ-3995?focusedCommentId=16784595=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16784595
> {code}
> 1) Please extend this to tez-ext-service-tests 2) Also, please consider
> directory tez.log.dir with path ${project.build.directory}/logs.
> {code}
> This jira is to make sure all dot files are correctly placed under the target
> directory, so that 1) files aren't created outside the build directory
> and 2) they are named as part of a broader test directory design.
[jira] [Created] (TEZ-4053) TestExceptionPropagation.testExceptionPropagationSession fails intermittently
Kuhu Shukla created TEZ-4053:
--------------------------------

    Summary: TestExceptionPropagation.testExceptionPropagationSession fails intermittently
    Key: TEZ-4053
    URL: https://issues.apache.org/jira/browse/TEZ-4053
    Project: Apache Tez
    Issue Type: Bug
    Reporter: Kuhu Shukla
    Assignee: Kuhu Shukla

Irrespective of whether the test case fails, I see the stack trace below, which needs to be looked at:
{code}
2019-03-13 15:12:19,457 [INFO] [Dispatcher thread {Central}] |HistoryEventHandler.criticalEvents|: [HISTORY][DAG:dag_1552507929653_0001_1][Event:DAG_FINISHED]: dagId=dag_1552507929653_0001_1, startTime=1552507939233, finishTime=1552507939419, timeTaken=186, status=FAILED, diagnostics=Vertex failed, vertexName=v1, vertexId=vertex_1552507929653_0001_1_00, diagnostics=[Exception in EdgeManager, vertex:vertex_1552507929653_0001_1_00 [v1],java.lang.RuntimeException: EM_GetNumSourceTaskPhysicalOutputs
    at org.apache.tez.test.TestExceptionPropagation$CustomEdgeManager.getNumSourceTaskPhysicalOutputs(TestExceptionPropagation.java:828)
    at org.apache.tez.dag.app.dag.impl.Edge.getSourceSpec(Edge.java:340)
    at org.apache.tez.dag.app.dag.impl.VertexImpl.getSourceSpecFor(VertexImpl.java:4489)
    at org.apache.tez.dag.app.dag.impl.VertexImpl.getOutputSpecList(VertexImpl.java:4477)
    at org.apache.tez.dag.app.dag.impl.VertexImpl.createRemoteTaskSpec(VertexImpl.java:1647)
    at org.apache.tez.dag.app.dag.impl.VertexImpl.scheduleTasks(VertexImpl.java:1683)
] completed., numCompletedVertices=2, numSuccessfulVertices=0, numFailedVertices=1, numKilledVertices=1, numVertices=2
2019-03-13 15:12:19,418 [INFO] [Dispatcher thread {Central}] |impl.DAGImpl|: Checking vertices for DAG completion, numCompletedVertices=2, numSuccessfulVertices=0, numFailedVertices=1, numKilledVertices=1, numVertices=2, commitInProgress=0, terminationCause=VERTEX_FAILURE
2019-03-13 15:12:19,418 [INFO] [Dispatcher thread {Central}] |impl.DAGImpl|: DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:1
2019-03-13 15:12:19,443 [INFO] [Dispatcher thread {Central}] |recovery.RecoveryService|: DAG completed, dagId=dag_1552507929653_0001_1, queueSize=0
2019-03-13 15:12:19,457 [INFO] [Dispatcher thread {Central}] |HistoryEventHandler.criticalEvents|: [HISTORY][DAG:dag_1552507929653_0001_1][Event:DAG_FINISHED]: dagId=dag_1552507929653_0001_1, startTime=1552507939233, finishTime=1552507939419, timeTaken=186, status=FAILED, diagnostics=Vertex failed, vertexName=v1, vertexId=vertex_1552507929653_0001_1_00, diagnostics=[Exception in EdgeManager, vertex:vertex_1552507929653_0001_1_00 [v1],java.lang.RuntimeException: EM_GetNumSourceTaskPhysicalOutputs
    at org.apache.tez.test.TestExceptionPropagation$CustomEdgeManager.getNumSourceTaskPhysicalOutputs(TestExceptionPropagation.java:828)
    at org.apache.tez.dag.app.dag.impl.Edge.getSourceSpec(Edge.java:340)
    at org.apache.tez.dag.app.dag.impl.VertexImpl.getSourceSpecFor(VertexImpl.java:4489)
    at org.apache.tez.dag.app.dag.impl.VertexImpl.getOutputSpecList(VertexImpl.java:4477)
    at org.apache.tez.dag.app.dag.impl.VertexImpl.createRemoteTaskSpec(VertexImpl.java:1647)
    at org.apache.tez.dag.app.dag.impl.VertexImpl.scheduleTasks(VertexImpl.java:1683)
    at org.apache.tez.dag.app.dag.impl.VertexManager$VertexManagerPluginContextImpl.scheduleTasks(VertexManager.java:221)
    at org.apache.tez.dag.app.dag.impl.RootInputVertexManager.schedulePendingTasks(RootInputVertexManager.java:386)
    at org.apache.tez.dag.app.dag.impl.RootInputVertexManager.processPendingTasks(RootInputVertexManager.java:379)
    at org.apache.tez.dag.app.dag.impl.RootInputVertexManager.onVertexStarted(RootInputVertexManager.java:182)
    at org.apache.tez.test.TestExceptionPropagation$RootInputVertexManagerWithException.onVertexStarted(TestExceptionPropagation.java:714)
    at org.apache.tez.dag.app.dag.impl.VertexManager$VertexManagerEventOnVertexStarted.invoke(VertexManager.java:618)
    at org.apache.tez.dag.app.dag.impl.VertexManager$VertexManagerEvent$1.run(VertexManager.java:687)
    at org.apache.tez.dag.app.dag.impl.VertexManager$VertexManagerEvent$1.run(VertexManager.java:682)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1953)
    at org.apache.tez.dag.app.dag.impl.VertexManager$VertexManagerEvent.call(VertexManager.java:682)
    at org.apache.tez.dag.app.dag.impl.VertexManager$VertexManagerEvent.call(VertexManager.java:671)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at
[jira] [Commented] (TEZ-4031) Support tez gitbox migration
[ https://issues.apache.org/jira/browse/TEZ-4031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16785856#comment-16785856 ]

Kuhu Shukla commented on TEZ-4031:
----------------------------------

+1. lgtm. Committing in a bit.

> Support tez gitbox migration
> ----------------------------
>
> Key: TEZ-4031
> URL: https://issues.apache.org/jira/browse/TEZ-4031
> Project: Apache Tez
> Issue Type: Task
> Reporter: Jonathan Eagles
> Assignee: Jonathan Eagles
> Priority: Major
> Attachments: TEZ-4031.001.patch, TEZ-4031.002.patch, TEZ-4031.003.patch, TEZ-4031.004.patch
>
>
> {code}
> $ git grep git-wip
> Tez_DOAP.rdf: rdf:resource="https://git-wip-us.apache.org/repos/asf/tez.git"/>
> Tez_DOAP.rdf: rdf:resource="https://git-wip-us.apache.org/repos/asf?p=tez.git"/>
> docs/src/site/site.xml: href="https://git-wip-us.apache.org/repos/asf/tez.git" alt="Use git clone https://git-wip-us.apache.org/repos/asf/tez.git" />
> pom.xml: scm:git:https://git-wip-us.apache.org/repos/asf/tez.git
> tez-api/src/test/java/org/apache/tez/common/TestVersionInfo.java: final String scmUrl = "scm:git:https://git-wip-us.apache.org/repos/asf/tez.git";
> tez-api/src/test/resources/test1-version-info.properties:scmurl=scm:git:https://git-wip-us.apache.org/repos/asf/tez.git
> tez-api/src/test/resources/test3-version-info.properties:scmurl=scm:git:https://git-wip-us.apache.org/repos/asf/tez.git
> tez-ui/src/main/webapp/package.json:"url": "https://git-wip-us.apache.org/repos/asf/tez.git"
> {code}
> In addition the cwiki needs to be updated
> https://cwiki.apache.org/confluence/display/TEZ/Committer+Guide
> https://cwiki.apache.org/confluence/display/TEZ/How+to+Contribute+to+Tez
> https://cwiki.apache.org/confluence/display/TEZ/Making+a+TEZ+Release
> https://cwiki.apache.org/confluence/display/TEZ/Updating+the+Tez+Website
[jira] [Commented] (TEZ-4031) Support tez gitbox migration
[ https://issues.apache.org/jira/browse/TEZ-4031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16781877#comment-16781877 ]

Kuhu Shukla commented on TEZ-4031:
----------------------------------

[~jeagles], thanks for the patch. Could you help me get level set on the part of the patch that changes the versions of maven-site-plugin and maven-project-info-reports-plugin? Is there a HADOOP-specific JIRA that has more context? Thanks!

> Support tez gitbox migration
> ----------------------------
>
> Key: TEZ-4031
> URL: https://issues.apache.org/jira/browse/TEZ-4031
> Project: Apache Tez
> Issue Type: Task
> Reporter: Jonathan Eagles
> Assignee: Jonathan Eagles
> Priority: Major
> Attachments: TEZ-4031.001.patch, TEZ-4031.002.patch
>
>
> {code}
> $ git grep git-wip
> Tez_DOAP.rdf: rdf:resource="https://git-wip-us.apache.org/repos/asf/tez.git"/>
> Tez_DOAP.rdf: rdf:resource="https://git-wip-us.apache.org/repos/asf?p=tez.git"/>
> docs/src/site/site.xml: href="https://git-wip-us.apache.org/repos/asf/tez.git" alt="Use git clone https://git-wip-us.apache.org/repos/asf/tez.git" />
> pom.xml: scm:git:https://git-wip-us.apache.org/repos/asf/tez.git
> tez-api/src/test/java/org/apache/tez/common/TestVersionInfo.java: final String scmUrl = "scm:git:https://git-wip-us.apache.org/repos/asf/tez.git";
> tez-api/src/test/resources/test1-version-info.properties:scmurl=scm:git:https://git-wip-us.apache.org/repos/asf/tez.git
> tez-api/src/test/resources/test3-version-info.properties:scmurl=scm:git:https://git-wip-us.apache.org/repos/asf/tez.git
> tez-ui/src/main/webapp/package.json:"url": "https://git-wip-us.apache.org/repos/asf/tez.git"
> {code}
> In addition the cwiki needs to be updated
> https://cwiki.apache.org/confluence/display/TEZ/Committer+Guide
> https://cwiki.apache.org/confluence/display/TEZ/How+to+Contribute+to+Tez
> https://cwiki.apache.org/confluence/display/TEZ/Making+a+TEZ+Release
> https://cwiki.apache.org/confluence/display/TEZ/Updating+the+Tez+Website
[jira] [Commented] (TEZ-4031) Support tez gitbox migration
[ https://issues.apache.org/jira/browse/TEZ-4031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16781847#comment-16781847 ]

Kuhu Shukla commented on TEZ-4031:
----------------------------------

Taking a look.

> Support tez gitbox migration
> ----------------------------
>
> Key: TEZ-4031
> URL: https://issues.apache.org/jira/browse/TEZ-4031
> Project: Apache Tez
> Issue Type: Task
> Reporter: Jonathan Eagles
> Assignee: Jonathan Eagles
> Priority: Major
> Attachments: TEZ-4031.001.patch, TEZ-4031.002.patch
>
>
> {code}
> $ git grep git-wip
> Tez_DOAP.rdf: rdf:resource="https://git-wip-us.apache.org/repos/asf/tez.git"/>
> Tez_DOAP.rdf: rdf:resource="https://git-wip-us.apache.org/repos/asf?p=tez.git"/>
> docs/src/site/site.xml: href="https://git-wip-us.apache.org/repos/asf/tez.git" alt="Use git clone https://git-wip-us.apache.org/repos/asf/tez.git" />
> pom.xml: scm:git:https://git-wip-us.apache.org/repos/asf/tez.git
> tez-api/src/test/java/org/apache/tez/common/TestVersionInfo.java: final String scmUrl = "scm:git:https://git-wip-us.apache.org/repos/asf/tez.git";
> tez-api/src/test/resources/test1-version-info.properties:scmurl=scm:git:https://git-wip-us.apache.org/repos/asf/tez.git
> tez-api/src/test/resources/test3-version-info.properties:scmurl=scm:git:https://git-wip-us.apache.org/repos/asf/tez.git
> tez-ui/src/main/webapp/package.json:"url": "https://git-wip-us.apache.org/repos/asf/tez.git"
> {code}
> In addition the cwiki needs to be updated
> https://cwiki.apache.org/confluence/display/TEZ/Committer+Guide
> https://cwiki.apache.org/confluence/display/TEZ/How+to+Contribute+to+Tez
> https://cwiki.apache.org/confluence/display/TEZ/Making+a+TEZ+Release
> https://cwiki.apache.org/confluence/display/TEZ/Updating+the+Tez+Website
[jira] [Commented] (TEZ-3995) Fix dot files produced by tests to prevent ASF license warnings in yetus
[ https://issues.apache.org/jira/browse/TEZ-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16781845#comment-16781845 ]

Kuhu Shukla commented on TEZ-3995:
----------------------------------

[~jeagles] Can you take a look at the latest patch [~jmarhuen] posted here? Let me know if any inputs are required. Thanks a lot!

> Fix dot files produced by tests to prevent ASF license warnings in yetus
> ------------------------------------------------------------------------
>
> Key: TEZ-3995
> URL: https://issues.apache.org/jira/browse/TEZ-3995
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Jonathan Eagles
> Assignee: Jaume M
> Priority: Major
> Fix For: 0.9.2, 0.10.0
> Attachments: TEZ-3995.1-branch-0.9.patch, TEZ-3995.1.patch, TEZ-3995.1.patch, TEZ-3995.2.patch
>
>
> From https://builds.apache.org/job/PreCommit-TEZ-Build-Yetus/10/artifact/out/patch-asflicense-problems.txt
> {code}
> Lines that start with ? in the ASF License report indicate files that do not have an Apache license header:
> !? /home/jenkins/jenkins-slave/workspace/PreCommit-TEZ-Build-Yetus/sourcedir/tez-plugins/tez-aux-services/build/test/data/dfs/name1/current/seen_txid
> !? /home/jenkins/jenkins-slave/workspace/PreCommit-TEZ-Build-Yetus/sourcedir/tez-plugins/tez-aux-services/build/test/data/dfs/name1/current/fsimage_000.md5
> !? /home/jenkins/jenkins-slave/workspace/PreCommit-TEZ-Build-Yetus/sourcedir/tez-plugins/tez-aux-services/build/test/data/dfs/name1/current/fsimage_000
> !? /home/jenkins/jenkins-slave/workspace/PreCommit-TEZ-Build-Yetus/sourcedir/tez-plugins/tez-aux-services/build/test/data/dfs/name1/current/VERSION
> !? /home/jenkins/jenkins-slave/workspace/PreCommit-TEZ-Build-Yetus/sourcedir/tez-plugins/tez-aux-services/build/test/data/dfs/name2/current/seen_txid
> !? /home/jenkins/jenkins-slave/workspace/PreCommit-TEZ-Build-Yetus/sourcedir/tez-plugins/tez-aux-services/build/test/data/dfs/name2/current/fsimage_000.md5
> !? /home/jenkins/jenkins-slave/workspace/PreCommit-TEZ-Build-Yetus/sourcedir/tez-plugins/tez-aux-services/build/test/data/dfs/name2/current/fsimage_000
> !? /home/jenkins/jenkins-slave/workspace/PreCommit-TEZ-Build-Yetus/sourcedir/tez-plugins/tez-aux-services/build/test/data/dfs/name2/current/VERSION
> !? /home/jenkins/jenkins-slave/workspace/PreCommit-TEZ-Build-Yetus/sourcedir/tez-plugins/tez-aux-services/build/test/data/dfs/data/data4/current/BP-1822420483-67.195.81.145-1538067940282/scanner.cursor
> !? /home/jenkins/jenkins-slave/workspace/PreCommit-TEZ-Build-Yetus/sourcedir/tez-plugins/tez-aux-services/build/test/data/dfs/data/data4/current/BP-1822420483-67.195.81.145-1538067940282/current/dfsUsed
> !? /home/jenkins/jenkins-slave/workspace/PreCommit-TEZ-Build-Yetus/sourcedir/tez-plugins/tez-aux-services/build/test/data/dfs/data/data4/current/BP-1822420483-67.195.81.145-1538067940282/current/finalized/subdir0/subdir0/blk_1073741827
> !? /home/jenkins/jenkins-slave/workspace/PreCommit-TEZ-Build-Yetus/sourcedir/tez-plugins/tez-aux-services/build/test/data/dfs/data/data4/current/BP-1822420483-67.195.81.145-1538067940282/current/VERSION
> !? /home/jenkins/jenkins-slave/workspace/PreCommit-TEZ-Build-Yetus/sourcedir/tez-plugins/tez-aux-services/build/test/data/dfs/data/data4/current/VERSION
> !? /home/jenkins/jenkins-slave/workspace/PreCommit-TEZ-Build-Yetus/sourcedir/tez-plugins/tez-aux-services/build/test/data/dfs/data/data9/current/BP-1822420483-67.195.81.145-1538067940282/scanner.cursor
> !? /home/jenkins/jenkins-slave/workspace/PreCommit-TEZ-Build-Yetus/sourcedir/tez-plugins/tez-aux-services/build/test/data/dfs/data/data9/current/BP-1822420483-67.195.81.145-1538067940282/current/dfsUsed
> !? /home/jenkins/jenkins-slave/workspace/PreCommit-TEZ-Build-Yetus/sourcedir/tez-plugins/tez-aux-services/build/test/data/dfs/data/data9/current/BP-1822420483-67.195.81.145-1538067940282/current/finalized/subdir0/subdir0/blk_1073741826
> !? /home/jenkins/jenkins-slave/workspace/PreCommit-TEZ-Build-Yetus/sourcedir/tez-plugins/tez-aux-services/build/test/data/dfs/data/data9/current/BP-1822420483-67.195.81.145-1538067940282/current/VERSION
> !? /home/jenkins/jenkins-slave/workspace/PreCommit-TEZ-Build-Yetus/sourcedir/tez-plugins/tez-aux-services/build/test/data/dfs/data/data9/current/VERSION
> !? /home/jenkins/jenkins-slave/workspace/PreCommit-TEZ-Build-Yetus/sourcedir/tez-plugins/tez-aux-services/build/test/data/dfs/data/data10/current/BP-1822420483-67.195.81.145-1538067940282/scanner.cursor
> !?
>
[jira] [Commented] (TEZ-4002) CHANGES.txt for 0.9.2 Release
[ https://issues.apache.org/jira/browse/TEZ-4002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16780926#comment-16780926 ] Kuhu Shukla commented on TEZ-4002: -- [~jeagles] could you help take a look? Appreciate it. > CHANGES.txt for 0.9.2 Release > - > > Key: TEZ-4002 > URL: https://issues.apache.org/jira/browse/TEZ-4002 > Project: Apache Tez > Issue Type: Task >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Major > Attachments: TEZ-4002.001.patch, TEZ-4002.002.patch > > > Add CHANGES.txt for 0.9.2 line. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (TEZ-4002) CHANGES.txt for 0.9.2 Release
[ https://issues.apache.org/jira/browse/TEZ-4002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated TEZ-4002: - Attachment: TEZ-4002.002.patch > CHANGES.txt for 0.9.2 Release > - > > Key: TEZ-4002 > URL: https://issues.apache.org/jira/browse/TEZ-4002 > Project: Apache Tez > Issue Type: Task >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Major > Attachments: TEZ-4002.001.patch, TEZ-4002.002.patch > > > Add CHANGES.txt for 0.9.2 line. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TEZ-4047) Tez trademark in xml is causing xml parsing issue
[ https://issues.apache.org/jira/browse/TEZ-4047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16780757#comment-16780757 ] Kuhu Shukla commented on TEZ-4047: -- +1. lgtm. Committing this. > Tez trademark in xml is causing xml parsing issue > - > > Key: TEZ-4047 > URL: https://issues.apache.org/jira/browse/TEZ-4047 > Project: Apache Tez > Issue Type: Bug >Reporter: Jonathan Eagles >Assignee: Jonathan Eagles >Priority: Major > Attachments: TEZ-4047.001.patch > > > {code} > docs/src/site/site.xml: > [Fatal Error] site.xml:97:34: The entity "reg" was referenced, but not > declared. > java.lang.RuntimeException: org.xml.sax.SAXParseException; systemId: > file:/testptch/tez/./docs/src/site/site.xml; lineNumber: 97; columnNumber: > 34; The entity "reg" was referenced, but not declared. > at > jdk.nashorn.internal.runtime.ScriptRuntime.apply(ScriptRuntime.java:397) > at > jdk.nashorn.api.scripting.NashornScriptEngine.evalImpl(NashornScriptEngine.java:449) > at > jdk.nashorn.api.scripting.NashornScriptEngine.evalImpl(NashornScriptEngine.java:406) > at > jdk.nashorn.api.scripting.NashornScriptEngine.evalImpl(NashornScriptEngine.java:402) > at > jdk.nashorn.api.scripting.NashornScriptEngine.eval(NashornScriptEngine.java:155) > at javax.script.AbstractScriptEngine.eval(AbstractScriptEngine.java:264) > at com.sun.tools.script.shell.Main.evaluateString(Main.java:298) > at com.sun.tools.script.shell.Main.evaluateString(Main.java:319) > at com.sun.tools.script.shell.Main.access$300(Main.java:37) > at com.sun.tools.script.shell.Main$3.run(Main.java:217) > at com.sun.tools.script.shell.Main.main(Main.java:48) > Caused by: org.xml.sax.SAXParseException; systemId: > file:/testptch/tez/./docs/src/site/site.xml; lineNumber: 97; columnNumber: > 34; The entity "reg" was referenced, but not declared. 
> at > com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:257) > at > com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:339) > at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:205) > at > jdk.nashorn.internal.scripts.Script$Recompilation$2$19313A$\^system_init\_.XMLDocument(:747) > at jdk.nashorn.internal.scripts.Script$1$\^string\_.:program(:1) > at > jdk.nashorn.internal.runtime.ScriptFunctionData.invoke(ScriptFunctionData.java:637) > at > jdk.nashorn.internal.runtime.ScriptFunction.invoke(ScriptFunction.java:494) > at > jdk.nashorn.internal.runtime.ScriptRuntime.apply(ScriptRuntime.java:393) > ... 10 more > {code} > Also output from xmllint verifies xml issue as well. > {code} > xmllint ./docs/src/site/site.xml > .//src/site/site.xml:97: parser error : Entity 'reg' not defined > http://tez.apache.org/"/> > ^ > .//src/site/site.xml:123: parser error : Entity 'reg' not defined > > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
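The parser error above can be reproduced outside the Tez build. XML predefines only five named entities (`amp`, `lt`, `gt`, `apos`, `quot`), so `&reg;` is undeclared unless a DTD defines it, while the numeric character reference `&#174;` is always valid. The following is a minimal sketch, not the content of TEZ-4047.001.patch: the class name `EntityCheck` and the `<banner>` element are hypothetical, and whether the patch used a numeric reference is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class EntityCheck {
    // Returns true if the string is well-formed XML, false if the parser rejects it.
    static boolean parses(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing "[Fatal Error]" to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            return false; // e.g. The entity "reg" was referenced, but not declared.
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical element; only the entity usage matters here.
        String named = "<banner name=\"Apache Tez&reg;\"/>";
        String numeric = "<banner name=\"Apache Tez&#174;\"/>";
        System.out.println("named entity parses:   " + parses(named));
        System.out.println("numeric entity parses: " + parses(numeric));
    }
}
```

A numeric reference keeps the file parseable by any XML tool, including xmllint, without depending on an external DTD.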
[jira] [Commented] (TEZ-4050) maven site is failing due to missing configuration.
[ https://issues.apache.org/jira/browse/TEZ-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16780750#comment-16780750 ] Kuhu Shukla commented on TEZ-4050: -- +1. lgtm. > maven site is failing due to missing configuration. > --- > > Key: TEZ-4050 > URL: https://issues.apache.org/jira/browse/TEZ-4050 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Jonathan Eagles >Assignee: Jonathan Eagles >Priority: Major > Attachments: TEZ-4050.001.patch > > > {code} > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-site-plugin:3.4:stage (default-cli) on project > tez-docs: Missing site information in the distribution management of the > project Tez (org.apache.tez:tez-docs:0.10.1-SNAPSHOT) -> [Help 1] > {code} > From the maven site plugin usage page we can see that we are missing configuration. > https://maven.apache.org/plugins/maven-site-plugin/usage.html > {code} > <project> > ... > <distributionManagement> > <site> > <id>www.yourcompany.com</id> > <url>scp://www.yourcompany.com/www/docs/project/</url> > </site> > </distributionManagement> > ... > </project> > {code} > Tez does not use this url to deploy and neither does hadoop, but it is needed > to stage site documentation. The url is only used during site:deploy, which is > never called during the Tez QA step. > This jira aims to provide a placeholder (the same as hadoop) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TEZ-4049) Fix findbugs issues in NotRunningJob
[ https://issues.apache.org/jira/browse/TEZ-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16780707#comment-16780707 ] Kuhu Shukla commented on TEZ-4049: -- Thank you for the patch [~jeagles]! Committed to branch-0.9, master. > Fix findbugs issues in NotRunningJob > > > Key: TEZ-4049 > URL: https://issues.apache.org/jira/browse/TEZ-4049 > Project: Apache Tez > Issue Type: Bug >Reporter: Jonathan Eagles >Assignee: Jonathan Eagles >Priority: Major > Fix For: 0.9.2, 0.10.1 > > Attachments: TEZ-4049.001.patch > > > Introduced by TEZ-4035. Remove fixes while keeping 3.2.0 api compatibility. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TEZ-4049) Fix findbugs issues in NotRunningJob
[ https://issues.apache.org/jira/browse/TEZ-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16780692#comment-16780692 ] Kuhu Shukla commented on TEZ-4049: -- +1. lgtm. Committing to master, branch-0.9. > Fix findbugs issues in NotRunningJob > > > Key: TEZ-4049 > URL: https://issues.apache.org/jira/browse/TEZ-4049 > Project: Apache Tez > Issue Type: Bug >Reporter: Jonathan Eagles >Assignee: Jonathan Eagles >Priority: Major > Attachments: TEZ-4049.001.patch > > > Introduced by TEZ-4035. Remove fixes while keeping 3.2.0 api compatibility. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TEZ-4043) Create a yetus compatible checkstyle configuration
[ https://issues.apache.org/jira/browse/TEZ-4043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16769997#comment-16769997 ] Kuhu Shukla commented on TEZ-4043: -- Thank you [~jeagles] for the patch! Committed to master, branch-0.9. > Create a yetus compatible checkstyle configuration > -- > > Key: TEZ-4043 > URL: https://issues.apache.org/jira/browse/TEZ-4043 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Jonathan Eagles >Assignee: Jonathan Eagles >Priority: Major > Fix For: 0.9.2, 0.10.1 > > Attachments: TEZ-4043.001.patch, TEZ-4043.002.patch > > > Tez follows Hadoop source code guidelines with the exception of 120 character > line length. > http://maven.apache.org/plugins/maven-checkstyle-plugin/examples/multi-module-config.html -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TEZ-4043) Create a yetus compatible checkstyle configuration
[ https://issues.apache.org/jira/browse/TEZ-4043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16769774#comment-16769774 ] Kuhu Shukla commented on TEZ-4043: -- +1. lgtm. Committing to master and branch-0.9. > Create a yetus compatible checkstyle configuration > -- > > Key: TEZ-4043 > URL: https://issues.apache.org/jira/browse/TEZ-4043 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Jonathan Eagles >Assignee: Jonathan Eagles >Priority: Major > Attachments: TEZ-4043.001.patch, TEZ-4043.002.patch > > > Tez follows Hadoop source code guidelines with the exception of 120 character > line length. > http://maven.apache.org/plugins/maven-checkstyle-plugin/examples/multi-module-config.html -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TEZ-4004) Update jetty9 to align with Hadoop and Hive
[ https://issues.apache.org/jira/browse/TEZ-4004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16768533#comment-16768533 ] Kuhu Shukla commented on TEZ-4004: -- [~jeagles], precommit looks good. +1. Committing this shortly. > Update jetty9 to align with Hadoop and Hive > --- > > Key: TEZ-4004 > URL: https://issues.apache.org/jira/browse/TEZ-4004 > Project: Apache Tez > Issue Type: Bug >Reporter: Jonathan Eagles >Assignee: Jonathan Eagles >Priority: Major > Fix For: 0.10.0, 0.10.1 > > Attachments: TEZ-4004.001-branch-0.9.patch, TEZ-4004.001.patch > > > https://abi-laboratory.pro/index.php?view=timeline=java=jetty > https://issues.apache.org/jira/browse/HADOOP-15815 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TEZ-4004) Update jetty9 to align with Hadoop and Hive
[ https://issues.apache.org/jira/browse/TEZ-4004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16767598#comment-16767598 ] Kuhu Shukla commented on TEZ-4004: -- Triggered pre-commit for this to see if TEZ-4041 fix changes anything. > Update jetty9 to align with Hadoop and Hive > --- > > Key: TEZ-4004 > URL: https://issues.apache.org/jira/browse/TEZ-4004 > Project: Apache Tez > Issue Type: Bug >Reporter: Jonathan Eagles >Assignee: Jonathan Eagles >Priority: Major > Fix For: 0.10.0, 0.10.1 > > Attachments: TEZ-4004.001-branch-0.9.patch, TEZ-4004.001.patch > > > https://abi-laboratory.pro/index.php?view=timeline=java=jetty > https://issues.apache.org/jira/browse/HADOOP-15815 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TEZ-4041) TestExtServicesWithLocalMode fails in docker
[ https://issues.apache.org/jira/browse/TEZ-4041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16767584#comment-16767584 ] Kuhu Shukla commented on TEZ-4041: -- +1. lgtm. Committing this to master and branch-0.9. [~jeagles] thank you for the patch. > TestExtServicesWithLocalMode fails in docker > > > Key: TEZ-4041 > URL: https://issues.apache.org/jira/browse/TEZ-4041 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Jonathan Eagles >Assignee: Jonathan Eagles >Priority: Major > Attachments: TEZ-4041.001.patch > > > {code} > 2019-02-13 00:24:33,703 INFO [DAGAppMaster Thread] service.AbstractService > (AbstractService.java:noteFailure(267)) - Service > org.apache.tez.dag.app.DAGAppMaster failed in state INITED > org.apache.tez.dag.api.TezUncheckedException: > java.lang.reflect.InvocationTargetException > at > org.apache.tez.dag.app.TaskCommunicatorManager.createCustomTaskCommunicator(TaskCommunicatorManager.java:215) > at > org.apache.tez.dag.app.TaskCommunicatorManager.createTaskCommunicator(TaskCommunicatorManager.java:184) > at > org.apache.tez.dag.app.TaskCommunicatorManager.(TaskCommunicatorManager.java:152) > at > org.apache.tez.dag.app.DAGAppMaster.createTaskCommunicatorManager(DAGAppMaster.java:1088) > at > org.apache.tez.dag.app.DAGAppMaster.serviceInit(DAGAppMaster.java:532) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) > at org.apache.tez.dag.app.DAGAppMaster$9.run(DAGAppMaster.java:2606) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1686) > at > org.apache.tez.dag.app.DAGAppMaster.initAndStartAppMaster(DAGAppMaster.java:2603) > at org.apache.tez.client.LocalClient$1.run(LocalClient.java:327) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.lang.reflect.InvocationTargetException > at 
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at > org.apache.tez.dag.app.TaskCommunicatorManager.createCustomTaskCommunicator(TaskCommunicatorManager.java:213) > ... 12 more > Caused by: java.lang.NullPointerException > at > org.apache.tez.test.service.rpc.TezTestServiceProtocolProtos$SubmitWorkRequestProto$Builder.setUser(TezTestServiceProtocolProtos.java:5549) > at > org.apache.tez.dag.app.taskcomm.TezTestServiceTaskCommunicatorImpl.(TezTestServiceTaskCommunicatorImpl.java:65) > ... 17 more > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TEZ-4040) Upgrade RoaringBitmap version to avoid NoSuchMethodError
[ https://issues.apache.org/jira/browse/TEZ-4040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16767462#comment-16767462 ] Kuhu Shukla commented on TEZ-4040: -- [~jeagles] will fix the test failure in a separate JIRA. I am committing v1 of this patch to master and branch-0.9. > Upgrade RoaringBitmap version to avoid NoSuchMethodError > > > Key: TEZ-4040 > URL: https://issues.apache.org/jira/browse/TEZ-4040 > Project: Apache Tez > Issue Type: Task >Reporter: Jonathan Eagles >Assignee: Jonathan Eagles >Priority: Major > Attachments: 0.4.9.api.txt, 0.5.11.api.txt, 0.5.21.api.txt, > TEZ-4040.001.patch, TEZ-4040.002.patch > > > A common request is to use the runOptimize function, which is present in later > versions of RoaringBitmap -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (TEZ-4037) Add back DAG search status KILLED
[ https://issues.apache.org/jira/browse/TEZ-4037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated TEZ-4037: - Fix Version/s: 0.10.1 0.9.2 > Add back DAG search status KILLED > -- > > Key: TEZ-4037 > URL: https://issues.apache.org/jira/browse/TEZ-4037 > Project: Apache Tez > Issue Type: Task > Components: UI >Reporter: Jonathan Eagles >Assignee: Jonathan Eagles >Priority: Major > Fix For: 0.9.2, 0.10.1 > > Attachments: TEZ-4037.001.patch > > > https://issues.apache.org/jira/browse/TEZ-2447 removed KILLED since sometimes > this status can fail to search all KILLED DAGs. This jira re-adds KILLED dag > status search since it still has value; the focus should instead be on fixing the > DAGs that fail to write killed status to the history log file. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TEZ-4037) Add back DAG search status KILLED
[ https://issues.apache.org/jira/browse/TEZ-4037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16766345#comment-16766345 ] Kuhu Shukla commented on TEZ-4037: -- Thank you for the report and patch [~jeagles]. Took a look at how it was done in TEZ-2447 and this change makes sense. +1. Will commit this shortly to master and branch-0.9. > Add back DAG search status KILLED > -- > > Key: TEZ-4037 > URL: https://issues.apache.org/jira/browse/TEZ-4037 > Project: Apache Tez > Issue Type: Task > Components: UI >Reporter: Jonathan Eagles >Assignee: Jonathan Eagles >Priority: Major > Attachments: TEZ-4037.001.patch > > > https://issues.apache.org/jira/browse/TEZ-2447 removed KILLED since sometimes > this status can fail to search all KILLED DAGs. This jira re-adds KILLED dag > status search since it still has value; the focus should instead be on fixing the > DAGs that fail to write killed status to the history log file. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TEZ-4036) TestMockDAGAppMaster#testInternalPreemption should assert for failed state
[ https://issues.apache.org/jira/browse/TEZ-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16761014#comment-16761014 ] Kuhu Shukla commented on TEZ-4036: -- [~jeagles], request for review. Thanks a lot! > TestMockDAGAppMaster#testInternalPreemption should assert for failed state > -- > > Key: TEZ-4036 > URL: https://issues.apache.org/jira/browse/TEZ-4036 > Project: Apache Tez > Issue Type: Bug >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Major > Attachments: TEZ-4036.001.patch > > > Due to the root cause mentioned in TEZ-3950, the test fails regularly. Until the > fix for that JIRA is in (which requires a fair amount of redesign), this change adds > an assert for the failed state to the test, as this is now an expected state for the task. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (TEZ-4036) TestMockDAGAppMaster#testInternalPreemption should assert for failed state
[ https://issues.apache.org/jira/browse/TEZ-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated TEZ-4036: - Attachment: TEZ-4036.001.patch > TestMockDAGAppMaster#testInternalPreemption should assert for failed state > -- > > Key: TEZ-4036 > URL: https://issues.apache.org/jira/browse/TEZ-4036 > Project: Apache Tez > Issue Type: Bug >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Major > Attachments: TEZ-4036.001.patch > > > Due to the root cause mentioned in TEZ-3950, the test fails regularly. Until the > fix for that JIRA is in (which requires a fair amount of redesign), this change adds > an assert for the failed state to the test, as this is now an expected state for the task. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (TEZ-4036) TestMockDAGAppMaster#testInternalPreemption should assert for failed state
Kuhu Shukla created TEZ-4036: Summary: TestMockDAGAppMaster#testInternalPreemption should assert for failed state Key: TEZ-4036 URL: https://issues.apache.org/jira/browse/TEZ-4036 Project: Apache Tez Issue Type: Bug Reporter: Kuhu Shukla Assignee: Kuhu Shukla Due to the root cause mentioned in TEZ-3950, the test fails regularly. Until the fix for that JIRA is in (which requires a fair amount of redesign), this change adds an assert for the failed state to the test, as this is now an expected state for the task. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TEZ-4032) TEZ will throw "Client cannot authenticate via:[TOKEN, KERBEROS]" when used with HDFS federation(non viewfs, only hdfs schema used).
[ https://issues.apache.org/jira/browse/TEZ-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16750381#comment-16750381 ] Kuhu Shukla commented on TEZ-4032: -- Will take a look asap. > TEZ will throw "Client cannot authenticate via:[TOKEN, KERBEROS]" when used > with HDFS federation(non viewfs, only hdfs schema used). > -- > > Key: TEZ-4032 > URL: https://issues.apache.org/jira/browse/TEZ-4032 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.9.1 >Reporter: zhangbutao >Priority: Major > Attachments: TEZ-4032.001.patch > > > I run a Hive-on-Tez job on a cluster with HDFS federation and Kerberos. The Hadoop cluster > has multiple namespaces (hdfs://ns1, hdfs://ns2, hdfs://ns3, ...) and we don't > use the viewfs scheme. The Hive Tez job throws the following error when the table > is created in hdfs://ns2 (default configuration fs.defaultFS=hdfs://ns1): > {code:java} > 2019-01-21 15:43:46,507 [WARN] [TezChild] |ipc.Client|: Exception encountered > while connecting to the server : > org.apache.hadoop.security.AccessControlException: Client cannot authenticate > via:[TOKEN, KERBEROS] > 2019-01-21 15:43:46,507 [INFO] [TezChild] |retry.RetryInvocationHandler|: > java.io.IOException: DestHost:destPort docker5.cmss.com:8020 , > LocalHost:localPort docker1.cmss.com/10.254.10.116:0. Failed on local > exception: java.io.IOException: > org.apache.hadoop.security.AccessControlException: Client cannot authenticate > via:[TOKEN, KERBEROS], while invoking > ClientNamenodeProtocolTranslatorPB.getFileInfo over > docker5.cmss.com/10.254.2.106:8020 after 14 failover attempts. Trying to > failover after sleeping for 10827ms. 
> 2019-01-21 15:43:57,338 [WARN] [TezChild] |ipc.Client|: Exception encountered > while connecting to the server : > org.apache.hadoop.security.AccessControlException: Client cannot authenticate > via:[TOKEN, KERBEROS] > 2019-01-21 15:43:57,363 [ERROR] [TezChild] |tez.MapRecordSource|: > org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while > processing writable (null) > at > org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:568) > at > org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:92) > at > org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:76) > at > org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:419) > at > org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:267) > at > org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:250) > at > org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374) > at > org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73) > at > org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1682) > at > org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61) > at > org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37) > at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) > at > com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:108) > at > com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:41) > at > 
com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:77) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: > org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: > DestHost:destPort docker4.cmss.com:8020 , LocalHost:localPort > docker1.cmss.com/10.254.10.116:0. Failed on local exception: > java.io.IOException: org.apache.hadoop.security.AccessControlException: > Client cannot authenticate via:[TOKEN, KERBEROS] > at > org.apache.hadoop.hive.ql.exec.FileSinkOperator.createBucketFiles(FileSinkOperator.java:742) > at > org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:897) > at >
[jira] [Commented] (TEZ-4030) Unable to find hive database name in tez applications
[ https://issues.apache.org/jira/browse/TEZ-4030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734195#comment-16734195 ] Kuhu Shukla commented on TEZ-4030: -- [~Sreenath] Could you help with this? Thanks! > Unable to find hive database name in tez applications > - > > Key: TEZ-4030 > URL: https://issues.apache.org/jira/browse/TEZ-4030 > Project: Apache Tez > Issue Type: Improvement > Components: UI >Affects Versions: 0.8.4 >Reporter: Ashish Doneriya >Priority: Minor > > Currently there is no way that I could find the name of the hive database > using application id. In mapreduce applications I can find the database name > from its configuration but in tez there is no such property. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TEZ-4027) DagAwareYarnTaskScheduler can miscompute blocked vertices and cause a hang
[ https://issues.apache.org/jira/browse/TEZ-4027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16723615#comment-16723615 ] Kuhu Shukla commented on TEZ-4027: -- [~jlowe], [~jeagles] request for comments/review. Thanks a lot! > DagAwareYarnTaskScheduler can miscompute blocked vertices and cause a hang > -- > > Key: TEZ-4027 > URL: https://issues.apache.org/jira/browse/TEZ-4027 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.9.1, 0.10.0 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Major > Attachments: TEZ-4027.001.patch, TEZ-4027.002.patch > > > In a scenario where there are retroactive failures and the YARN queue is > too full to allow new container assignments, the scheduler can > miscompute the blocked vertex set: it flips the bits only up to the length of > the bitset, which may not reflect the total number of vertices. As a result, > no preemption occurs and the DAG hangs. > {code} > @GuardedBy("DagAwareYarnTaskScheduler.this") > BitSet createVertexBlockedSet() { > BitSet blocked = new BitSet(); > Entry entry = priorityStats.lastEntry(); > if (entry != null) { > RequestPriorityStats stats = entry.getValue(); > blocked.or(stats.allowedVertices); > blocked.flip(0, blocked.length()); > blocked.or(stats.descendants); > } > return blocked; > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
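The effect described in this issue follows from how `java.util.BitSet.length()` is defined: it returns the index of the highest set bit plus one, not a fixed capacity. A minimal sketch of the consequence is below. The class `BlockedSetDemo`, both helper methods, and the `numVertices` parameter are hypothetical illustrations and are not taken from the attached patches; passing an explicit vertex count is only one possible way to address the problem.

```java
import java.util.BitSet;

public class BlockedSetDemo {
    // Mirrors the flip in the quoted createVertexBlockedSet():
    // only bits below length() (highest set bit + 1) are inverted.
    static BitSet flipToLength(BitSet allowed) {
        BitSet blocked = new BitSet();
        blocked.or(allowed);
        blocked.flip(0, blocked.length());
        return blocked;
    }

    // Hypothetical alternative: invert over an explicit total vertex count,
    // so vertices above the highest allowed index are also marked blocked.
    static BitSet flipToVertexCount(BitSet allowed, int numVertices) {
        BitSet blocked = new BitSet();
        blocked.or(allowed);
        blocked.flip(0, numVertices);
        return blocked;
    }

    public static void main(String[] args) {
        // Assume a DAG with 5 vertices where only vertices 0 and 1 are allowed.
        BitSet allowed = new BitSet();
        allowed.set(0);
        allowed.set(1);

        System.out.println(flipToLength(allowed));         // prints {}
        System.out.println(flipToVertexCount(allowed, 5)); // prints {2, 3, 4}
    }
}
```

With the length-based flip, vertices 2 to 4 are never reported as blocked, which matches the "no preemption" symptom in the description.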
[jira] [Updated] (TEZ-4027) DagAwareYarnTaskScheduler can miscompute blocked vertices and cause a hang
[ https://issues.apache.org/jira/browse/TEZ-4027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated TEZ-4027: - Attachment: TEZ-4027.002.patch > DagAwareYarnTaskScheduler can miscompute blocked vertices and cause a hang > -- > > Key: TEZ-4027 > URL: https://issues.apache.org/jira/browse/TEZ-4027 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.9.1, 0.10.0 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Major > Attachments: TEZ-4027.001.patch, TEZ-4027.002.patch > > > In a scenario where there are retroactive failures and the YARN queue is > too full to allow new container assignments, the scheduler can > miscompute the blocked vertex set: it flips the bits only up to the length of > the bitset, which may not reflect the total number of vertices. As a result, > no preemption occurs and the DAG hangs. > {code} > @GuardedBy("DagAwareYarnTaskScheduler.this") > BitSet createVertexBlockedSet() { > BitSet blocked = new BitSet(); > Entry entry = priorityStats.lastEntry(); > if (entry != null) { > RequestPriorityStats stats = entry.getValue(); > blocked.or(stats.allowedVertices); > blocked.flip(0, blocked.length()); > blocked.or(stats.descendants); > } > return blocked; > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (TEZ-4027) DagAwareYarnTaskScheduler can miscompute blocked vertices and cause a hang
[ https://issues.apache.org/jira/browse/TEZ-4027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated TEZ-4027: - Attachment: TEZ-4027.001.patch > DagAwareYarnTaskScheduler can miscompute blocked vertices and cause a hang > -- > > Key: TEZ-4027 > URL: https://issues.apache.org/jira/browse/TEZ-4027 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.9.1, 0.10.0 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Major > Attachments: TEZ-4027.001.patch > > > In a scenario where there are retroactive failures and the YARN queue is > too full to allow new container assignments, the scheduler can > miscompute the blocked vertex set: it flips the bits only up to the length of > the bitset, which may not reflect the total number of vertices. As a result, > no preemption occurs and the DAG hangs. > {code} > @GuardedBy("DagAwareYarnTaskScheduler.this") > BitSet createVertexBlockedSet() { > BitSet blocked = new BitSet(); > Entry entry = priorityStats.lastEntry(); > if (entry != null) { > RequestPriorityStats stats = entry.getValue(); > blocked.or(stats.allowedVertices); > blocked.flip(0, blocked.length()); > blocked.or(stats.descendants); > } > return blocked; > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (TEZ-4027) DagAwareYarnTaskScheduler can miscompute blocked vertices and cause a hang
Kuhu Shukla created TEZ-4027: Summary: DagAwareYarnTaskScheduler can miscompute blocked vertices and cause a hang Key: TEZ-4027 URL: https://issues.apache.org/jira/browse/TEZ-4027 Project: Apache Tez Issue Type: Bug Affects Versions: 0.9.1, 0.10.0 Reporter: Kuhu Shukla Assignee: Kuhu Shukla In a scenario where there are retroactive failures and the YARN queue is too full to allow new container assignments, the scheduler can miscompute the blocked vertex set: it flips the bits only up to the length of the bitset, which may not reflect the total number of vertices. As a result, no preemption occurs and the DAG hangs. {code} @GuardedBy("DagAwareYarnTaskScheduler.this") BitSet createVertexBlockedSet() { BitSet blocked = new BitSet(); Entry entry = priorityStats.lastEntry(); if (entry != null) { RequestPriorityStats stats = entry.getValue(); blocked.or(stats.allowedVertices); blocked.flip(0, blocked.length()); blocked.or(stats.descendants); } return blocked; } {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TEZ-3476) Need a way to account for container localization.
[ https://issues.apache.org/jira/browse/TEZ-3476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16695230#comment-16695230 ] Kuhu Shukla commented on TEZ-3476: -- I did some more manual testing: after adding a startup delay by hacking TezChild to include an additional sleep, I can see the expected tasks getting speculated. This patch can be reviewed for an initial pass. [~jeagles], [~jlowe] request for initial comments and suggestions. Thanks a lot! > Need a way to account for container localization. > - > > Key: TEZ-3476 > URL: https://issues.apache.org/jira/browse/TEZ-3476 > Project: Apache Tez > Issue Type: Bug >Reporter: Eric Payne >Assignee: Kuhu Shukla >Priority: Major > Attachments: TEZ-3476.001.patch > > > Tez task attempt start times don't reflect time spent in localization. > In the MapReduce framework, the time spent in localization was included in > the total runtime of each task attempt. But since Tez reuses containers, the > time spent localizing for a container is not captured. The start time of the > first attempt in that container will only be set after the localization has > completed. > The result is that attempts can appear as if they are not being run even > though there are resources available in the queue. An attempt can be assigned > to a container, but if the container is on a slow node and it takes a long > time to localize, the attempt state will remain pending until localization > completes. > The risk is that tasks will not be speculated during localization, since > they haven't started -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TEZ-4004) Update jetty9 to align with Hadoop and Hive
[ https://issues.apache.org/jira/browse/TEZ-4004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16694845#comment-16694845 ] Kuhu Shukla commented on TEZ-4004: -- [~jeagles], is the test failure related to the 0.9 version of the patch? > Update jetty9 to align with Hadoop and Hive > --- > > Key: TEZ-4004 > URL: https://issues.apache.org/jira/browse/TEZ-4004 > Project: Apache Tez > Issue Type: Bug >Reporter: Jonathan Eagles >Assignee: Jonathan Eagles >Priority: Major > Fix For: 0.10.0, 0.10.1 > > Attachments: TEZ-4004.001-branch-0.9.patch, TEZ-4004.001.patch > > > https://abi-laboratory.pro/index.php?view=timeline=java=jetty > https://issues.apache.org/jira/browse/HADOOP-15815 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TEZ-3476) Need a way to account for container localization.
[ https://issues.apache.org/jira/browse/TEZ-3476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16691242#comment-16691242 ] Kuhu Shukla commented on TEZ-3476: -- I am sorry for the late reply [~jeagles]. Thank you for the comments! I have been trying to do some more testing on this patch.
bq. Do attempt times now include container launch time?
Yes.
bq. Will speculation continue to be fair where attempts that required launching a container can be compared with attempts that reused a container?
If the initial container startup takes time, the task attempt would be speculated. A task attempt that used an already running container won't be speculated because of container startup slowness, since it has no startup delay. This patch introduces that behavior.
bq. Does this patch support the design principle in Tez to separate container and task?
As far as sending the task submission event earlier is concerned, I do not think it breaks the above-mentioned separation, except for the time accounting and for allowing speculation. Please correct me if I did not understand the question correctly. I will post some more test results soon.
> Need a way to account for container localization. > - > > Key: TEZ-3476 > URL: https://issues.apache.org/jira/browse/TEZ-3476 > Project: Apache Tez > Issue Type: Bug >Reporter: Eric Payne >Assignee: Kuhu Shukla >Priority: Major > Attachments: TEZ-3476.001.patch > > > Tez task attempt start times don't reflect time spent in localization. > In the MapReduce framework, the time spent in localization was included in > the total runtime of each task attempt. But since Tez reuses containers, the > time spent localizing for a container is not captured. The start time of the > first attempt in that container will only be set after the localization has > completed. > The result is that attempts can appear as if they are not being run even > though there are resources available in the queue.
An attempt can be assigned > to a container, but if the container is on a slow node and it takes a long > time to localize, the attempt state will remain pending until localization > completes. > The impact risk is that tasks will not speculate during localization since > they haven't started -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TEZ-3476) Need a way to account for container localization.
[ https://issues.apache.org/jira/browse/TEZ-3476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688295#comment-16688295 ] Kuhu Shukla commented on TEZ-3476: -- The v1 patch is less than ideal but it is a start. This change will undo parts of TEZ-3715 which may be a concern. I can move this into the scheduler service implementations but sending the event would require downcasting which is less than ideal as well. I have a test change and will try to come up with another where I can show that speculation kicks in for a slow container launch. I checked through TestTezJobs that the patch makes the task attempt go into submitted state before the container is launched. Appreciate any reviews. > Need a way to account for container localization. > - > > Key: TEZ-3476 > URL: https://issues.apache.org/jira/browse/TEZ-3476 > Project: Apache Tez > Issue Type: Bug >Reporter: Eric Payne >Assignee: Kuhu Shukla >Priority: Major > Attachments: TEZ-3476.001.patch > > > Tez task attempt start times don't reflect time spent in localization. > In the MapReduce framework, the time spent in localization was included in > the total runtime of each task attempt. But since Tez reuses containers, the > time spent localizing for a container is not captured. The start time of the > first attempt in that container will only be set after the localization has > completed. > The result is that attempts can appear as if they are not being run even > though there are resources available in the queue. An attempt can be assigned > to a container, but if the container is on a slow node and it takes a long > time to localize, the attempt state will remain pending until localization > completes. > The impact risk is that tasks will not speculate during localization since > they haven't started -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (TEZ-3476) Need a way to account for container localization.
[ https://issues.apache.org/jira/browse/TEZ-3476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated TEZ-3476: - Attachment: TEZ-3476.001.patch > Need a way to account for container localization. > - > > Key: TEZ-3476 > URL: https://issues.apache.org/jira/browse/TEZ-3476 > Project: Apache Tez > Issue Type: Bug >Reporter: Eric Payne >Assignee: Kuhu Shukla >Priority: Major > Attachments: TEZ-3476.001.patch > > > Tez task attempt start times don't reflect time spent in localization. > In the MapReduce framework, the time spent in localization was included in > the total runtime of each task attempt. But since Tez reuses containers, the > time spent localizing for a container is not captured. The start time of the > first attempt in that container will only be set after the localization has > completed. > The result is that attempts can appear as if they are not being run even > though there are resources available in the queue. An attempt can be assigned > to a container, but if the container is on a slow node and it takes a long > time to localize, the attempt state will remain pending until localization > completes. > The impact risk is that tasks will not speculate during localization since > they haven't started -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (TEZ-3476) Need a way to account for container localization.
[ https://issues.apache.org/jira/browse/TEZ-3476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla reassigned TEZ-3476: Assignee: Kuhu Shukla > Need a way to account for container localization. > - > > Key: TEZ-3476 > URL: https://issues.apache.org/jira/browse/TEZ-3476 > Project: Apache Tez > Issue Type: Bug >Reporter: Eric Payne >Assignee: Kuhu Shukla >Priority: Major > > Tez task attempt start times don't reflect time spent in localization. > In the MapReduce framework, the time spent in localization was included in > the total runtime of each task attempt. But since Tez reuses containers, the > time spent localizing for a container is not captured. The start time of the > first attempt in that container will only be set after the localization has > completed. > The result is that attempts can appear as if they are not being run even > though there are resources available in the queue. An attempt can be assigned > to a container, but if the container is on a slow node and it takes a long > time to localize, the attempt state will remain pending until localization > completes. > The impact risk is that tasks will not speculate during localization since > they haven't started -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (TEZ-4019) Modify Tez shuffle handler to use AuxiliaryLocalPathHandler instead of LocalDirAllocator
Kuhu Shukla created TEZ-4019: Summary: Modify Tez shuffle handler to use AuxiliaryLocalPathHandler instead of LocalDirAllocator Key: TEZ-4019 URL: https://issues.apache.org/jira/browse/TEZ-4019 Project: Apache Tez Issue Type: Improvement Reporter: Kuhu Shukla Assignee: Kuhu Shukla As with the MR shuffle handler, this new API (YARN-7244), exposed in Hadoop version 2.8.2 and up, helps keep the NM's view of which disks are good to use in sync with the auxiliary services' view. Tez currently compiles against 2.7, but when we move past that we should pick up this new behavior. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TEZ-4004) Update jetty9 to align with Hadoop and Hive
[ https://issues.apache.org/jira/browse/TEZ-4004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16672219#comment-16672219 ] Kuhu Shukla commented on TEZ-4004: -- Committed to branch-0.10.0. CC: [~ewohlstadter] for changes required to the branch's CHANGES.txt. Thanks! > Update jetty9 to align with Hadoop and Hive > --- > > Key: TEZ-4004 > URL: https://issues.apache.org/jira/browse/TEZ-4004 > Project: Apache Tez > Issue Type: Bug >Reporter: Jonathan Eagles >Assignee: Jonathan Eagles >Priority: Major > Fix For: 0.10.0, 0.10.1 > > Attachments: TEZ-4004.001-branch-0.9.patch, TEZ-4004.001.patch > > > https://abi-laboratory.pro/index.php?view=timeline=java=jetty > https://issues.apache.org/jira/browse/HADOOP-15815 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (TEZ-4004) Update jetty9 to align with Hadoop and Hive
[ https://issues.apache.org/jira/browse/TEZ-4004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated TEZ-4004: - Fix Version/s: 0.10.0 > Update jetty9 to align with Hadoop and Hive > --- > > Key: TEZ-4004 > URL: https://issues.apache.org/jira/browse/TEZ-4004 > Project: Apache Tez > Issue Type: Bug >Reporter: Jonathan Eagles >Assignee: Jonathan Eagles >Priority: Major > Fix For: 0.10.0, 0.10.1 > > Attachments: TEZ-4004.001-branch-0.9.patch, TEZ-4004.001.patch > > > https://abi-laboratory.pro/index.php?view=timeline=java=jetty > https://issues.apache.org/jira/browse/HADOOP-15815 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (TEZ-4004) Update jetty9 to align with Hadoop and Hive
[ https://issues.apache.org/jira/browse/TEZ-4004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated TEZ-4004: - Fix Version/s: 0.10.1 > Update jetty9 to align with Hadoop and Hive > --- > > Key: TEZ-4004 > URL: https://issues.apache.org/jira/browse/TEZ-4004 > Project: Apache Tez > Issue Type: Bug >Reporter: Jonathan Eagles >Assignee: Jonathan Eagles >Priority: Major > Fix For: 0.10.1 > > Attachments: TEZ-4004.001.patch > > > https://abi-laboratory.pro/index.php?view=timeline=java=jetty > https://issues.apache.org/jira/browse/HADOOP-15815 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TEZ-4004) Update jetty9 to align with Hadoop and Hive
[ https://issues.apache.org/jira/browse/TEZ-4004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16668759#comment-16668759 ] Kuhu Shukla commented on TEZ-4004: -- Committed to master. [~jeagles] can you provide a patch for 0.9 as there is a conflict. Thank you! > Update jetty9 to align with Hadoop and Hive > --- > > Key: TEZ-4004 > URL: https://issues.apache.org/jira/browse/TEZ-4004 > Project: Apache Tez > Issue Type: Bug >Reporter: Jonathan Eagles >Assignee: Jonathan Eagles >Priority: Major > Attachments: TEZ-4004.001.patch > > > https://abi-laboratory.pro/index.php?view=timeline=java=jetty > https://issues.apache.org/jira/browse/HADOOP-15815 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TEZ-4004) Update jetty9 to align with Hadoop and Hive
[ https://issues.apache.org/jira/browse/TEZ-4004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16668728#comment-16668728 ] Kuhu Shukla commented on TEZ-4004: -- +1 for the patch. Committing this shortly. > Update jetty9 to align with Hadoop and Hive > --- > > Key: TEZ-4004 > URL: https://issues.apache.org/jira/browse/TEZ-4004 > Project: Apache Tez > Issue Type: Bug >Reporter: Jonathan Eagles >Assignee: Jonathan Eagles >Priority: Major > Attachments: TEZ-4004.001.patch > > > https://abi-laboratory.pro/index.php?view=timeline=java=jetty > https://issues.apache.org/jira/browse/HADOOP-15815 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (TEZ-3990) The number of shuffle penalties for a host/inputAttemptIdentifier should be capped
[ https://issues.apache.org/jira/browse/TEZ-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated TEZ-3990: - Fix Version/s: (was: 0.10.0) 0.10.1 > The number of shuffle penalties for a host/inputAttemptIdentifier should be > capped > -- > > Key: TEZ-3990 > URL: https://issues.apache.org/jira/browse/TEZ-3990 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.9.1, 0.10.0 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Major > Fix For: 0.9.2, 0.10.1 > > Attachments: TEZ-3990.001.patch, TEZ-3990.002.patch, > TEZ-3990.003.patch, TEZ-3990.004.patch, TEZ-3990.005.patch, TEZ-3990.006.patch > > > In a scenario where the same mapId fetches fail, the penalty code allows > adding the same Host/InputAttemptIdentifier over and over with revised > penalty time that grows exponentially. It should at some point drop the > retrying and report failure to the AM asap to allow the job to rectify the > upstream output. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
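The capping idea described in TEZ-3990 can be sketched as follows. This is only an illustration under stated assumptions, not the code in the attached patches: the class `PenaltyCapSketch`, its fields, the string key, and the `-1` sentinel are all hypothetical. The sketch keeps the exponential growth of the penalty delay but stops after a fixed number of penalties for the same host/input pair, at which point the caller would report the failure to the AM instead of retrying.

```java
import java.util.HashMap;
import java.util.Map;

class PenaltyCapSketch {
    private final int maxPenalties;   // hypothetical cap on penalties per key
    private final long baseDelayMs;   // hypothetical first penalty delay
    private final Map<String, Integer> penaltyCounts = new HashMap<>();

    PenaltyCapSketch(int maxPenalties, long baseDelayMs) {
        this.maxPenalties = maxPenalties;
        this.baseDelayMs = baseDelayMs;
    }

    /**
     * Records one more penalty for the given host/input key.
     * Returns the delay in milliseconds before the next retry, or -1 when
     * the cap is exceeded and the failure should be reported instead.
     */
    long penalize(String hostAndInput) {
        int count = penaltyCounts.merge(hostAndInput, 1, Integer::sum);
        if (count > maxPenalties) {
            return -1L;
        }
        // Delay doubles with each penalty; the shift is bounded to avoid overflow.
        return baseDelayMs << Math.min(count - 1, 30);
    }

    public static void main(String[] args) {
        PenaltyCapSketch penalties = new PenaltyCapSketch(3, 100L);
        for (int i = 0; i < 4; i++) {
            System.out.println(penalties.penalize("host1/input7"));
        }
    }
}
```

With a cap of 3 and a base delay of 100 ms, the four calls in `main` return 100, 200, 400 and then -1.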
[jira] [Commented] (TEZ-4012) Add docker support for Tez.
[ https://issues.apache.org/jira/browse/TEZ-4012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16665608#comment-16665608 ] Kuhu Shukla commented on TEZ-4012: -- Thank you [~jeagles]. +1. Committing this shortly. > Add docker support for Tez. > --- > > Key: TEZ-4012 > URL: https://issues.apache.org/jira/browse/TEZ-4012 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Jonathan Eagles >Assignee: Jonathan Eagles >Priority: Major > Attachments: TEZ-4012.001.patch, TEZ-4012.002.patch, > TEZ-4012.003.patch > > > Hadoop label builds contain a mix of development tools and versions. In > particular H11-H20 are unusable by tez since protoc -version is 2.6.x and > hadoop only supports 2.5.0. This jira will allow builds across all H1-H20 > jenkins machines. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TEZ-4012) Add docker support for Tez.
[ https://issues.apache.org/jira/browse/TEZ-4012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16665204#comment-16665204 ] Kuhu Shukla commented on TEZ-4012: -- Thank you for the patch [~jeagles]. Just some more minor comments.
{code}
# Add a welcome message and environment checks.
COPY hadoop_env_checks.sh /root/hadoop_env_checks.sh
RUN chmod 755 /root/hadoop_env_checks.sh
{code}
This should be now referencing the renamed env shell script.
> Add docker support for Tez. > --- > > Key: TEZ-4012 > URL: https://issues.apache.org/jira/browse/TEZ-4012 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Jonathan Eagles >Assignee: Jonathan Eagles >Priority: Major > Attachments: TEZ-4012.001.patch, TEZ-4012.002.patch > > > Hadoop label builds contain a mix of development tools and versions. In > particular H11-H20 are unusable by tez since protoc -version is 2.6.x and > hadoop only supports 2.5.0. This jira will allow builds across all H1-H20 > jenkins machines. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (TEZ-4012) Add docker support for Tez.
[ https://issues.apache.org/jira/browse/TEZ-4012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16664322#comment-16664322 ] Kuhu Shukla edited comment on TEZ-4012 at 10/25/18 9:50 PM: I am not super familiar with yetus and precommit builds but can the shell script name be changed here from {{hadoop_env_checks}} to maybe {{tez_env_checks}}? The only place it is referenced seems to be in the Dockerfile. was (Author: kshukla): I am not super familiar with yetus and precommit builds but can the shell script name be changed here from {{hadoop_env_checks}} to maybe {{tez_env_checks}}? > Add docker support for Tez. > --- > > Key: TEZ-4012 > URL: https://issues.apache.org/jira/browse/TEZ-4012 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Jonathan Eagles >Assignee: Jonathan Eagles >Priority: Major > Attachments: TEZ-4012.001.patch > > > Hadoop label builds contain a mix of development tools and versions. In > particular H11-H20 are unusable by tez since protoc -version is 2.6.x and > hadoop only supports 2.5.0. This jira will allow builds across all H1-H20 > jenkins machines. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TEZ-4012) Add docker support for Tez.
[ https://issues.apache.org/jira/browse/TEZ-4012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16664322#comment-16664322 ] Kuhu Shukla commented on TEZ-4012: -- I am not super familiar with yetus and precommit builds but can the shell script name be changed here from {{hadoop_env_checks}} to maybe {{tez_env_checks}}? > Add docker support for Tez. > --- > > Key: TEZ-4012 > URL: https://issues.apache.org/jira/browse/TEZ-4012 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Jonathan Eagles >Assignee: Jonathan Eagles >Priority: Major > Attachments: TEZ-4012.001.patch > > > Hadoop label builds contain a mix of development tools and versions. In > particular H11-H20 are unusable by tez since protoc -version is 2.6.x and > hadoop only supports 2.5.0. This jira will allow builds across all H1-H20 > jenkins machines. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (TEZ-4013) Allow intermediate output/spill data encryption
Kuhu Shukla created TEZ-4013: Summary: Allow intermediate output/spill data encryption Key: TEZ-4013 URL: https://issues.apache.org/jira/browse/TEZ-4013 Project: Apache Tez Issue Type: Task Affects Versions: 0.9.1 Reporter: Kuhu Shukla Assignee: Kuhu Shukla Analogous to MAPREDUCE-5890 (Support for encrypting Intermediate data and spills in local filesystem), Tez should support encrypting spill data. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TEZ-3976) ShuffleManager reporting too many errors
[ https://issues.apache.org/jira/browse/TEZ-3976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16655924#comment-16655924 ] Kuhu Shukla commented on TEZ-3976: -- Thank you [~jmarhuen] for the patch. I took a cursory look at it: the new config waits for a configurable amount of time before reporting all the failures seen thus far. This approach looks good, but it differs from the ordered case, and I wonder if we should make them consistent. I will look at the patch more closely in the meantime. > ShuffleManager reporting too many errors > > > Key: TEZ-3976 > URL: https://issues.apache.org/jira/browse/TEZ-3976 > Project: Apache Tez > Issue Type: Bug >Reporter: Jaume M >Assignee: Jaume M >Priority: Major > Attachments: TEZ-3976.1.patch, TEZ-3976.2.patch, TEZ-3976.3.patch, > TEZ-3976.4.patch, TEZ-3976.5.patch, TEZ-3976.6.patch, TEZ-3976.7.patch > > > The symptoms are a lot of these logs are being shown: > {code:java} > 2018-06-15T18:09:35,811 INFO [Fetcher_B {Reducer_5} #0 ()] > org.apache.tez.runtime.library.common.shuffle.impl.ShuffleManager: Reducer_5: > Fetch failed for src: InputAttemptIdentifier [inputIdentifier=701, > attemptNumber=0, > pathComponent=attempt_152901963_0021_34_01_000701_0_12541_0, spillType=2, > spillId=0]InputIdentifier: InputAttemptIdentifier [inputIdentifier=701, > attemptNumber=0, > pathComponent=attempt_152901963_0021_34_01_000701_0_12541_0, spillType=2, > spillId=0], connectFailed: true > 2018-06-15T18:09:35,811 WARN [Fetcher_B {Reducer_5} #1 ()] > org.apache.tez.runtime.library.common.shuffle.Fetcher: copyInputs failed for > tasks [InputAttemptIdentifier [inputIdentifier=589, attemptNumber=0, > pathComponent=attempt_152901963_0021_34_01_000589_0_12445_0, spillType=2, > spillId=0]] > 2018-06-15T18:09:35,811 INFO [Fetcher_B {Reducer_5} #1 ()] > org.apache.tez.runtime.library.common.shuffle.impl.ShuffleManager: Reducer_5: > Fetch failed for src: InputAttemptIdentifier [inputIdentifier=589, > attemptNumber=0, > 
pathComponent=attempt_152901963_0021_34_01_000589_0_12445_0, spillType=2, > spillId=0]InputIdentifier: InputAttemptIdentifier [inputIdentifier=589, > attemptNumber=0, > pathComponent=attempt_152901963_0021_34_01_000589_0_12445_0, spillType=2, > spillId=0], connectFailed: true > {code} > Each of those translates into an event in the AM which finally crashes due to > OOM after around 30 minutes and around 10 million shuffle input errors (and > 10 million lines like the previous ones). When the ShuffleManager is closed > and the counters are reported there are many shuffle input errors, some of those > logs are: > {code:java} > 2018-06-15T17:46:30,988 INFO [TezTR-441963_21_34_4_0_4 > (152901963_0021_34_04_00_4)] runtime.LogicalIOProcessorRuntimeTask: > Final Counters for attempt_152901963_0021_34_04_00_4: Counters: 43 > [[org.apache.tez.common.counters.TaskCounter SPILLED_RECORDS=0, > NUM_SHUFFLED_INPUTS=26, NUM_FAILED_SHUFFLE_INPUTS=858965, > INPUT_RECORDS_PROCESSED=26, OUTPUT_RECORDS=1, OUTPUT_LARGE_RECORDS=0, > OUTPUT_BYTES=779472, OUTPUT_BYTES_WITH_OVERHEAD=779483, > OUTPUT_BYTES_PHYSICAL=780146, ADDITIONAL_SPILLS_BYTES_WRITTEN=0, > ADDITIONAL_SPILLS_BYTES_READ=0, ADDITIONAL_SPILL_COUNT=0, > SHUFFLE_BYTES=4207563, SHUFFLE_BYTES_DECOMPRESSED=20266603, > SHUFFLE_BYTES_TO_MEM=3380616, SHUFFLE_BYTES_TO_DISK=0, > SHUFFLE_BYTES_DISK_DIRECT=826947, SHUFFLE_PHASE_TIME=52516, > FIRST_EVENT_RECEIVED=1, LAST_EVENT_RECEIVED=1185][HIVE > RECORDS_OUT_INTERMEDIATE_Reducer_12=1, > RECORDS_OUT_OPERATOR_GBY_159=1, > RECORDS_OUT_OPERATOR_RS_160=1][TaskCounter_Reducer_12_INPUT_Map_11 > FIRST_EVENT_RECEIVED=1, INPUT_RECORDS_PROCESSED=26, > LAST_EVENT_RECEIVED=1185, NUM_FAILED_SHUFFLE_INPUTS=858965, > NUM_SHUFFLED_INPUTS=26, SHUFFLE_BYTES=4207563, > SHUFFLE_BYTES_DECOMPRESSED=20266603, SHUFFLE_BYTES_DISK_DIRECT=826947, > SHUFFLE_BYTES_TO_DISK=0, SHUFFLE_BYTES_TO_MEM=3380616, > 
SHUFFLE_PHASE_TIME=52516][TaskCounter_Reducer_12_OUTPUT_Map_1 > ADDITIONAL_SPILLS_BYTES_READ=0, ADDITIONAL_SPILLS_BYTES_WRITTEN=0, > ADDITIONAL_SPILL_COUNT=0, OUTPUT_BYTES=779472, OUTPUT_BYTES_PHYSICAL=780146, > OUTPUT_BYTES_WITH_OVERHEAD=779483, OUTPUT_LARGE_RECORDS=0, OUTPUT_RECORDS=1, > SPILLED_RECORDS=0]] > 2018-06-15T17:46:32,271 INFO [TezTR-441963_21_34_3_15_1 ()] > org.apache.tez.runtime.LogicalIOProcessorRuntimeTask: Final Counters for > attempt_152901963_0021_34_03_15_1: Counters: 87 [[File System > Counters FILE_BYTES_READ=0, FILE_BYTES_WRITTEN=0, FILE_READ_OPS=0, > FILE_LARGE_READ_OPS=0, FILE_WRITE_OPS=0, HDFS_BYTES_READ=2344929, > HDFS_BYTES_WRITTEN=0, HDFS_READ_OPS=5, HDFS_LARGE_READ_OPS=0, >
[jira] [Commented] (TEZ-3998) Allow CONCURRENT edge property in DAG construction and introduce ConcurrentSchedulingType
[ https://issues.apache.org/jira/browse/TEZ-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16651706#comment-16651706 ] Kuhu Shukla commented on TEZ-3998: -- I plan to review this in the next few days. Would be nice to get some comments from [~gopalv], [~jlowe] and [~jeagles] among others. > Allow CONCURRENT edge property in DAG construction and introduce > ConcurrentSchedulingType > - > > Key: TEZ-3998 > URL: https://issues.apache.org/jira/browse/TEZ-3998 > Project: Apache Tez > Issue Type: Task >Reporter: Yingda Chen >Assignee: Yingda Chen >Priority: Major > > This is the first task related to TEZ-3997 > > |Note: There is no API change in this proposed change. The majority of this > change will be lifting some existing constraints against CONCURRENT edge > type, and addition of a VertexManagerPlugin implementation.| > > This includes enabling the CONCURRENT SchedulingType as a valid edge > property, by removing all the sanity checks against CONCURRENT during DAG > construction/execution. A new VertexManagerPlugin (namely > VertexManagerWithConcurrentInput) will be implemented for vertex with > incoming concurrent edge(s). > In addition, we will assume in this change that > * A vertex *cannot* have both SEQUENTIAL and CONCURRENT incoming edges > * No shuffle or data movement is handled by Tez framework when two vertices > are connected through a CONCURRENT edge. Instead, runtime should be > responsible for handling all the data-plane communications (as proposed in > [1]). > Note that the above assumptions are common for scenarios such as whole-DAG or > sub-graph gang scheduling, but they may be relaxed in later implementation, > which may allow mixture of SEQUENTIAL and CONCURRENT edges on the same vertex. > > Most of the (meaningful) scheduling decisions today in Tez are made based on > the notion of (or an extended version of) source task completion. This will > no longer be true in the presence of a CONCURRENT edge. 
Instead, events such as > source vertex configured, or source task running will become more relevant > when making scheduling decisions for two vertices connected via a CONCURRENT > edge. We therefore introduce a new enum *ConcurrentSchedulingType* to > describe the “scheduling timing” for the downstream vertex in such scenarios.
> |public enum ConcurrentSchedulingType {
>   /** trigger downstream vertex tasks scheduling by "configured" event of upstream vertices */
>   SOURCE_VERTEX_CONFIGURED,
>   /** trigger downstream vertex tasks scheduling by "running" event of upstream tasks */
>   SOURCE_TASK_STARTED
> }|
> > Note that in this change, we will only use SOURCE_VERTEX_CONFIGURED as the > scheduling type, which suffices for scenarios of whole-DAG or sub-graph > gang-scheduling, where we want (all the tasks in) the downstream vertex to be > scheduled together with (all the tasks) in the upstream vertex. In this case, > we can leverage the existing onVertexStateUpdated() interface of > VertexManagerPlugin to collect relevant information to assist the scheduling > decision, and *there is no additional API change necessary*. However, in more > subtle cases such as the parameter-server example described in Fig. 1, other > scheduling types would be more relevant, therefore the placeholder for > *ConcurrentSchedulingType* will be introduced in this change as part of the > infrastructure work. > > Finally, since we assume that all communications between two vertices > connected via CONCURRENT edge are handled by application runtime, a > CONCURRENT edge will be assigned a DummyEdgeManager that basically mutes all > DME/VME handling. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
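As a rough illustration of how the two scheduling types quoted above could drive a decision, here is a small standalone sketch. Only the enum constants come from the issue text; the wrapper class, the `shouldScheduleDownstream` helper and its parameters are hypothetical, and treating "at least one running source task" as the trigger for SOURCE_TASK_STARTED is an assumption rather than something the proposal specifies.

```java
class ConcurrentSchedulingSketch {
    // Enum constants as quoted in the TEZ-3998 description.
    enum ConcurrentSchedulingType {
        SOURCE_VERTEX_CONFIGURED,
        SOURCE_TASK_STARTED
    }

    // Hypothetical helper: decides whether downstream tasks may be scheduled,
    // given what is currently known about the upstream (source) vertex.
    static boolean shouldScheduleDownstream(ConcurrentSchedulingType type,
                                            boolean sourceVertexConfigured,
                                            int runningSourceTasks) {
        switch (type) {
            case SOURCE_VERTEX_CONFIGURED:
                // Gang-scheduling style: start as soon as the source vertex is configured.
                return sourceVertexConfigured;
            case SOURCE_TASK_STARTED:
                // Assumption: one running source task is enough to trigger scheduling.
                return runningSourceTasks > 0;
            default:
                throw new IllegalArgumentException("Unknown scheduling type: " + type);
        }
    }

    public static void main(String[] args) {
        System.out.println(shouldScheduleDownstream(
            ConcurrentSchedulingType.SOURCE_VERTEX_CONFIGURED, true, 0));
        System.out.println(shouldScheduleDownstream(
            ConcurrentSchedulingType.SOURCE_TASK_STARTED, true, 0));
    }
}
```

In the first call the downstream vertex can be scheduled before any source task runs, which is the whole-DAG gang-scheduling case the description focuses on; in the second it has to wait for a source task to start.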
[jira] [Commented] (TEZ-3961) Tez UI web.xml tries to reach out to java.sun.com for validation after moving to jetty-9
[ https://issues.apache.org/jira/browse/TEZ-3961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16645502#comment-16645502 ] Kuhu Shukla commented on TEZ-3961: -- [~jeagles], request for review. Thanks a lot! > Tez UI web.xml tries to reach out to java.sun.com for validation after moving > to jetty-9 > > > Key: TEZ-3961 > URL: https://issues.apache.org/jira/browse/TEZ-3961 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.9.1 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Major > Attachments: TEZ-3961.001.patch, TEZ-3961.002.patch > > > Tez UI can throw a 503 error when hosted on a server that cannot reach public > IPs like java.sun.com which are listed as servers for DTDs in web.xml. This > behavior change comes from moving to jetty 9 (Tez and Hadoop 3.0) which > removed provided schemas that were being shipped with earlier versions. It is > suboptimal even in cases where public IPs are accessible to fetch the DTD for > a very very simple web.xml file. We can choose to either remove the DTD > validation or add dependency explicitly to org.eclipse.jetty.toolchain » > jetty-osgi-servlet-api to allow for this jetty change to not affect the > behavior of tez-ui. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TEZ-3990) The number of shuffle penalties for a host/inputAttemptIdentifier should be capped
[ https://issues.apache.org/jira/browse/TEZ-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16645452#comment-16645452 ] Kuhu Shukla commented on TEZ-3990: -- Opened https://issues.apache.org/jira/browse/TEZ-4005 for unordered feature addition for penalties. > The number of shuffle penalties for a host/inputAttemptIdentifier should be > capped > -- > > Key: TEZ-3990 > URL: https://issues.apache.org/jira/browse/TEZ-3990 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.9.1, 0.10.0 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Major > Attachments: TEZ-3990.001.patch, TEZ-3990.002.patch, > TEZ-3990.003.patch, TEZ-3990.004.patch, TEZ-3990.005.patch, TEZ-3990.006.patch > > > In a scenario where the same mapId fetches fail, the penalty code allows > adding the same Host/InputAttemptIdentifier over and over with revised > penalty time that grows exponentially. It should at some point drop the > retrying and report failure to the AM asap to allow the job to rectify the > upstream output. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (TEZ-4005) Add Host-Source input penalties to Unordered Shuffle
Kuhu Shukla created TEZ-4005: Summary: Add Host-Source input penalties to Unordered Shuffle Key: TEZ-4005 URL: https://issues.apache.org/jira/browse/TEZ-4005 Project: Apache Tez Issue Type: Task Affects Versions: 0.9.1 Reporter: Kuhu Shukla Ordered shuffle has a mechanism to penalize hosts and try exponential waits for retrying. Unordered case is missing this feature. Would be really useful to add this and make shuffle policies consistent. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (TEZ-3961) Tez UI web.xml tries to reach out to java.sun.com for validation after moving to jetty-9
[ https://issues.apache.org/jira/browse/TEZ-3961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated TEZ-3961: - Attachment: TEZ-3961.002.patch > Tez UI web.xml tries to reach out to java.sun.com for validation after moving > to jetty-9 > > > Key: TEZ-3961 > URL: https://issues.apache.org/jira/browse/TEZ-3961 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.9.1 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Major > Attachments: TEZ-3961.001.patch, TEZ-3961.002.patch > > > Tez UI can throw a 503 error when hosted on a server that cannot reach public > IPs like java.sun.com which are listed as servers for DTDs in web.xml. This > behavior change comes from moving to jetty 9 (Tez and Hadoop 3.0) which > removed provided schemas that were being shipped with earlier versions. It is > suboptimal even in cases where public IPs are accessible to fetch the DTD for > a very very simple web.xml file. We can choose to either remove the DTD > validation or add dependency explicitly to org.eclipse.jetty.toolchain » > jetty-osgi-servlet-api to allow for this jetty change to not affect the > behavior of tez-ui. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TEZ-3990) The number of shuffle penalties for a host/inputAttemptIdentifier should be capped
[ https://issues.apache.org/jira/browse/TEZ-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16645421#comment-16645421 ] Kuhu Shukla commented on TEZ-3990: -- The build seems to be having a protoc version issue. > The number of shuffle penalties for a host/inputAttemptIdentifier should be > capped > -- > > Key: TEZ-3990 > URL: https://issues.apache.org/jira/browse/TEZ-3990 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.9.1, 0.10.0 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Major > Attachments: TEZ-3990.001.patch, TEZ-3990.002.patch, > TEZ-3990.003.patch, TEZ-3990.004.patch, TEZ-3990.005.patch, TEZ-3990.006.patch > > > In a scenario where the same mapId fetches fail, the penalty code allows > adding the same Host/InputAttemptIdentifier over and over with revised > penalty time that grows exponentially. It should at some point drop the > retrying and report failure to the AM asap to allow the job to rectify the > upstream output. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TEZ-4004) Update jetty9 to align with Hadoop and Hive
[ https://issues.apache.org/jira/browse/TEZ-4004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16645415#comment-16645415 ] Kuhu Shukla commented on TEZ-4004: -- HADOOP-15815 seems to be moving to jetty 9.3.25 (9.3.25.v20180904), although anything from 9.3.24 (and up?) should be fine according to [~kihwal]. I see some new methods in 9.3.25 compared to 9.3.24. > Update jetty9 to align with Hadoop and Hive > --- > > Key: TEZ-4004 > URL: https://issues.apache.org/jira/browse/TEZ-4004 > Project: Apache Tez > Issue Type: Bug >Reporter: Jonathan Eagles >Assignee: Jonathan Eagles >Priority: Major > Attachments: TEZ-4004.001.patch > > > https://abi-laboratory.pro/index.php?view=timeline=java=jetty > https://issues.apache.org/jira/browse/HADOOP-15815 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (TEZ-3990) The number of shuffle penalties for a host/inputAttemptIdentifier should be capped
[ https://issues.apache.org/jira/browse/TEZ-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated TEZ-3990: - Attachment: TEZ-3990.006.patch > The number of shuffle penalties for a host/inputAttemptIdentifier should be > capped > -- > > Key: TEZ-3990 > URL: https://issues.apache.org/jira/browse/TEZ-3990 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.9.1, 0.10.0 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Major > Attachments: TEZ-3990.001.patch, TEZ-3990.002.patch, > TEZ-3990.003.patch, TEZ-3990.004.patch, TEZ-3990.005.patch, TEZ-3990.006.patch > > > In a scenario where the same mapId fetches fail, the penalty code allows > adding the same Host/InputAttemptIdentifier over and over with revised > penalty time that grows exponentially. It should at some point drop the > retrying and report failure to the AM asap to allow the job to rectify the > upstream output. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (TEZ-3990) The number of shuffle penalties for a host/inputAttemptIdentifier should be capped
[ https://issues.apache.org/jira/browse/TEZ-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated TEZ-3990: - Attachment: (was: TEZ-3990.006..patch) > The number of shuffle penalties for a host/inputAttemptIdentifier should be > capped > -- > > Key: TEZ-3990 > URL: https://issues.apache.org/jira/browse/TEZ-3990 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.9.1, 0.10.0 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Major > Attachments: TEZ-3990.001.patch, TEZ-3990.002.patch, > TEZ-3990.003.patch, TEZ-3990.004.patch, TEZ-3990.005.patch > > > In a scenario where the same mapId fetches fail, the penalty code allows > adding the same Host/InputAttemptIdentifier over and over with revised > penalty time that grows exponentially. It should at some point drop the > retrying and report failure to the AM asap to allow the job to rectify the > upstream output. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TEZ-3990) The number of shuffle penalties for a host/inputAttemptIdentifier should be capped
[ https://issues.apache.org/jira/browse/TEZ-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16645254#comment-16645254 ] Kuhu Shukla commented on TEZ-3990: -- Missed a variable name change. d'oh. > The number of shuffle penalties for a host/inputAttemptIdentifier should be > capped > -- > > Key: TEZ-3990 > URL: https://issues.apache.org/jira/browse/TEZ-3990 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.9.1, 0.10.0 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Major > Attachments: TEZ-3990.001.patch, TEZ-3990.002.patch, > TEZ-3990.003.patch, TEZ-3990.004.patch, TEZ-3990.005.patch, > TEZ-3990.006..patch > > > In a scenario where the same mapId fetches fail, the penalty code allows > adding the same Host/InputAttemptIdentifier over and over with revised > penalty time that grows exponentially. It should at some point drop the > retrying and report failure to the AM asap to allow the job to rectify the > upstream output. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (TEZ-3990) The number of shuffle penalties for a host/inputAttemptIdentifier should be capped
[ https://issues.apache.org/jira/browse/TEZ-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated TEZ-3990: - Attachment: TEZ-3990.006..patch > The number of shuffle penalties for a host/inputAttemptIdentifier should be > capped > -- > > Key: TEZ-3990 > URL: https://issues.apache.org/jira/browse/TEZ-3990 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.9.1, 0.10.0 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Major > Attachments: TEZ-3990.001.patch, TEZ-3990.002.patch, > TEZ-3990.003.patch, TEZ-3990.004.patch, TEZ-3990.005.patch, > TEZ-3990.006..patch > > > In a scenario where the same mapId fetches fail, the penalty code allows > adding the same Host/InputAttemptIdentifier over and over with revised > penalty time that grows exponentially. It should at some point drop the > retrying and report failure to the AM asap to allow the job to rectify the > upstream output. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (TEZ-3961) Tez UI web.xml tries to reach out to java.sun.com for validation after moving to jetty-9
[ https://issues.apache.org/jira/browse/TEZ-3961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16645213#comment-16645213 ] Kuhu Shukla edited comment on TEZ-3961 at 10/10/18 4:25 PM: Thank you for the ping [~jeagles]. This patch is simple but needs review on whether it is ok to remove the DOCTYPE declaration. was (Author: kshukla): Thank you for the ping [~jeagles]. This patch is simple but needs review on whether it is ok to emovethe DOCTYPE declaration. > Tez UI web.xml tries to reach out to java.sun.com for validation after moving > to jetty-9 > > > Key: TEZ-3961 > URL: https://issues.apache.org/jira/browse/TEZ-3961 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.9.1 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Major > Attachments: TEZ-3961.001.patch > > > Tez UI can throw a 503 error when hosted on a server that cannot reach public > IPs like java.sun.com which are listed as servers for DTDs in web.xml. This > behavior change comes from moving to jetty 9 (Tez and Hadoop 3.0) which > removed provided schemas that were being shipped with earlier versions. It is > suboptimal even in cases where public IPs are accessible to fetch the DTD for > a very very simple web.xml file. We can choose to either remove the DTD > validation or add dependency explicitly to org.eclipse.jetty.toolchain » > jetty-osgi-servlet-api to allow for this jetty change to not affect the > behavior of tez-ui. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (TEZ-3961) Tez UI web.xml tries to reach out to java.sun.com for validation after moving to jetty-9
[ https://issues.apache.org/jira/browse/TEZ-3961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated TEZ-3961: - Attachment: TEZ-3961.001.patch > Tez UI web.xml tries to reach out to java.sun.com for validation after moving > to jetty-9 > > > Key: TEZ-3961 > URL: https://issues.apache.org/jira/browse/TEZ-3961 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.9.1 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Major > Attachments: TEZ-3961.001.patch > > > Tez UI can throw a 503 error when hosted on a server that cannot reach public > IPs like java.sun.com which are listed as servers for DTDs in web.xml. This > behavior change comes from moving to jetty 9 (Tez and Hadoop 3.0) which > removed provided schemas that were being shipped with earlier versions. It is > suboptimal even in cases where public IPs are accessible to fetch the DTD for > a very very simple web.xml file. We can choose to either remove the DTD > validation or add dependency explicitly to org.eclipse.jetty.toolchain » > jetty-osgi-servlet-api to allow for this jetty change to not affect the > behavior of tez-ui. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (TEZ-3990) The number of shuffle penalties for a host/inputAttemptIdentifier should be capped
[ https://issues.apache.org/jira/browse/TEZ-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated TEZ-3990: - Attachment: TEZ-3990.005.patch > The number of shuffle penalties for a host/inputAttemptIdentifier should be > capped > -- > > Key: TEZ-3990 > URL: https://issues.apache.org/jira/browse/TEZ-3990 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.9.1, 0.10.0 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Major > Attachments: TEZ-3990.001.patch, TEZ-3990.002.patch, > TEZ-3990.003.patch, TEZ-3990.004.patch, TEZ-3990.005.patch > > > In a scenario where the same mapId fetches fail, the penalty code allows > adding the same Host/InputAttemptIdentifier over and over with revised > penalty time that grows exponentially. It should at some point drop the > retrying and report failure to the AM asap to allow the job to rectify the > upstream output. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (TEZ-3990) The number of shuffle penalties for a host/inputAttemptIdentifier should be capped
[ https://issues.apache.org/jira/browse/TEZ-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16642016#comment-16642016 ] Kuhu Shukla edited comment on TEZ-3990 at 10/8/18 3:32 PM: --- Addressed comments by [~jeagles]. Agreed on the issues mentioned with delay calculation and testability. [~jeagles], should I go ahead and create JIRAs for these issues? P.S. The unordered case doesn't seem to have the concept of penalties fyi.. which is odd.. was (Author: kshukla): Addressed comments by [~jeagles]. Agreed on the issues mentioned with delay calculation and testability. [~jeagles], should I go ahead and create JIRAs for these issues? > The number of shuffle penalties for a host/inputAttemptIdentifier should be > capped > -- > > Key: TEZ-3990 > URL: https://issues.apache.org/jira/browse/TEZ-3990 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.9.1, 0.10.0 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Major > Attachments: TEZ-3990.001.patch, TEZ-3990.002.patch, > TEZ-3990.003.patch, TEZ-3990.004.patch > > > In a scenario where the same mapId fetches fail, the penalty code allows > adding the same Host/InputAttemptIdentifier over and over with revised > penalty time that grows exponentially. It should at some point drop the > retrying and report failure to the AM asap to allow the job to rectify the > upstream output. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TEZ-3990) The number of shuffle penalties for a host/inputAttemptIdentifier should be capped
[ https://issues.apache.org/jira/browse/TEZ-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16642016#comment-16642016 ] Kuhu Shukla commented on TEZ-3990: -- Addressed comments by [~jeagles]. Agreed on the issues mentioned with delay calculation and testability. [~jeagles], should I go ahead and create JIRAs for these issues? > The number of shuffle penalties for a host/inputAttemptIdentifier should be > capped > -- > > Key: TEZ-3990 > URL: https://issues.apache.org/jira/browse/TEZ-3990 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.9.1, 0.10.0 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Major > Attachments: TEZ-3990.001.patch, TEZ-3990.002.patch, > TEZ-3990.003.patch, TEZ-3990.004.patch > > > In a scenario where the same mapId fetches fail, the penalty code allows > adding the same Host/InputAttemptIdentifier over and over with revised > penalty time that grows exponentially. It should at some point drop the > retrying and report failure to the AM asap to allow the job to rectify the > upstream output. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (TEZ-3990) The number of shuffle penalties for a host/inputAttemptIdentifier should be capped
[ https://issues.apache.org/jira/browse/TEZ-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated TEZ-3990: - Attachment: TEZ-3990.004.patch > The number of shuffle penalties for a host/inputAttemptIdentifier should be > capped > -- > > Key: TEZ-3990 > URL: https://issues.apache.org/jira/browse/TEZ-3990 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.9.1, 0.10.0 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Major > Attachments: TEZ-3990.001.patch, TEZ-3990.002.patch, > TEZ-3990.003.patch, TEZ-3990.004.patch > > > In a scenario where the same mapId fetches fail, the penalty code allows > adding the same Host/InputAttemptIdentifier over and over with revised > penalty time that grows exponentially. It should at some point drop the > retrying and report failure to the AM asap to allow the job to rectify the > upstream output. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (TEZ-4002) CHANGES.txt for 0.9.2 Release
[ https://issues.apache.org/jira/browse/TEZ-4002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated TEZ-4002: - Attachment: TEZ-4002.001.patch > CHANGES.txt for 0.9.2 Release > - > > Key: TEZ-4002 > URL: https://issues.apache.org/jira/browse/TEZ-4002 > Project: Apache Tez > Issue Type: Task >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Major > Attachments: TEZ-4002.001.patch > > > Add CHANGES.txt for 0.9.2 line. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (TEZ-4002) CHANGES.txt for 0.9.2 Release
Kuhu Shukla created TEZ-4002: Summary: CHANGES.txt for 0.9.2 Release Key: TEZ-4002 URL: https://issues.apache.org/jira/browse/TEZ-4002 Project: Apache Tez Issue Type: Task Reporter: Kuhu Shukla Assignee: Kuhu Shukla Add CHANGES.txt for 0.9.2 line. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TEZ-3982) DAGAppMaster and tasks should not report negative or invalid progress
[ https://issues.apache.org/jira/browse/TEZ-3982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16623653#comment-16623653 ] Kuhu Shukla commented on TEZ-3982: -- Sorry about that. Changed the clock to SystemClock, which is not deprecated in Hadoop 2.8. [~jlowe], thank you for catching this. Requesting review. > DAGAppMaster and tasks should not report negative or invalid progress > - > > Key: TEZ-3982 > URL: https://issues.apache.org/jira/browse/TEZ-3982 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.9.1, 0.10.0 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Major > Fix For: 0.10.1 > > Attachments: TEZ-3982.001.patch, TEZ-3982.002.patch, > TEZ-3982.003.patch, TEZ-3982.004.patch, TEZ-3982.005.branch-0.9.patch > > > AM fails (AMRMClient expects non negative progress) if any component reports > invalid or -ve progress, DagAppMaster/Tasks should check and report > accordingly to allow the AM to execute. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (TEZ-3982) DAGAppMaster and tasks should not report negative or invalid progress
[ https://issues.apache.org/jira/browse/TEZ-3982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated TEZ-3982: - Attachment: TEZ-3982.005.branch-0.9.patch > DAGAppMaster and tasks should not report negative or invalid progress > - > > Key: TEZ-3982 > URL: https://issues.apache.org/jira/browse/TEZ-3982 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.9.1, 0.10.0 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Major > Fix For: 0.10.1 > > Attachments: TEZ-3982.001.patch, TEZ-3982.002.patch, > TEZ-3982.003.patch, TEZ-3982.004.patch, TEZ-3982.005.branch-0.9.patch > > > AM fails (AMRMClient expects non negative progress) if any component reports > invalid or -ve progress, DagAppMaster/Tasks should check and report > accordingly to allow the AM to execute. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TEZ-3982) DAGAppMaster and tasks should not report negative or invalid progress
[ https://issues.apache.org/jira/browse/TEZ-3982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16622161#comment-16622161 ] Kuhu Shukla commented on TEZ-3982: -- [~jeagles] if the total vertices are 0, do we want to report NaN or 0 progress? That would help decide if the code mentioned above is required. Appreciate the inputs. > DAGAppMaster and tasks should not report negative or invalid progress > - > > Key: TEZ-3982 > URL: https://issues.apache.org/jira/browse/TEZ-3982 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.9.1, 0.10.0 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Major > Attachments: TEZ-3982.001.patch, TEZ-3982.002.patch, > TEZ-3982.003.patch, TEZ-3982.004.patch > > > AM fails (AMRMClient expects non negative progress) if any component reports > invalid or -ve progress, DagAppMaster/Tasks should check and report > accordingly to allow the AM to execute. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
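The zero-vertex question in the comment above can be made concrete with a small Java sketch. It illustrates only one possible answer (report 0 rather than NaN) and is not the code from the attached patches; the names `ProgressSketch`, `sanitize` and `dagProgress` are hypothetical. In Java, `0f / 0f` evaluates to NaN, so the zero-vertex case needs either an explicit guard or a check after the division.

```java
// Hypothetical sketch, not taken from the TEZ-3982 patches.
final class ProgressSketch {

    /** Maps NaN and negative values to 0 and caps anything above 1 at 1. */
    static float sanitize(float progress) {
        if (Float.isNaN(progress) || progress < 0f) {
            return 0f;
        }
        return Math.min(progress, 1f);
    }

    /** Averages per-vertex progress; an empty DAG reports 0 instead of NaN. */
    static float dagProgress(float[] vertexProgress) {
        if (vertexProgress.length == 0) {
            return 0f; // guard: 0f / 0 would otherwise yield NaN
        }
        float sum = 0f;
        for (float p : vertexProgress) {
            sum += sanitize(p);
        }
        // Check again after the division, as a last line of defence.
        return sanitize(sum / vertexProgress.length);
    }
}
```

With this shape, a caller that forwards `dagProgress(...)` to the resource manager never passes a negative or NaN value, whichever component produced the bad number.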
[jira] [Resolved] (TEZ-3198) Shuffle failures for the trailing task in a vertex are often fatal to the entire DAG
[ https://issues.apache.org/jira/browse/TEZ-3198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla resolved TEZ-3198. -- Resolution: Duplicate Yes. It will certainly allow the AM to retry the attempt sooner. > Shuffle failures for the trailing task in a vertex are often fatal to the > entire DAG > > > Key: TEZ-3198 > URL: https://issues.apache.org/jira/browse/TEZ-3198 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.7.0, 0.8.2 >Reporter: Jason Lowe >Priority: Critical > > I've seen an increasing number of cases where a single-node failure caused > the whole Tez DAG to fail. These scenarios are common in that they involve > the last task of a vertex attempting to complete a shuffle where all the peer > tasks have already finished shuffling. The last task's attempt encounters > errors shuffling one of its inputs and keeps reporting it to the AM. > Eventually the attempt decides it must be the cause of the shuffle error and > fails. The subsequent attempts all do the same thing, and eventually we hit > the task max attempts limit and fail the vertex and DAG. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TEZ-3982) DAGAppMaster and tasks should not report negative or invalid progress
[ https://issues.apache.org/jira/browse/TEZ-3982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16621223#comment-16621223 ] Kuhu Shukla commented on TEZ-3982: -- Thanks [~jeagles], addressed the comment and kept the similar check after the division by totalVertices is done. > DAGAppMaster and tasks should not report negative or invalid progress > - > > Key: TEZ-3982 > URL: https://issues.apache.org/jira/browse/TEZ-3982 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.9.1, 0.10.0 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Major > Attachments: TEZ-3982.001.patch, TEZ-3982.002.patch, > TEZ-3982.003.patch, TEZ-3982.004.patch > > > AM fails (AMRMClient expects non negative progress) if any component reports > invalid or -ve progress, DagAppMaster/Tasks should check and report > accordingly to allow the AM to execute. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (TEZ-3982) DAGAppMaster and tasks should not report negative or invalid progress
[ https://issues.apache.org/jira/browse/TEZ-3982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated TEZ-3982: - Attachment: TEZ-3982.004.patch > DAGAppMaster and tasks should not report negative or invalid progress > - > > Key: TEZ-3982 > URL: https://issues.apache.org/jira/browse/TEZ-3982 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.9.1, 0.10.0 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Major > Attachments: TEZ-3982.001.patch, TEZ-3982.002.patch, > TEZ-3982.003.patch, TEZ-3982.004.patch > > > AM fails (AMRMClient expects non negative progress) if any component reports > invalid or -ve progress, DagAppMaster/Tasks should check and report > accordingly to allow the AM to execute. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TEZ-3972) Tez DAG can hang when a single task fails to fetch
[ https://issues.apache.org/jira/browse/TEZ-3972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16620018#comment-16620018 ] Kuhu Shukla commented on TEZ-3972: -- Thanks [~jeagles]! > Tez DAG can hang when a single task fails to fetch > -- > > Key: TEZ-3972 > URL: https://issues.apache.org/jira/browse/TEZ-3972 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.9.1 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Major > Fix For: 0.9.2, 0.10.1 > > Attachments: TEZ-3972.001.patch, TEZ-3972.002.patch, > TEZ-3972.003.patch > > > Description of the hung DAG: > A DAG with 2 vertices. {{Map}} Vertex has 22k maps, downstream vertex > {{Reduce}} has 1009 tasks. All tasks succeed but one, which hangs. This one > task (attempt) is doing a local fetch from a node that (now) has a bad disk. > It fails to fetch and reports to the AM for the offending input attempt > identifiers. However the AM does not schedule a re-run as > {{uniquefailedOutputReports}} size is 1 (since only this task attempt failed > to fetch) and failure fraction is not met. The denominator for this fraction > is the total number of tasks. That causes the re-run to never occur. This > JIRA tracks the AM side of the change to alleviate this problem. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
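The arithmetic behind the hang described above can be sketched in Java. The counts (1 failed report, 1009 reduce tasks) come from the description; the threshold of 0.25 and the names `FailureFractionSketch` and `shouldRerun` are hypothetical and used only for illustration. The second denominator in the usage note is a thought experiment, not a statement of what the patch does.

```java
// Hypothetical sketch of a failure-fraction check; the threshold is an assumption.
final class FailureFractionSketch {

    /**
     * Returns true when the share of tasks reporting a fetch failure
     * reaches the threshold. denominatorTasks is whatever task count
     * the caller chooses to divide by.
     */
    static boolean shouldRerun(int uniqueFailedReports, int denominatorTasks, double threshold) {
        if (denominatorTasks <= 0) {
            return false; // nothing to compare against
        }
        double fraction = (double) uniqueFailedReports / denominatorTasks;
        return fraction >= threshold;
    }
}
```

With all 1009 tasks in the denominator, `shouldRerun(1, 1009, 0.25)` is false because 1/1009 is roughly 0.001, and it stays false forever since no other task will ever report. If only the single unfinished task were counted, `shouldRerun(1, 1, 0.25)` would be true.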
[jira] [Updated] (TEZ-3972) Tez DAG can hang when a single task fails to fetch
[ https://issues.apache.org/jira/browse/TEZ-3972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated TEZ-3972: - Attachment: TEZ-3972.003.patch > Tez DAG can hang when a single task fails to fetch > -- > > Key: TEZ-3972 > URL: https://issues.apache.org/jira/browse/TEZ-3972 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.9.1 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Major > Attachments: TEZ-3972.001.patch, TEZ-3972.002.patch, > TEZ-3972.003.patch > > > Description of the hung DAG: > A DAG with 2 vertices. {{Map}} Vertex has 22k maps, downstream vertex > {{Reduce}} has 1009 tasks. All tasks succeed but one, which hangs. This one > task (attempt) is doing a local fetch from a node that (now) has a bad disk. > It fails to fetch and reports to the AM for the offending input attempt > identifiers. However the AM does not schedule a re-run as > {{uniquefailedOutputReports}} size is 1 (since only this task attempt failed > to fetch) and failure fraction is not met. The denominator for this fraction > is the total number of tasks. That causes the re-run to never occur. This > JIRA tracks the AM side of the change to alleviate this problem. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TEZ-3990) The number of shuffle penalties for a host/inputAttemptIdentifier should be capped
[ https://issues.apache.org/jira/browse/TEZ-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617601#comment-16617601 ] Kuhu Shukla commented on TEZ-3990: -- Minor change to add the new config to runtime keys that are expected. Fixes the unit test failure. > The number of shuffle penalties for a host/inputAttemptIdentifier should be > capped > -- > > Key: TEZ-3990 > URL: https://issues.apache.org/jira/browse/TEZ-3990 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.9.1, 0.10.0 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Major > Attachments: TEZ-3990.001.patch, TEZ-3990.002.patch, > TEZ-3990.003.patch > > > In a scenario where the same mapId fetches fail, the penalty code allows > adding the same Host/InputAttemptIdentifier over and over with revised > penalty time that grows exponentially. It should at some point drop the > retrying and report failure to the AM asap to allow the job to rectify the > upstream output. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (TEZ-3990) The number of shuffle penalties for a host/inputAttemptIdentifier should be capped
[ https://issues.apache.org/jira/browse/TEZ-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated TEZ-3990: - Attachment: TEZ-3990.003.patch > The number of shuffle penalties for a host/inputAttemptIdentifier should be > capped > -- > > Key: TEZ-3990 > URL: https://issues.apache.org/jira/browse/TEZ-3990 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.9.1, 0.10.0 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Major > Attachments: TEZ-3990.001.patch, TEZ-3990.002.patch, > TEZ-3990.003.patch > > > In a scenario where the same mapId fetches fail, the penalty code allows > adding the same Host/InputAttemptIdentifier over and over with revised > penalty time that grows exponentially. It should at some point drop the > retrying and report failure to the AM asap to allow the job to rectify the > upstream output. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TEZ-3990) The number of shuffle penalties for a host/inputAttemptIdentifier should be capped
[ https://issues.apache.org/jira/browse/TEZ-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16615349#comment-16615349 ] Kuhu Shukla commented on TEZ-3990: -- Updated patch with penalty cap based on a configured max time in milliseconds. Adds an entry for every failure occurrence to allow retries at some point by the AM if threshold is reached. > The number of shuffle penalties for a host/inputAttemptIdentifier should be > capped > -- > > Key: TEZ-3990 > URL: https://issues.apache.org/jira/browse/TEZ-3990 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.9.1, 0.10.0 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Major > Attachments: TEZ-3990.001.patch, TEZ-3990.002.patch > > > In a scenario where the same mapId fetches fail, the penalty code allows > adding the same Host/InputAttemptIdentifier over and over with revised > penalty time that grows exponentially. It should at some point drop the > retrying and report failure to the AM asap to allow the job to rectify the > upstream output. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (TEZ-3990) The number of shuffle penalties for a host/inputAttemptIdentifier should be capped
[ https://issues.apache.org/jira/browse/TEZ-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated TEZ-3990: - Attachment: TEZ-3990.002.patch > The number of shuffle penalties for a host/inputAttemptIdentifier should be > capped > -- > > Key: TEZ-3990 > URL: https://issues.apache.org/jira/browse/TEZ-3990 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.9.1, 0.10.0 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Major > Attachments: TEZ-3990.001.patch, TEZ-3990.002.patch > > > In a scenario where the same mapId fetches fail, the penalty code allows > adding the same Host/InputAttemptIdentifier over and over with revised > penalty time that grows exponentially. It should at some point drop the > retrying and report failure to the AM asap to allow the job to rectify the > upstream output. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TEZ-3990) The number of shuffle penalties for a host/inputAttemptIdentifier should be capped
[ https://issues.apache.org/jira/browse/TEZ-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16614003#comment-16614003 ] Kuhu Shukla commented on TEZ-3990: -- Hmm, so after some more offline discussion [~jeagles], this patch won't fully address the issue specific to penalties, and it would cap the signaling to the AM, which is not what we want. To clarify, it is important not to allow indefinite exponential growth of the penalty delay. Unbounded growth spaces the signals to the AM farther and farther apart, which makes it harder for the upstream to re-run and increases the overall runtime of the downstream as well. What we can do instead is either (a) cap the delay at the calculated value and then start over with a factor of one, to allow aggressive signaling, or (b) hold the delay at the cap for all future occurrences, which helps debugging and gives a constant delay after one window of exponential growth per MapHost. I'd appreciate more comments, and I will post a revised patch (hopefully with a functional capping mechanism) soon. > The number of shuffle penalties for a host/inputAttemptIdentifier should be > capped > -- > > Key: TEZ-3990 > URL: https://issues.apache.org/jira/browse/TEZ-3990 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.9.1, 0.10.0 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Major > Attachments: TEZ-3990.001.patch > > > In a scenario where the same mapId fetches fail, the penalty code allows > adding the same Host/InputAttemptIdentifier over and over with revised > penalty time that grows exponentially. It should at some point drop the > retrying and report failure to the AM asap to allow the job to rectify the > upstream output. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
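The two capping options weighed in the comment above can be compared with a short Java sketch. It is a minimal illustration under stated assumptions, not the Tez shuffle scheduler code: the doubling factor, the millisecond parameters and the names `PenaltyDelaySketch`, `CapPolicy`, `windowLength` and `delayMs` are all hypothetical.

```java
// Hypothetical sketch of a capped exponential penalty delay.
final class PenaltyDelaySketch {

    enum CapPolicy {
        HOLD_AT_CAP,     // stay at the cap for all later failures
        RESTART_WINDOW   // after reaching the cap, start over at the base delay
    }

    /** Doubles d once, saturating at maxMs without overflowing. */
    private static long doubleCapped(long d, long maxMs) {
        return (d > maxMs / 2) ? maxMs : d * 2;
    }

    /** Number of failures it takes to go from baseMs up to the cap. */
    static int windowLength(long baseMs, long maxMs) {
        int n = 1;
        for (long d = baseMs; d < maxMs; d = doubleCapped(d, maxMs)) {
            n++;
        }
        return n;
    }

    /** Delay to apply after the given number of consecutive failures (1-based). */
    static long delayMs(int failures, long baseMs, long maxMs, CapPolicy policy) {
        if (failures < 1 || baseMs <= 0 || maxMs < baseMs) {
            throw new IllegalArgumentException("need failures >= 1 and 0 < baseMs <= maxMs");
        }
        int steps = failures - 1;
        if (policy == CapPolicy.RESTART_WINDOW) {
            steps %= windowLength(baseMs, maxMs); // wrap around after one window
        }
        long d = baseMs;
        for (int i = 0; i < steps && d < maxMs; i++) {
            d = doubleCapped(d, maxMs);
        }
        return d;
    }
}
```

With a base of 100 ms and a cap of 1000 ms, both policies produce 100, 200, 400, 800, 1000 for the first five failures. On the sixth failure `HOLD_AT_CAP` stays at 1000 ms while `RESTART_WINDOW` drops back to 100 ms, which is the trade-off between a constant, easy-to-debug delay and more aggressive signaling.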
[jira] [Commented] (TEZ-3990) The number of shuffle penalties for a host/inputAttemptIdentifier should be capped
[ https://issues.apache.org/jira/browse/TEZ-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613568#comment-16613568 ] Kuhu Shukla commented on TEZ-3990: -- [~jeagles] thoughts? Thanks a lot! > The number of shuffle penalties for a host/inputAttemptIdentifier should be > capped > -- > > Key: TEZ-3990 > URL: https://issues.apache.org/jira/browse/TEZ-3990 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.9.1, 0.10.0 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Major > Attachments: TEZ-3990.001.patch > > > In a scenario where the same mapId fetches fail, the penalty code allows > adding the same Host/InputAttemptIdentifier over and over with revised > penalty time that grows exponentially. It should at some point drop the > retrying and report failure to the AM asap to allow the job to rectify the > upstream output. -- This message was sent by Atlassian JIRA (v7.6.3#76005)