[jira] [Resolved] (TEZ-3982) DAGAppMaster and tasks should not report negative or invalid progress
[ https://issues.apache.org/jira/browse/TEZ-3982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe resolved TEZ-3982. - Resolution: Fixed Hadoop Flags: Reviewed Fix Version/s: 0.9.2 Thanks, [~kshukla]! I committed this to branch-0.9. > DAGAppMaster and tasks should not report negative or invalid progress > - > > Key: TEZ-3982 > URL: https://issues.apache.org/jira/browse/TEZ-3982 > Project: Apache Tez > Issue Type: Bug > Affects Versions: 0.9.1, 0.10.0 > Reporter: Kuhu Shukla > Assignee: Kuhu Shukla > Priority: Major > Fix For: 0.9.2, 0.10.1 > > Attachments: TEZ-3982.001.patch, TEZ-3982.002.patch, TEZ-3982.003.patch, TEZ-3982.004.patch, TEZ-3982.005.branch-0.9.patch > > > The AM fails (AMRMClient expects non-negative progress) if any component reports invalid or negative progress; the DAGAppMaster and tasks should check and report accordingly to allow the AM to execute. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
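The fix described above amounts to sanitizing progress before the AM hands it to the RM. A minimal sketch of that idea, assuming a defensive clamp into the [0, 1] range (class and method names here are hypothetical, not the actual TEZ-3982 patch):

```java
// Hypothetical sketch of defensive progress clamping for TEZ-3982-style
// failures: AMRMClient rejects negative progress, so any component-reported
// value is sanitized into [0, 1] before being reported.
public class ProgressUtil {
    /** Clamp a reported progress value into the valid [0, 1] range. */
    public static float sanitizeProgress(float progress) {
        if (Float.isNaN(progress) || progress < 0.0f) {
            return 0.0f;  // invalid or negative -> treat as no progress
        }
        if (progress > 1.0f) {
            return 1.0f;  // overshoot -> treat as complete
        }
        return progress;
    }
}
```

Applying the clamp at the single point where progress leaves the AM keeps every upstream component free to misbehave without failing the application.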
[jira] [Reopened] (TEZ-3982) DAGAppMaster and tasks should not report negative or invalid progress
[ https://issues.apache.org/jira/browse/TEZ-3982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe reopened TEZ-3982: - This broke the branch-0.9 build. It looks like MonotonicClock isn't in the version of Hadoop that branch-0.9 depends upon:
{noformat}
[ERROR] /tez/tez-dag/src/test/java/org/apache/tez/dag/app/TestDAGAppMaster.java:[17,35] cannot find symbol
  symbol:   class MonotonicClock
  location: package org.apache.hadoop.yarn.util
[ERROR] /tez/tez-dag/src/test/java/org/apache/tez/dag/app/TestDAGAppMaster.java:[431,32] cannot find symbol
  symbol:   class MonotonicClock
  location: class org.apache.tez.dag.app.TestDAGAppMaster
[INFO] 2 errors
{noformat}
I reverted this from branch-0.9 to fix the build. > DAGAppMaster and tasks should not report negative or invalid progress > - > > Key: TEZ-3982 > URL: https://issues.apache.org/jira/browse/TEZ-3982 > Project: Apache Tez > Issue Type: Bug > Affects Versions: 0.9.1, 0.10.0 > Reporter: Kuhu Shukla > Assignee: Kuhu Shukla > Priority: Major > Fix For: 0.10.1 > > Attachments: TEZ-3982.001.patch, TEZ-3982.002.patch, TEZ-3982.003.patch, TEZ-3982.004.patch > > > The AM fails (AMRMClient expects non-negative progress) if any component reports invalid or negative progress; the DAGAppMaster and tasks should check and report accordingly to allow the AM to execute. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (TEZ-3989) Fix by-laws related to emeritus clause
[ https://issues.apache.org/jira/browse/TEZ-3989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe resolved TEZ-3989. - Resolution: Fixed Hadoop Flags: Reviewed Fix Version/s: 0.10.1 Thanks, [~hitesh]! I committed this to master. > Fix by-laws related to emeritus clause > --- > > Key: TEZ-3989 > URL: https://issues.apache.org/jira/browse/TEZ-3989 > Project: Apache Tez > Issue Type: Bug >Reporter: Hitesh Shah >Assignee: Hitesh Shah >Priority: Major > Fix For: 0.10.1 > > > The emeritus clause is not valid and needs to be updated. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (TEZ-3935) DAG aware scheduler should release unassigned new containers rather than hold them
Jason Lowe created TEZ-3935: --- Summary: DAG aware scheduler should release unassigned new containers rather than hold them Key: TEZ-3935 URL: https://issues.apache.org/jira/browse/TEZ-3935 Project: Apache Tez Issue Type: Bug Reporter: Jason Lowe Assignee: Jason Lowe I saw a case for a very large job with many containers where the DAG aware scheduler was getting behind on assigning containers. Newly assigned containers were not finding any matching request, so they were queued for reuse processing. However, it took so long to get through all of the task and container events that the container allocations expired before the container was finally assigned and its launch attempted. Newly assigned containers are assigned to their matching requests, even if that violates the DAG priorities, so it should be safe to simply release these if no tasks could be found to use them. The matching request has either been removed or already satisfied with a reused container. Besides, if we can't find any tasks to take the newly assigned container then it is very likely we have plenty of reusable containers already, and keeping more containers just makes the job a resource hog on the cluster. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
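The policy proposed above can be sketched as a simple check: a newly allocated container whose matching request is no longer pending is released rather than queued for reuse. This is a minimal illustration with simplified stand-in types, not the Tez scheduler's actual classes:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the "release unmatched new containers" policy
// described above; String resource specs stand in for real container/request
// matching, and none of these types are actual Tez classes.
public class NewContainerPolicy {
    private final Set<String> pendingRequests = new HashSet<>();

    public void addRequest(String resourceSpec) {
        pendingRequests.add(resourceSpec);
    }

    /** Called when a request is removed or satisfied by a reused container. */
    public void requestSatisfied(String resourceSpec) {
        pendingRequests.remove(resourceSpec);
    }

    /**
     * A newly allocated container with no matching pending request should be
     * released immediately: holding it only risks an allocation-expiry
     * timeout while the event backlog drains.
     */
    public boolean shouldRelease(String containerResourceSpec) {
        return !pendingRequests.contains(containerResourceSpec);
    }
}
```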
[jira] [Reopened] (TEZ-3913) Precommit build fails to post to JIRA
[ https://issues.apache.org/jira/browse/TEZ-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe reopened TEZ-3913: - > Precommit build fails to post to JIRA > - > > Key: TEZ-3913 > URL: https://issues.apache.org/jira/browse/TEZ-3913 > Project: Apache Tez > Issue Type: Bug > Reporter: Jason Lowe > Assignee: Jason Lowe > Priority: Major > Fix For: 0.9.2 > > Attachments: TEZ-3913.001.patch > > > The precommit build is failing to post comments to Jira due to a 404 error:
> {noformat}
> Unable to log in to server:
> https://issues.apache.org/jira/rpc/soap/jirasoapservice-v2 with user: tezqa.
> Cause: (404)404
> {noformat}
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (TEZ-3913) Precommit build fails to post to JIRA
Jason Lowe created TEZ-3913: --- Summary: Precommit build fails to post to JIRA Key: TEZ-3913 URL: https://issues.apache.org/jira/browse/TEZ-3913 Project: Apache Tez Issue Type: Bug Reporter: Jason Lowe Assignee: Jason Lowe The precommit build is failing to post comments to Jira due to a 404 error:
{noformat}
Unable to log in to server:
https://issues.apache.org/jira/rpc/soap/jirasoapservice-v2 with user: tezqa.
Cause: (404)404
{noformat}
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (TEZ-3898) TestTezCommonUtils fails when compiled against hadoop version >= 2.8
Jason Lowe created TEZ-3898: --- Summary: TestTezCommonUtils fails when compiled against hadoop version >= 2.8 Key: TEZ-3898 URL: https://issues.apache.org/jira/browse/TEZ-3898 Project: Apache Tez Issue Type: Bug Reporter: Jason Lowe Assignee: Jason Lowe TestTezCommonUtils fails when compiled against hadoop 2.8 or later:
{noformat}
$ cd tez-api
$ mvn test -Phadoop28 -P-hadoop27 -Dhadoop.version=2.8.3 -Dtest=TestTezCommonUtils
Running org.apache.tez.common.TestTezCommonUtils
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.266 sec <<< FAILURE!
org.apache.tez.common.TestTezCommonUtils  Time elapsed: 0.265 sec  <<< ERROR!
java.lang.NoClassDefFoundError: org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetFactory
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at org.apache.hadoop.hdfs.server.datanode.FsDatasetTestUtils$Factory.getFactory(FsDatasetTestUtils.java:47)
        at org.apache.hadoop.hdfs.MiniDFSCluster$Builder.<init>(MiniDFSCluster.java:199)
        at org.apache.tez.common.TestTezCommonUtils.setup(TestTezCommonUtils.java:60)
{noformat}
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (TEZ-3896) TestATSV15HistoryLoggingService#testNonSessionDomains is failing
Jason Lowe created TEZ-3896: --- Summary: TestATSV15HistoryLoggingService#testNonSessionDomains is failing Key: TEZ-3896 URL: https://issues.apache.org/jira/browse/TEZ-3896 Project: Apache Tez Issue Type: Bug Reporter: Jason Lowe Assignee: Jason Lowe TestATSV15HistoryLoggingService always fails:
{noformat}
Running org.apache.tez.dag.history.logging.ats.TestATSV15HistoryLoggingService
Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.789 sec <<< FAILURE!
testNonSessionDomains(org.apache.tez.dag.history.logging.ats.TestATSV15HistoryLoggingService)  Time elapsed: 0.477 sec  <<< FAILURE!
org.mockito.exceptions.verification.TooManyActualInvocations:
historyACLPolicyManager.updateTimelineEntityDomain( , "session-id" );
Wanted 5 times:
-> at org.apache.tez.dag.history.logging.ats.TestATSV15HistoryLoggingService.testNonSessionDomains(TestATSV15HistoryLoggingService.java:231)
But was 6 times. Undesired invocation:
-> at org.apache.tez.dag.history.logging.ats.ATSV15HistoryLoggingService.logEntity(ATSV15HistoryLoggingService.java:389)
at org.apache.tez.dag.history.logging.ats.TestATSV15HistoryLoggingService.testNonSessionDomains(TestATSV15HistoryLoggingService.java:231)
{noformat}
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (TEZ-3821) Ability to fail fast tasks that write too much to local disk
Jason Lowe created TEZ-3821: --- Summary: Ability to fail fast tasks that write too much to local disk Key: TEZ-3821 URL: https://issues.apache.org/jira/browse/TEZ-3821 Project: Apache Tez Issue Type: Bug Reporter: Jason Lowe It would be nice to have a configurable limit such that any task that wrote data to the local filesystem beyond that limit would fail quickly rather than waiting for the disk to fill much later, impacting other jobs on the cluster. This is essentially asking for the Tez version of MAPREDUCE-6489. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
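The fail-fast behavior requested above can be sketched as a byte counter checked on every local write, throwing once a configured limit is exceeded. This is a minimal illustration of the idea only, not the MAPREDUCE-6489 implementation or any actual Tez code:

```java
// Hypothetical sketch of a per-task local-disk write limit, illustrating
// the fail-fast behavior requested above. The limiter counts bytes written
// locally and aborts the task once a configured limit is exceeded, rather
// than letting the disk fill and impact other jobs.
public class LocalWriteLimiter {
    private final long limitBytes;   // a value <= 0 disables the check
    private long written;

    public LocalWriteLimiter(long limitBytes) {
        this.limitBytes = limitBytes;
    }

    /** Account bytes written to local disk; fail fast once over the limit. */
    public void recordWrite(long bytes) {
        written += bytes;
        if (limitBytes > 0 && written > limitBytes) {
            throw new IllegalStateException("Task exceeded local write limit: "
                + written + " > " + limitBytes + " bytes");
        }
    }

    public long bytesWritten() {
        return written;
    }
}
```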
[jira] [Created] (TEZ-3770) DAG-aware YARN task scheduler
Jason Lowe created TEZ-3770: --- Summary: DAG-aware YARN task scheduler Key: TEZ-3770 URL: https://issues.apache.org/jira/browse/TEZ-3770 Project: Apache Tez Issue Type: New Feature Reporter: Jason Lowe Assignee: Jason Lowe There are cases where priority alone does not convey the relationship between tasks, and this can cause problems when scheduling or preempting tasks. If the YARN task scheduler was aware of the relationship between tasks then it could make smarter decisions when trying to assign tasks to containers or preempt running tasks to schedule pending tasks. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (TEZ-3744) Findbug warnings after TEZ-3334 merge
Jason Lowe created TEZ-3744: --- Summary: Findbug warnings after TEZ-3334 merge Key: TEZ-3744 URL: https://issues.apache.org/jira/browse/TEZ-3744 Project: Apache Tez Issue Type: Bug Affects Versions: 0.9.0 Reporter: Jason Lowe There are findbug warnings in precommit builds that appear to be caused by the recent TEZ-3334 merge. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (TEZ-3741) Tez outputs should free memory when closed
Jason Lowe created TEZ-3741: --- Summary: Tez outputs should free memory when closed Key: TEZ-3741 URL: https://issues.apache.org/jira/browse/TEZ-3741 Project: Apache Tez Issue Type: Bug Affects Versions: 0.7.1, 0.9.0 Reporter: Jason Lowe Assignee: Jason Lowe Memory buffers aren't being released as quickly as they could be, e.g.: DefaultSorter is holding onto the very large kvbuffer byte array even after close() is called, and Ordered and Unordered outputs should remove references to sorter and kvWriter in their close. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
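The buffer-release pattern described above is simply dropping the reference to the large array in close() so the GC can reclaim it even while the enclosing output object stays reachable. A minimal sketch with a stand-in class (not the actual DefaultSorter code):

```java
// Hypothetical sketch of the release-on-close pattern described above:
// nulling out the reference to a large sort buffer in close() makes the
// array collectable immediately, instead of keeping it alive as long as
// the enclosing output object is referenced.
public class BufferedSorter {
    private byte[] kvbuffer;   // stand-in for DefaultSorter's large sort buffer

    public BufferedSorter(int bufferSize) {
        kvbuffer = new byte[bufferSize];
    }

    public boolean isClosed() {
        return kvbuffer == null;
    }

    public void close() {
        // Drop the reference so the GC can reclaim the array now, rather
        // than when the sorter object itself finally becomes unreachable.
        kvbuffer = null;
    }
}
```

The same reasoning applies to the Ordered and Unordered outputs clearing their sorter and kvWriter fields in close().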
[jira] [Resolved] (TEZ-3738) TestUnorderedPartitionedKVWriter fails due to RejectedExecutionException
[ https://issues.apache.org/jira/browse/TEZ-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe resolved TEZ-3738. - Resolution: Duplicate > TestUnorderedPartitionedKVWriter fails due to RejectedExecutionException > > > Key: TEZ-3738 > URL: https://issues.apache.org/jira/browse/TEZ-3738 > Project: Apache Tez > Issue Type: Bug >Reporter: Jason Lowe > > TestUnorderedPartitionedKVWriter is failing in recent precommit builds. > Stacktrace to follow. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Resolved] (TEZ-3702) Tez shuffle jar includes service loader entry for ClientProtocolProvider but not the corresponding class
[ https://issues.apache.org/jira/browse/TEZ-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe resolved TEZ-3702. - Resolution: Fixed Hadoop Flags: Reviewed Fix Version/s: TEZ-3334 Thanks for the reviews! I committed this to the TEZ-3334 branch. > Tez shuffle jar includes service loader entry for ClientProtocolProvider but > not the corresponding class > > > Key: TEZ-3702 > URL: https://issues.apache.org/jira/browse/TEZ-3702 > Project: Apache Tez > Issue Type: Sub-task >Affects Versions: TEZ-3334 >Reporter: Jason Lowe >Assignee: Jason Lowe > Fix For: TEZ-3334 > > Attachments: TEZ-3702.001.patch > > > The tez-aux-shuffle jar is shading the tez-mapreduce dependency but that > causes the service loader entry for > org.apache.hadoop.mapreduce.protocol.ClientProtocolProvider to be included > without including the referenced > org.apache.tez.mapreduce.client.YarnTezClientProtocolProvider class. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (TEZ-3695) TestTezSharedExecutor fails sporadically
Jason Lowe created TEZ-3695: --- Summary: TestTezSharedExecutor fails sporadically Key: TEZ-3695 URL: https://issues.apache.org/jira/browse/TEZ-3695 Project: Apache Tez Issue Type: Bug Reporter: Jason Lowe TestTezSharedExecutor#testSerialExecution is timing out more often than not for me when running the full TestTezSharedExecutor test suite. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (TEZ-3693) ControlledClock is not used
Jason Lowe created TEZ-3693: --- Summary: ControlledClock is not used Key: TEZ-3693 URL: https://issues.apache.org/jira/browse/TEZ-3693 Project: Apache Tez Issue Type: Bug Reporter: Jason Lowe Priority: Trivial The org.apache.tez.dag.app.ControlledClock class is not referenced in the source. Oddly this is not a test class, like MockClock, as I would have expected. If this is not part of the Tez API then it can be removed. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (TEZ-3535) YarnTaskScheduler can hold onto low priority containers until they expire
Jason Lowe created TEZ-3535: --- Summary: YarnTaskScheduler can hold onto low priority containers until they expire Key: TEZ-3535 URL: https://issues.apache.org/jira/browse/TEZ-3535 Project: Apache Tez Issue Type: Bug Affects Versions: 0.8.4, 0.7.1 Reporter: Jason Lowe Assignee: Jason Lowe With container reuse enabled, YarnTaskScheduler will retain but not schedule any container allocations that are lower priority than the highest priority task requests. This can lead to poor performance as these lower priority containers clog up resources needed for high priority allocations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-3508) TestTaskScheduler cleanup
Jason Lowe created TEZ-3508: --- Summary: TestTaskScheduler cleanup Key: TEZ-3508 URL: https://issues.apache.org/jira/browse/TEZ-3508 Project: Apache Tez Issue Type: Test Reporter: Jason Lowe Assignee: Jason Lowe TestTaskScheduler is very fragile, since it builds mocks of the AMRM client that are tied very specifically to the particulars of the way YarnTaskScheduler is coded. Any variance there often leads to test failures because the mocks no longer accurately reflect what the real AMRM client does. It would be much simpler and more robust to leverage the AMRMClientForTest and AMRMAsyncClientForTest classes in TestTaskSchedulerHelpers rather than maintain fragile mocks attempting to emulate the behaviors of those classes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-3491) Tez job can hang due to container priority inversion
Jason Lowe created TEZ-3491: --- Summary: Tez job can hang due to container priority inversion Key: TEZ-3491 URL: https://issues.apache.org/jira/browse/TEZ-3491 Project: Apache Tez Issue Type: Bug Affects Versions: 0.7.1 Reporter: Jason Lowe Priority: Critical If the Tez AM receives containers at a lower priority than the highest priority task being requested then it fails to assign the container to any task. In addition if the container is new then it refuses to release it if there are any pending tasks. If it takes too long for the higher priority requests to be fulfilled (e.g.: the lower priority containers are filling the queue) then eventually YARN will expire the unused lower priority containers since they were never launched. The Tez AM then never re-requests these lower priority containers and the job hangs because the AM is waiting for containers from the RM that the RM already sent and expired. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-3462) Task attempt failure during container shutdown loses useful container diagnostics
Jason Lowe created TEZ-3462: --- Summary: Task attempt failure during container shutdown loses useful container diagnostics Key: TEZ-3462 URL: https://issues.apache.org/jira/browse/TEZ-3462 Project: Apache Tez Issue Type: Bug Affects Versions: 0.7.1 Reporter: Jason Lowe When a nodemanager kills a task attempt due to excessive memory usage it will send a SIGTERM followed by a SIGKILL. It also sends a useful diagnostic message with the container completion event to the RM which will eventually make it to the AM on a subsequent heartbeat. However if the JVM shutdown processing causes an error in the task (e.g.: filesystem being closed by shutdown hook) then the task attempt can report a failure before the useful NM diagnostic makes it to the AM. The AM then records some other error as the task failure reason, and by the time the container completion status makes it to the AM it does not associate that error with the task attempt and the useful information is lost. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-3444) Handling of fetch-failures should consider time spent producing output
Jason Lowe created TEZ-3444: --- Summary: Handling of fetch-failures should consider time spent producing output Key: TEZ-3444 URL: https://issues.apache.org/jira/browse/TEZ-3444 Project: Apache Tez Issue Type: Improvement Reporter: Jason Lowe When handling fetch failures and deciding whether the upstream task should be re-run, we should consider the duration of the upstream task that generated the data trying to be fetched. If the upstream task ran for a long time then we may want to retry a bit harder before deciding to re-run. If the upstream task executed in a few seconds then we should probably re-run the upstream task more aggressively since that may be cheaper than multiple retries that timeout. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
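The improvement suggested above is a cost-aware retry budget: tolerate more fetch retries when the upstream task was expensive to produce, and give up quickly when re-running it is cheap. A minimal sketch of one possible heuristic (the class, thresholds, and scaling are all hypothetical, not a proposed Tez design):

```java
// Hypothetical sketch of a duration-aware fetch retry budget, as suggested
// above: long-running producers earn more fetch retries before being
// re-run; cheap producers are re-run almost immediately, since that is
// likely cheaper than multiple retries that time out.
public class FetchRetryPolicy {
    private final int baseRetries;
    private final long cheapTaskMillis;

    public FetchRetryPolicy(int baseRetries, long cheapTaskMillis) {
        this.baseRetries = baseRetries;
        this.cheapTaskMillis = cheapTaskMillis;
    }

    /** Scale the retry budget by how long the producing task ran. */
    public int maxRetries(long producerRuntimeMillis) {
        if (producerRuntimeMillis <= cheapTaskMillis) {
            return 1;  // re-running the producer beats waiting out retries
        }
        // Cap the scaling so very long producers don't retry forever.
        long multiplier = Math.min(4, producerRuntimeMillis / cheapTaskMillis);
        return (int) (baseRetries * multiplier);
    }
}
```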
[jira] [Created] (TEZ-3415) Ability to configure shuffle server listen queue length
Jason Lowe created TEZ-3415: --- Summary: Ability to configure shuffle server listen queue length Key: TEZ-3415 URL: https://issues.apache.org/jira/browse/TEZ-3415 Project: Apache Tez Issue Type: Sub-task Reporter: Jason Lowe -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TEZ-3336) Hive map-side join job sometimes fails with ROOT_INPUT_INIT_FAILURE
[ https://issues.apache.org/jira/browse/TEZ-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe resolved TEZ-3336. - Resolution: Invalid Closing this as invalid since it seems like a problem with Hive's use of Tez rather than Tez itself. [~mithun] please reopen with details if you find otherwise. > Hive map-side join job sometimes fails with ROOT_INPUT_INIT_FAILURE > --- > > Key: TEZ-3336 > URL: https://issues.apache.org/jira/browse/TEZ-3336 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.7.1 >Reporter: Jason Lowe > > When Hive does a map-side join it can generate a DAG where a vertex has two > inputs, one from an upstream task and another using MRInputAMSplitGenerator. > If it takes a while for MRInputAMSplitGenerator to compute the splits and one > of the tasks for the other upstream vertex completes then the job can fail > with an error since MRInputAMSplitGenerator does not expect to receive any > events. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-3368) NPE in DelayedContainerManager
Jason Lowe created TEZ-3368: --- Summary: NPE in DelayedContainerManager Key: TEZ-3368 URL: https://issues.apache.org/jira/browse/TEZ-3368 Project: Apache Tez Issue Type: Bug Affects Versions: 0.7.1 Reporter: Jason Lowe Saw a Tez AM hang due to an NPE in the DelayedContainerManager:
{noformat}
2016-07-17 01:53:23,157 [ERROR] [DelayedContainerManager] |yarn.YarnUncaughtExceptionHandler|: Thread Thread[DelayedContainerManager,5,main] threw an Exception.
java.lang.NullPointerException
        at org.apache.tez.dag.app.rm.TezAMRMClientAsync.getMatchingRequestsForTopPriority(TezAMRMClientAsync.java:142)
        at org.apache.tez.dag.app.rm.YarnTaskSchedulerService.getMatchingRequestWithoutPriority(YarnTaskSchedulerService.java:1474)
        at org.apache.tez.dag.app.rm.YarnTaskSchedulerService.access$500(YarnTaskSchedulerService.java:84)
        at org.apache.tez.dag.app.rm.YarnTaskSchedulerService$NodeLocalContainerAssigner.assignReUsedContainer(YarnTaskSchedulerService.java:1869)
        at org.apache.tez.dag.app.rm.YarnTaskSchedulerService.assignReUsedContainerWithLocation(YarnTaskSchedulerService.java:1753)
        at org.apache.tez.dag.app.rm.YarnTaskSchedulerService.assignDelayedContainer(YarnTaskSchedulerService.java:733)
        at org.apache.tez.dag.app.rm.YarnTaskSchedulerService.access$600(YarnTaskSchedulerService.java:84)
        at org.apache.tez.dag.app.rm.YarnTaskSchedulerService$DelayedContainerManager.run(YarnTaskSchedulerService.java:2030)
{noformat}
After the DelayedContainerManager thread exited, the AM proceeded to receive requested containers that went unused until the container allocations expired. They would then be re-requested, and the cycle repeated indefinitely. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-3350) Shuffle spills are not spilled to a container-specific directory
Jason Lowe created TEZ-3350: --- Summary: Shuffle spills are not spilled to a container-specific directory Key: TEZ-3350 URL: https://issues.apache.org/jira/browse/TEZ-3350 Project: Apache Tez Issue Type: Bug Affects Versions: 0.7.1 Reporter: Jason Lowe If a Tez task receives too much input data and needs to spill the inputs to disk it ends up using a path that is not container-specific. Therefore YARN will not automatically cleanup these files when the container exits as it should, and instead the files linger until the entire application completes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-3306) Improve container priority assignments for vertices
Jason Lowe created TEZ-3306: --- Summary: Improve container priority assignments for vertices Key: TEZ-3306 URL: https://issues.apache.org/jira/browse/TEZ-3306 Project: Apache Tez Issue Type: Improvement Reporter: Jason Lowe After TEZ-3296 the priority space is sparsely used. We should consider doing a breadth-first traversal of the DAG or reusing the client-side topological sorting to allow a more efficient use of the priority space. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-3296) Tez job can hang if two vertices at the same root distance have different task requirements
Jason Lowe created TEZ-3296: --- Summary: Tez job can hang if two vertices at the same root distance have different task requirements Key: TEZ-3296 URL: https://issues.apache.org/jira/browse/TEZ-3296 Project: Apache Tez Issue Type: Bug Affects Versions: 0.7.1 Reporter: Jason Lowe Priority: Critical When two vertices have the same distance from the root, Tez will schedule containers with the same priority. However those vertices could have different task requirements and therefore different capabilities. As documented in YARN-314, YARN currently doesn't support requests for multiple sizes at the same priority. In practice this leads to one vertex's allocation requests clobbering the other's, and that can result in a situation where the Tez AM is waiting on containers it will never receive from the RM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-3293) Fetch failures can cause a shuffle hang waiting for memory merge that never starts
Jason Lowe created TEZ-3293: --- Summary: Fetch failures can cause a shuffle hang waiting for memory merge that never starts Key: TEZ-3293 URL: https://issues.apache.org/jira/browse/TEZ-3293 Project: Apache Tez Issue Type: Bug Affects Versions: 0.8.3, 0.7.1 Reporter: Jason Lowe Assignee: Jason Lowe Tez jobs can hang in shuffle waiting for a memory merge that never starts. When a MapOutput is reserved it increments usedMemory, but when it is unreserved it decrements usedMemory _and_ commitMemory. If enough fetch failures of sufficient size occur, commitMemory may never reach the merge threshold even after all outstanding transfers have committed, and thus the shuffle hangs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
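The accounting bug above can be illustrated with a toy model of the two counters: a failed fetch must roll back only the reservation (usedMemory), never commitMemory, or the committed total drifts downward and the merge threshold becomes unreachable. This sketch shows the correct bookkeeping with simplified stand-in code, not the actual Tez merge manager:

```java
// Hypothetical sketch of correct shuffle merge accounting, illustrating the
// invariant behind the bug described above: unreserving a failed fetch must
// release only usedMemory (the reservation); commitMemory only ever grows as
// fetches complete, so it can still reach the merge threshold.
public class MergeAccounting {
    private long usedMemory;     // bytes reserved by in-flight fetches
    private long commitMemory;   // bytes from completed fetches awaiting merge
    private final long mergeThreshold;

    public MergeAccounting(long mergeThreshold) {
        this.mergeThreshold = mergeThreshold;
    }

    /** A fetch reserves memory before the transfer starts. */
    public void reserve(long bytes) {
        usedMemory += bytes;
    }

    /** A completed fetch converts its reservation into committed memory. */
    public void commit(long bytes) {
        usedMemory -= bytes;
        commitMemory += bytes;
    }

    /** Correct rollback for a failed fetch: release the reservation only. */
    public void unreserve(long bytes) {
        usedMemory -= bytes;
    }

    public boolean mergeShouldStart() {
        return commitMemory >= mergeThreshold;
    }

    public long commitMemory() { return commitMemory; }

    public long usedMemory() { return usedMemory; }
}
```

Had unreserve() also decremented commitMemory, a run of large failed fetches would permanently lower the committed total and the merge would never trigger, which is exactly the hang reported.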
[jira] [Created] (TEZ-3260) Ability to disable IFile checksum verification during shuffle transfers
Jason Lowe created TEZ-3260: --- Summary: Ability to disable IFile checksum verification during shuffle transfers Key: TEZ-3260 URL: https://issues.apache.org/jira/browse/TEZ-3260 Project: Apache Tez Issue Type: Improvement Reporter: Jason Lowe In TEZ-3237 [~rajesh.balamohan] requested the ability to avoid the computational expense of verifying IFile checksums during shuffle transfers for cases where the user is not concerned about data corruption and would like the additional performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-3246) Improve diagnostics when DAG killed by user
Jason Lowe created TEZ-3246: --- Summary: Improve diagnostics when DAG killed by user Key: TEZ-3246 URL: https://issues.apache.org/jira/browse/TEZ-3246 Project: Apache Tez Issue Type: Improvement Reporter: Jason Lowe It would be nice if the DAG diagnostics included the user and host that originated the kill request for a DAG. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-3244) Allow overlap of input and output memory when they are not concurrent
Jason Lowe created TEZ-3244: --- Summary: Allow overlap of input and output memory when they are not concurrent Key: TEZ-3244 URL: https://issues.apache.org/jira/browse/TEZ-3244 Project: Apache Tez Issue Type: Bug Reporter: Jason Lowe For cases when memory for inputs and outputs is not needed simultaneously, it would be more efficient to allow inputs to use the memory normally set aside for outputs and vice-versa. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-3237) Corrupted shuffle transfers to disk are not detected during transfer
Jason Lowe created TEZ-3237: --- Summary: Corrupted shuffle transfers to disk are not detected during transfer Key: TEZ-3237 URL: https://issues.apache.org/jira/browse/TEZ-3237 Project: Apache Tez Issue Type: Bug Affects Versions: 0.7.0 Reporter: Jason Lowe When a shuffle transfer is larger than the single transfer limit it gets written straight to disk during the transfer. Unfortunately there are no checksum validations performed during that transfer, so if the data is corrupted at the source or during transmit it goes undetected. Only later when the task tries to consume the transferred data is the error detected, but at that point it's too late to blame the source task for the error. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
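The gap described above is the absence of checksum validation while a large transfer streams straight to disk. A minimal sketch of checksumming during the copy, so corruption can be attributed to the source immediately rather than at read time (CRC32 stands in here; IFile actually uses its own checksum format, and this is not Tez fetcher code):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.zip.CRC32;
import java.util.zip.Checksum;

// Hypothetical sketch of validating a shuffle transfer while spilling it
// straight to disk. Checksumming each chunk as it streams past lets the
// fetcher compare against the source's checksum at transfer time and blame
// the source task right away, instead of discovering corruption later when
// the consumer reads the spilled data.
public class ChecksumCopy {
    /** Copy in -> out, returning the checksum of everything transferred. */
    public static long copyWithChecksum(InputStream in, OutputStream out) {
        Checksum crc = new CRC32();
        byte[] buf = new byte[8192];
        try {
            int n;
            while ((n = in.read(buf)) != -1) {
                crc.update(buf, 0, n);   // checksum each chunk as it streams
                out.write(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return crc.getValue();           // compare to the source's checksum
    }
}
```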
[jira] [Created] (TEZ-3213) Uncaught exception during vertex recovery leads to invalid state transition loop
Jason Lowe created TEZ-3213: --- Summary: Uncaught exception during vertex recovery leads to invalid state transition loop Key: TEZ-3213 URL: https://issues.apache.org/jira/browse/TEZ-3213 Project: Apache Tez Issue Type: Bug Affects Versions: 0.7.0 Reporter: Jason Lowe If an uncaught exception occurs during a state transition from the RECOVERING vertex then V_INTERNAL_ERROR will be delivered to the state machine, but that event is not handled in the RECOVERING state. That in turn causes another V_INTERNAL_ERROR event to be delivered to the state machine, and it loops, logging the invalid transitions. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-3203) DAG hangs when one of the upstream vertices has zero tasks
Jason Lowe created TEZ-3203: --- Summary: DAG hangs when one of the upstream vertices has zero tasks Key: TEZ-3203 URL: https://issues.apache.org/jira/browse/TEZ-3203 Project: Apache Tez Issue Type: Bug Affects Versions: 0.7.0 Reporter: Jason Lowe Priority: Critical A DAG hangs during execution if it has a vertex with multiple inputs and one of those upstream vertices has zero tasks and is using ShuffleVertexManager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-3193) Deadlock in AM during task commit request
Jason Lowe created TEZ-3193: --- Summary: Deadlock in AM during task commit request Key: TEZ-3193 URL: https://issues.apache.org/jira/browse/TEZ-3193 Project: Apache Tez Issue Type: Bug Affects Versions: 0.8.2, 0.7.1 Reporter: Jason Lowe Priority: Blocker The AM can deadlock between TaskImpl and TaskAttemptImpl. Stacktrace and details in a followup comment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-3191) NM container diagnostics for excess resource usage can be lost if task fails while being killed
Jason Lowe created TEZ-3191: --- Summary: NM container diagnostics for excess resource usage can be lost if task fails while being killed Key: TEZ-3191 URL: https://issues.apache.org/jira/browse/TEZ-3191 Project: Apache Tez Issue Type: Bug Affects Versions: 0.7.0 Reporter: Jason Lowe This is the Tez version of MAPREDUCE-4955. I saw a misconfigured Tez job report a task attempt as failed due to a filesystem closed error because the NM killed the container due to excess memory usage. Unfortunately the SIGTERM sent by the NM caused the filesystem shutdown hook to close the filesystems, and that triggered a failure in the main thread. If the failure is reported to the AM via the umbilical before the NM container status is received via the RM then the useful container diagnostics from the NM are lost in the job history. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-3167) TestRecovery occasionally times out
Jason Lowe created TEZ-3167: --- Summary: TestRecovery occasionally times out Key: TEZ-3167 URL: https://issues.apache.org/jira/browse/TEZ-3167 Project: Apache Tez Issue Type: Bug Reporter: Jason Lowe TestRecovery has been timing out sporadically in precommit builds. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-3141) mapreduce.task.timeout is not translated to container heartbeat timeout
Jason Lowe created TEZ-3141: --- Summary: mapreduce.task.timeout is not translated to container heartbeat timeout Key: TEZ-3141 URL: https://issues.apache.org/jira/browse/TEZ-3141 Project: Apache Tez Issue Type: Bug Affects Versions: 0.7.1 Reporter: Jason Lowe Assignee: Jason Lowe TEZ-2966 added the deprecation to the runtime key map, but the container timeout is an AM-level property and therefore the runtime map translation is missed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-3114) Shuffle OOM due to EventMetaData flood
Jason Lowe created TEZ-3114: --- Summary: Shuffle OOM due to EventMetaData flood Key: TEZ-3114 URL: https://issues.apache.org/jira/browse/TEZ-3114 Project: Apache Tez Issue Type: Bug Affects Versions: 0.7.0 Reporter: Jason Lowe A task encountered an OOM during shuffle, and investigation of the heap dump showed a lot of memory being consumed by almost 3.5 million EventMetaData objects. Auto-parallelism had reduced the number of tasks in the vertex to 1 and there were 2000 upstream tasks to shuffle. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-3115) Shuffle string handling adds significant memory overhead
Jason Lowe created TEZ-3115: --- Summary: Shuffle string handling adds significant memory overhead Key: TEZ-3115 URL: https://issues.apache.org/jira/browse/TEZ-3115 Project: Apache Tez Issue Type: Bug Affects Versions: 0.7.0 Reporter: Jason Lowe While investigating the OOM heap dump from TEZ-3114 I noticed that the ShuffleManager and other shuffle-related objects were holding onto many strings that added up to over a hundred megabytes of memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-3102) Fetch failure of a speculated task causes job hang
Jason Lowe created TEZ-3102: --- Summary: Fetch failure of a speculated task causes job hang Key: TEZ-3102 URL: https://issues.apache.org/jira/browse/TEZ-3102 Project: Apache Tez Issue Type: Bug Affects Versions: 0.7.0 Reporter: Jason Lowe Assignee: Jason Lowe Priority: Critical If a task speculates then succeeds, one task will be marked successful and the other killed. Then if the task retroactively fails due to fetch failures the Tez AM will fail to reschedule another task. This results in a hung job. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-3066) TaskAttemptFinishedEvent ConcurrentModificationException if processed by RecoveryService and history logging simultaneously
Jason Lowe created TEZ-3066: --- Summary: TaskAttemptFinishedEvent ConcurrentModificationException if processed by RecoveryService and history logging simultaneously Key: TEZ-3066 URL: https://issues.apache.org/jira/browse/TEZ-3066 Project: Apache Tez Issue Type: Bug Affects Versions: 0.7.0 Reporter: Jason Lowe A ConcurrentModificationException can occur if a TaskAttemptFinishedEvent is processed simultaneously by the recovery service and another history logging service. Sample stacktraces to follow. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-3051) Vertex failed with invalid event DAG_VERTEX_RERUNNING at SUCCEEDED
Jason Lowe created TEZ-3051: --- Summary: Vertex failed with invalid event DAG_VERTEX_RERUNNING at SUCCEEDED Key: TEZ-3051 URL: https://issues.apache.org/jira/browse/TEZ-3051 Project: Apache Tez Issue Type: Bug Affects Versions: 0.7.0 Reporter: Jason Lowe I saw a job fail due to an internal error on a vertex: org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: DAG_VERTEX_RERUNNING at SUCCEEDED Stacktrace to follow. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-3009) Errors that occur during container task acquisition are not logged
Jason Lowe created TEZ-3009: --- Summary: Errors that occur during container task acquisition are not logged Key: TEZ-3009 URL: https://issues.apache.org/jira/browse/TEZ-3009 Project: Apache Tez Issue Type: Bug Affects Versions: 0.7.0 Reporter: Jason Lowe If TezChild encounters an error while trying to obtain a task the error will be silently handled. This results in a mysterious shutdown of containers with no cause. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-3010) Container task acquisition has no retries for errors
Jason Lowe created TEZ-3010: --- Summary: Container task acquisition has no retries for errors Key: TEZ-3010 URL: https://issues.apache.org/jira/browse/TEZ-3010 Project: Apache Tez Issue Type: Bug Affects Versions: 0.7.0 Reporter: Jason Lowe There's no retries for errors that occur during task acquisition. If any error occurs the container will just shut down, resulting in task attempt failures if a task attempt happened to be assigned to the container by the AM. The container should try harder to obtain the task before giving up. -- This message was sent by Atlassian JIRA (v6.3.4#6332)