[jira] [Resolved] (TEZ-3982) DAGAppMaster and tasks should not report negative or invalid progress

2018-09-21 Thread Jason Lowe (JIRA)


[ https://issues.apache.org/jira/browse/TEZ-3982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe resolved TEZ-3982.
-
   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: 0.9.2

Thanks, [~kshukla]!  I committed this to branch-0.9.

> DAGAppMaster and tasks should not report negative or invalid progress
> -
>
> Key: TEZ-3982
> URL: https://issues.apache.org/jira/browse/TEZ-3982
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.9.1, 0.10.0
>Reporter: Kuhu Shukla
>Assignee: Kuhu Shukla
>Priority: Major
> Fix For: 0.9.2, 0.10.1
>
> Attachments: TEZ-3982.001.patch, TEZ-3982.002.patch, 
> TEZ-3982.003.patch, TEZ-3982.004.patch, TEZ-3982.005.branch-0.9.patch
>
>
> The AM fails (AMRMClient expects non-negative progress) if any component 
> reports invalid or negative progress. DAGAppMaster and tasks should validate 
> the progress they report so the AM can continue to execute.
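
Not part of the issue or the committed patch, just a minimal sketch of the kind of guard described above, assuming the AM clamps whatever progress a component reports into the [0.0, 1.0] range before it reaches AMRMClient; the class and method names are illustrative.
{noformat}
// Illustrative sketch only -- not the committed TEZ-3982 patch.
final class ProgressUtil {

  private ProgressUtil() {
  }

  /** Clamp NaN, negative, or >1.0 progress into the [0.0, 1.0] range AMRMClient expects. */
  static float sanitize(float progress) {
    if (Float.isNaN(progress) || progress < 0.0f) {
      return 0.0f;                      // invalid or negative progress is reported as 0
    }
    return Math.min(progress, 1.0f);    // cap runaway progress at 100%
  }
}
{noformat}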





[jira] [Reopened] (TEZ-3982) DAGAppMaster and tasks should not report negative or invalid progress

2018-09-21 Thread Jason Lowe (JIRA)


[ https://issues.apache.org/jira/browse/TEZ-3982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe reopened TEZ-3982:
-

This broke the branch-0.9 build.  It looks like MonotonicClock isn't in the 
Hadoop version that branch-0.9 depends on:
{noformat}
[ERROR] 
/tez/tez-dag/src/test/java/org/apache/tez/dag/app/TestDAGAppMaster.java:[17,35] 
cannot find symbol
  symbol:   class MonotonicClock
  location: package org.apache.hadoop.yarn.util
[ERROR] 
/tez/tez-dag/src/test/java/org/apache/tez/dag/app/TestDAGAppMaster.java:[431,32]
 cannot find symbol
  symbol:   class MonotonicClock
  location: class org.apache.tez.dag.app.TestDAGAppMaster
[INFO] 2 errors 
{noformat}

I reverted this from branch-0.9 to fix the build.
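
For illustration only: a test-local stand-in for MonotonicClock would avoid the missing symbol on older Hadoop. This assumes the org.apache.hadoop.yarn.util.Clock interface that is present in the Hadoop versions branch-0.9 builds against; it is not the fix that was actually applied (the commit was reverted instead).
{noformat}
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.yarn.util.Clock;

// Hypothetical test-only replacement for MonotonicClock on older Hadoop versions.
final class TestMonotonicClock implements Clock {
  @Override
  public long getTime() {
    // System.nanoTime() never goes backwards, unlike System.currentTimeMillis().
    return TimeUnit.NANOSECONDS.toMillis(System.nanoTime());
  }
}
{noformat}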

> DAGAppMaster and tasks should not report negative or invalid progress
> -
>
> Key: TEZ-3982
> URL: https://issues.apache.org/jira/browse/TEZ-3982
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.9.1, 0.10.0
>Reporter: Kuhu Shukla
>Assignee: Kuhu Shukla
>Priority: Major
> Fix For: 0.10.1
>
> Attachments: TEZ-3982.001.patch, TEZ-3982.002.patch, 
> TEZ-3982.003.patch, TEZ-3982.004.patch
>
>
> The AM fails (AMRMClient expects non-negative progress) if any component 
> reports invalid or negative progress. DAGAppMaster and tasks should validate 
> the progress they report so the AM can continue to execute.





[jira] [Resolved] (TEZ-3989) Fix by-laws related to emeritus clause

2018-09-13 Thread Jason Lowe (JIRA)


[ https://issues.apache.org/jira/browse/TEZ-3989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe resolved TEZ-3989.
-
   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: 0.10.1

Thanks, [~hitesh]! I committed this to master.

> Fix by-laws related to emeritus clause 
> ---
>
> Key: TEZ-3989
> URL: https://issues.apache.org/jira/browse/TEZ-3989
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Hitesh Shah
>Assignee: Hitesh Shah
>Priority: Major
> Fix For: 0.10.1
>
>
> The emeritus clause is not valid and needs to be updated.





[jira] [Created] (TEZ-3935) DAG aware scheduler should release unassigned new containers rather than hold them

2018-05-14 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3935:
---

 Summary: DAG aware scheduler should release unassigned new 
containers rather than hold them
 Key: TEZ-3935
 URL: https://issues.apache.org/jira/browse/TEZ-3935
 Project: Apache Tez
  Issue Type: Bug
Reporter: Jason Lowe
Assignee: Jason Lowe


I saw a case for a very large job with many containers where the DAG aware 
scheduler was getting behind on assigning containers.  Newly assigned 
containers were not finding any matching request, so they were queued for reuse 
processing.  However it took so long to get through all of the task and 
container events that the container allocations expired before the container 
was finally assigned and attempted to be launched.

Newly assigned containers are assigned to their matching requests, even if that 
violates the DAG priorities, so it should be safe to simply release these if no 
tasks could be found to use them.  The matching request has either been removed 
or already satisfied with a reused container.  Besides, if we can't find any 
tasks to take the newly assigned container then it is very likely we have 
plenty of reusable containers already, and keeping more containers just makes 
the job a resource hog on the cluster.
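
A rough sketch of the behavior proposed above, assuming the standard AMRMClientAsync API; the surrounding handler class and its fields are made up for illustration and are not the actual DAG-aware scheduler code.
{noformat}
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.client.api.async.AMRMClientAsync;

// Illustrative handler: release a newly allocated container that matches no pending request.
final class NewContainerHandler {
  private final AMRMClientAsync<?> amRmClient;

  NewContainerHandler(AMRMClientAsync<?> amRmClient) {
    this.amRmClient = amRmClient;
  }

  void onNewContainer(Container container, boolean hasMatchingRequest) {
    if (!hasMatchingRequest) {
      // Hand the allocation back to the RM instead of queueing it for reuse,
      // so it cannot sit idle until its allocation expires.
      amRmClient.releaseAssignedContainer(container.getId());
    }
  }
}
{noformat}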





[jira] [Reopened] (TEZ-3913) Precommit build fails to post to JIRA

2018-04-09 Thread Jason Lowe (JIRA)

[ https://issues.apache.org/jira/browse/TEZ-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe reopened TEZ-3913:
-

> Precommit build fails to post to JIRA
> -
>
> Key: TEZ-3913
> URL: https://issues.apache.org/jira/browse/TEZ-3913
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jason Lowe
>Assignee: Jason Lowe
>Priority: Major
> Fix For: 0.9.2
>
> Attachments: TEZ-3913.001.patch
>
>
> The precommit build is failing to post comments to Jira due to a 404 error:
> {noformat}
> Unable to log in to server: 
> https://issues.apache.org/jira/rpc/soap/jirasoapservice-v2 with user: tezqa.
>  Cause: (404)404
> {noformat}





[jira] [Created] (TEZ-3913) Precommit build fails to post to JIRA

2018-04-09 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3913:
---

 Summary: Precommit build fails to post to JIRA
 Key: TEZ-3913
 URL: https://issues.apache.org/jira/browse/TEZ-3913
 Project: Apache Tez
  Issue Type: Bug
Reporter: Jason Lowe
Assignee: Jason Lowe


The precommit build is failing to post comments to Jira due to a 404 error:
{noformat}
Unable to log in to server: 
https://issues.apache.org/jira/rpc/soap/jirasoapservice-v2 with user: tezqa.
 Cause: (404)404
{noformat}






[jira] [Created] (TEZ-3898) TestTezCommonUtils fails when compiled against hadoop version >= 2.8

2018-02-16 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3898:
---

 Summary: TestTezCommonUtils fails when compiled against hadoop 
version >= 2.8
 Key: TEZ-3898
 URL: https://issues.apache.org/jira/browse/TEZ-3898
 Project: Apache Tez
  Issue Type: Bug
Reporter: Jason Lowe
Assignee: Jason Lowe


TestTezCommonUtils fails when compiled against hadoop 2.8 or later:
{noformat}
$ cd tez-api
$ mvn test -Phadoop28 -P-hadoop27 -Dhadoop.version=2.8.3 -Dtest=TestTezCommonUtils
Running org.apache.tez.common.TestTezCommonUtils
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.266 sec <<< 
FAILURE!
org.apache.tez.common.TestTezCommonUtils  Time elapsed: 0.265 sec  <<< ERROR!
java.lang.NoClassDefFoundError: 
org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetFactory
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at 
org.apache.hadoop.hdfs.server.datanode.FsDatasetTestUtils$Factory.getFactory(FsDatasetTestUtils.java:47)
at 
org.apache.hadoop.hdfs.MiniDFSCluster$Builder.<init>(MiniDFSCluster.java:199)
at 
org.apache.tez.common.TestTezCommonUtils.setup(TestTezCommonUtils.java:60)
{noformat}





[jira] [Created] (TEZ-3896) TestATSV15HistoryLoggingService#testNonSessionDomains is failing

2018-02-16 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3896:
---

 Summary: TestATSV15HistoryLoggingService#testNonSessionDomains is 
failing
 Key: TEZ-3896
 URL: https://issues.apache.org/jira/browse/TEZ-3896
 Project: Apache Tez
  Issue Type: Bug
Reporter: Jason Lowe
Assignee: Jason Lowe


TestATSV15HistoryLoggingService always fails:
{noformat}
Running org.apache.tez.dag.history.logging.ats.TestATSV15HistoryLoggingService
Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.789 sec <<< 
FAILURE!
testNonSessionDomains(org.apache.tez.dag.history.logging.ats.TestATSV15HistoryLoggingService)
  Time elapsed: 0.477 sec  <<< FAILURE!
org.mockito.exceptions.verification.TooManyActualInvocations: 
historyACLPolicyManager.updateTimelineEntityDomain(
,
"session-id"
);
Wanted 5 times:
-> at 
org.apache.tez.dag.history.logging.ats.TestATSV15HistoryLoggingService.testNonSessionDomains(TestATSV15HistoryLoggingService.java:231)
But was 6 times. Undesired invocation:
-> at 
org.apache.tez.dag.history.logging.ats.ATSV15HistoryLoggingService.logEntity(ATSV15HistoryLoggingService.java:389)

at 
org.apache.tez.dag.history.logging.ats.TestATSV15HistoryLoggingService.testNonSessionDomains(TestATSV15HistoryLoggingService.java:231)
{noformat}





[jira] [Created] (TEZ-3821) Ability to fail fast tasks that write too much to local disk

2017-08-21 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3821:
---

 Summary: Ability to fail fast tasks that write too much to local 
disk
 Key: TEZ-3821
 URL: https://issues.apache.org/jira/browse/TEZ-3821
 Project: Apache Tez
  Issue Type: Bug
Reporter: Jason Lowe


It would be nice to have a configurable limit such that any task that wrote 
data to the local filesystem beyond that limit would fail quickly rather than 
waiting for the disk to fill much later, impacting other jobs on the cluster.

This is essentially asking for the Tez version of MAPREDUCE-6489.
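
As a hedged sketch of what such a limit could look like (the wrapper class and the way the limit is plumbed in are made up, not from any patch):
{noformat}
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical wrapper: fail fast once a task's local-disk writes exceed a configured budget.
final class LimitedLocalOutputStream extends FilterOutputStream {
  private final long limitBytes;
  private long written;

  LimitedLocalOutputStream(OutputStream out, long limitBytes) {
    super(out);
    this.limitBytes = limitBytes;
  }

  @Override
  public void write(int b) throws IOException {
    write(new byte[] { (byte) b }, 0, 1);
  }

  @Override
  public void write(byte[] b, int off, int len) throws IOException {
    written += len;
    if (written > limitBytes) {
      throw new IOException("Local bytes written " + written
          + " exceeded the configured limit of " + limitBytes);
    }
    out.write(b, off, len);
  }
}
{noformat}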





[jira] [Created] (TEZ-3770) DAG-aware YARN task scheduler

2017-06-22 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3770:
---

 Summary: DAG-aware YARN task scheduler
 Key: TEZ-3770
 URL: https://issues.apache.org/jira/browse/TEZ-3770
 Project: Apache Tez
  Issue Type: New Feature
Reporter: Jason Lowe
Assignee: Jason Lowe


There are cases where priority alone does not convey the relationship between 
tasks, and this can cause problems when scheduling or preempting tasks.  If the 
YARN task scheduler was aware of the relationship between tasks then it could 
make smarter decisions when trying to assign tasks to containers or preempt 
running tasks to schedule pending tasks.





[jira] [Created] (TEZ-3744) Findbug warnings after TEZ-3334 merge

2017-05-25 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3744:
---

 Summary: Findbug warnings after TEZ-3334 merge
 Key: TEZ-3744
 URL: https://issues.apache.org/jira/browse/TEZ-3744
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.9.0
Reporter: Jason Lowe


There are findbug warnings in precommit builds that appear to be caused by the 
recent TEZ-3334 merge.





[jira] [Created] (TEZ-3741) Tez outputs should free memory when closed

2017-05-25 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3741:
---

 Summary: Tez outputs should free memory when closed
 Key: TEZ-3741
 URL: https://issues.apache.org/jira/browse/TEZ-3741
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.7.1, 0.9.0
Reporter: Jason Lowe
Assignee: Jason Lowe


Memory buffers aren't being released as quickly as they could be, e.g.: 
DefaultSorter is holding onto the very large kvbuffer byte array even after 
close() is called, and Ordered and Unordered outputs should remove references 
to sorter and kvWriter in their close.
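
A simplified illustration of the idea (field and method names are stand-ins, not the real DefaultSorter internals):
{noformat}
// Illustrative only: release large buffers eagerly in close() instead of
// holding them until the whole output object becomes unreachable.
final class SorterSketch {
  private byte[] kvbuffer = new byte[64 << 20];  // stand-in for the large sort buffer

  void close() {
    // ... flush / final spill would happen here in the real sorter ...
    kvbuffer = null;  // drop the reference so the buffer is collectible immediately
  }
}
{noformat}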






[jira] [Resolved] (TEZ-3738) TestUnorderedPartitionedKVWriter fails due to RejectedExecutionException

2017-05-25 Thread Jason Lowe (JIRA)

[ https://issues.apache.org/jira/browse/TEZ-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe resolved TEZ-3738.
-
Resolution: Duplicate

> TestUnorderedPartitionedKVWriter fails due to RejectedExecutionException
> 
>
> Key: TEZ-3738
> URL: https://issues.apache.org/jira/browse/TEZ-3738
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jason Lowe
>
> TestUnorderedPartitionedKVWriter is failing in recent precommit builds.  
> Stacktrace to follow.





[jira] [Resolved] (TEZ-3702) Tez shuffle jar includes service loader entry for ClientProtocolProvider but not the corresponding class

2017-04-27 Thread Jason Lowe (JIRA)

[ https://issues.apache.org/jira/browse/TEZ-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe resolved TEZ-3702.
-
   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: TEZ-3334

Thanks for the reviews!  I committed this to the TEZ-3334 branch.

> Tez shuffle jar includes service loader entry for ClientProtocolProvider but 
> not the corresponding class
> 
>
> Key: TEZ-3702
> URL: https://issues.apache.org/jira/browse/TEZ-3702
> Project: Apache Tez
>  Issue Type: Sub-task
>Affects Versions: TEZ-3334
>Reporter: Jason Lowe
>Assignee: Jason Lowe
> Fix For: TEZ-3334
>
> Attachments: TEZ-3702.001.patch
>
>
> The tez-aux-shuffle jar is shading the tez-mapreduce dependency but that 
> causes the service loader entry for 
> org.apache.hadoop.mapreduce.protocol.ClientProtocolProvider to be included 
> without including the referenced 
> org.apache.tez.mapreduce.client.YarnTezClientProtocolProvider class.





[jira] [Created] (TEZ-3695) TestTezSharedExecutor fails sporadically

2017-04-24 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3695:
---

 Summary: TestTezSharedExecutor fails sporadically
 Key: TEZ-3695
 URL: https://issues.apache.org/jira/browse/TEZ-3695
 Project: Apache Tez
  Issue Type: Bug
Reporter: Jason Lowe


TestTezSharedExecutor#testSerialExecution is timing out more often than not for 
me when running the full TestTezSharedExecutor test suite.





[jira] [Created] (TEZ-3693) ControlledClock is not used

2017-04-21 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3693:
---

 Summary: ControlledClock is not used
 Key: TEZ-3693
 URL: https://issues.apache.org/jira/browse/TEZ-3693
 Project: Apache Tez
  Issue Type: Bug
Reporter: Jason Lowe
Priority: Trivial


The org.apache.tez.dag.app.ControlledClock class is not referenced in the 
source.  Oddly this is not a test class, like MockClock, as I would have 
expected.  If this is not part of the Tez API then it can be removed.





[jira] [Created] (TEZ-3535) YarnTaskScheduler can hold onto low priority containers until they expire

2016-11-10 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3535:
---

 Summary: YarnTaskScheduler can hold onto low priority containers 
until they expire
 Key: TEZ-3535
 URL: https://issues.apache.org/jira/browse/TEZ-3535
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.8.4, 0.7.1
Reporter: Jason Lowe
Assignee: Jason Lowe


With container reuse enabled, YarnTaskScheduler will retain but not schedule 
any container allocations that are lower priority than the highest priority 
task requests.  This can lead to poor performance as these lower priority 
containers clog up resources needed for high priority allocations.





[jira] [Created] (TEZ-3508) TestTaskScheduler cleanup

2016-11-02 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3508:
---

 Summary: TestTaskScheduler cleanup
 Key: TEZ-3508
 URL: https://issues.apache.org/jira/browse/TEZ-3508
 Project: Apache Tez
  Issue Type: Test
Reporter: Jason Lowe
Assignee: Jason Lowe


TestTaskScheduler is very fragile, since it builds mocks of the AMRM client 
that are tied very specifically to the particulars of the way the 
YarnTaskScheduler is coded.  Any variance there often leads to test failures 
because the mocks no longer accurately reflect what the real AMRM client does.

It would be much simpler and more robust to leverage the AMRMClientForTest and 
AMRMAsyncClientForTest classes in TestTaskSchedulerHelpers rather than maintain 
fragile mocks attempting to emulate the behaviors of those classes.





[jira] [Created] (TEZ-3491) Tez job can hang due to container priority inversion

2016-10-25 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3491:
---

 Summary: Tez job can hang due to container priority inversion
 Key: TEZ-3491
 URL: https://issues.apache.org/jira/browse/TEZ-3491
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.7.1
Reporter: Jason Lowe
Priority: Critical


If the Tez AM receives a container at a lower priority than the highest 
priority task being requested, it fails to assign that container to any task.  
In addition, if the container is new the AM refuses to release it while there 
are any pending tasks.  If it takes too long for the higher priority requests 
to be fulfilled (e.g.: the lower priority containers are filling the queue) 
then eventually YARN will expire the unused lower priority containers since 
they were never launched.  The Tez AM then never re-requests these lower 
priority containers, and the job hangs because the AM is waiting for containers 
from the RM that the RM already sent and expired.





[jira] [Created] (TEZ-3462) Task attempt failure during container shutdown loses useful container diagnostics

2016-10-06 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3462:
---

 Summary: Task attempt failure during container shutdown loses 
useful container diagnostics
 Key: TEZ-3462
 URL: https://issues.apache.org/jira/browse/TEZ-3462
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.7.1
Reporter: Jason Lowe


When a nodemanager kills a task attempt due to excessive memory usage it will 
send a SIGTERM followed by a SIGKILL.  It also sends a useful diagnostic 
message with the container completion event to the RM which will eventually 
make it to the AM on a subsequent heartbeat.

However if the JVM shutdown processing causes an error in the task (e.g.: 
filesystem being closed by shutdown hook) then the task attempt can report a 
failure before the useful NM diagnostic makes it to the AM.  The AM then 
records some other error as the task failure reason, and by the time the 
container completion status makes it to the AM it does not associate that error 
with the task attempt and the useful information is lost.





[jira] [Created] (TEZ-3444) Handling of fetch-failures should consider time spent producing output

2016-09-22 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3444:
---

 Summary: Handling of fetch-failures should consider time spent 
producing output
 Key: TEZ-3444
 URL: https://issues.apache.org/jira/browse/TEZ-3444
 Project: Apache Tez
  Issue Type: Improvement
Reporter: Jason Lowe


When handling fetch failures and deciding whether the upstream task should be 
re-run, we should consider how long the upstream task that generated the data 
being fetched took to run.  If the upstream task ran for a long time then we 
may want to retry the fetch a bit harder before deciding to re-run.  If the 
upstream task executed in a few seconds then we should probably re-run it more 
aggressively, since that may be cheaper than multiple retries that time out.
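
One way to picture the proposal, as a hedged sketch with made-up thresholds (none of this is from a patch):
{noformat}
import java.util.concurrent.TimeUnit;

// Illustrative heuristic: tolerate more fetch failures before re-running an
// upstream task whose output was expensive to produce.
final class RerunPolicy {
  private static final long EXPENSIVE_TASK_MS = TimeUnit.MINUTES.toMillis(10);

  static boolean shouldRerunUpstream(int fetchFailures, long upstreamRuntimeMs) {
    // Cheap upstream tasks are re-run quickly; expensive ones get more fetch retries first.
    int allowedFailures = upstreamRuntimeMs >= EXPENSIVE_TASK_MS ? 10 : 3;
    return fetchFailures >= allowedFailures;
  }
}
{noformat}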






[jira] [Created] (TEZ-3415) Ability to configure shuffle server listen queue length

2016-08-19 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3415:
---

 Summary: Ability to configure shuffle server listen queue length
 Key: TEZ-3415
 URL: https://issues.apache.org/jira/browse/TEZ-3415
 Project: Apache Tez
  Issue Type: Sub-task
Reporter: Jason Lowe








[jira] [Resolved] (TEZ-3336) Hive map-side join job sometimes fails with ROOT_INPUT_INIT_FAILURE

2016-08-09 Thread Jason Lowe (JIRA)

[ https://issues.apache.org/jira/browse/TEZ-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe resolved TEZ-3336.
-
Resolution: Invalid

Closing this as invalid since it seems like a problem with Hive's use of Tez 
rather than Tez itself.  [~mithun] please reopen with details if you find 
otherwise.

> Hive map-side join job sometimes fails with ROOT_INPUT_INIT_FAILURE
> ---
>
> Key: TEZ-3336
> URL: https://issues.apache.org/jira/browse/TEZ-3336
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.7.1
>Reporter: Jason Lowe
>
> When Hive does a map-side join it can generate a DAG where a vertex has two 
> inputs, one from an upstream task and another using MRInputAMSplitGenerator.  
> If it takes a while for MRInputAMSplitGenerator to compute the splits and one 
> of the tasks for the other upstream vertex completes then the job can fail 
> with an error since MRInputAMSplitGenerator does not expect to receive any 
> events.





[jira] [Created] (TEZ-3368) NPE in DelayedContainerManager

2016-07-20 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3368:
---

 Summary: NPE in DelayedContainerManager
 Key: TEZ-3368
 URL: https://issues.apache.org/jira/browse/TEZ-3368
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.7.1
Reporter: Jason Lowe


Saw a Tez AM hang due to an NPE in the DelayedContainerManager:
{noformat}
2016-07-17 01:53:23,157 [ERROR] [DelayedContainerManager] 
|yarn.YarnUncaughtExceptionHandler|: Thread 
Thread[DelayedContainerManager,5,main] threw an Exception.
java.lang.NullPointerException
at 
org.apache.tez.dag.app.rm.TezAMRMClientAsync.getMatchingRequestsForTopPriority(TezAMRMClientAsync.java:142)
at 
org.apache.tez.dag.app.rm.YarnTaskSchedulerService.getMatchingRequestWithoutPriority(YarnTaskSchedulerService.java:1474)
at 
org.apache.tez.dag.app.rm.YarnTaskSchedulerService.access$500(YarnTaskSchedulerService.java:84)
at 
org.apache.tez.dag.app.rm.YarnTaskSchedulerService$NodeLocalContainerAssigner.assignReUsedContainer(YarnTaskSchedulerService.java:1869)
at 
org.apache.tez.dag.app.rm.YarnTaskSchedulerService.assignReUsedContainerWithLocation(YarnTaskSchedulerService.java:1753)
at 
org.apache.tez.dag.app.rm.YarnTaskSchedulerService.assignDelayedContainer(YarnTaskSchedulerService.java:733)
at 
org.apache.tez.dag.app.rm.YarnTaskSchedulerService.access$600(YarnTaskSchedulerService.java:84)
at 
org.apache.tez.dag.app.rm.YarnTaskSchedulerService$DelayedContainerManager.run(YarnTaskSchedulerService.java:2030)
{noformat}

After the DelayedContainerManager thread exited the AM proceeded to receive 
requested containers that would go unused until the container allocations 
expired.  Then they would be re-requested, and the cycle repeated indefinitely.





[jira] [Created] (TEZ-3350) Shuffle spills are not spilled to a container-specific directory

2016-07-14 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3350:
---

 Summary: Shuffle spills are not spilled to a container-specific 
directory
 Key: TEZ-3350
 URL: https://issues.apache.org/jira/browse/TEZ-3350
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.7.1
Reporter: Jason Lowe


If a Tez task receives too much input data and needs to spill the inputs to 
disk it ends up using a path that is not container-specific.  Therefore YARN 
will not automatically clean up these files when the container exits as it 
should, and instead the files linger until the entire application completes.





[jira] [Created] (TEZ-3306) Improve container priority assignments for vertices

2016-06-16 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3306:
---

 Summary: Improve container priority assignments for vertices
 Key: TEZ-3306
 URL: https://issues.apache.org/jira/browse/TEZ-3306
 Project: Apache Tez
  Issue Type: Improvement
Reporter: Jason Lowe


After TEZ-3296 the priority space is sparsely used.  We should consider doing a 
breadth-first traversal of the DAG or reusing the client-side topological 
sorting to allow a more efficient use of the priority space.





[jira] [Created] (TEZ-3296) Tez job can hang if two vertices at the same root distance have different task requirements

2016-06-09 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3296:
---

 Summary: Tez job can hang if two vertices at the same root 
distance have different task requirements
 Key: TEZ-3296
 URL: https://issues.apache.org/jira/browse/TEZ-3296
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.7.1
Reporter: Jason Lowe
Priority: Critical


When two vertices have the same distance from the root, Tez will schedule 
containers with the same priority.  However those vertices could have different 
task requirements and therefore different container capabilities.  As documented 
in YARN-314, YARN currently doesn't support requests for multiple sizes at the 
same priority.  In practice this leads to one vertex's allocation requests 
clobbering the other's, and that can result in a situation where the Tez AM is 
waiting on containers it will never receive from the RM.






[jira] [Created] (TEZ-3293) Fetch failures can cause a shuffle hang waiting for memory merge that never starts

2016-06-08 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3293:
---

 Summary: Fetch failures can cause a shuffle hang waiting for 
memory merge that never starts
 Key: TEZ-3293
 URL: https://issues.apache.org/jira/browse/TEZ-3293
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.8.3, 0.7.1
Reporter: Jason Lowe
Assignee: Jason Lowe


Tez jobs can hang in shuffle waiting for a memory merge that never starts.  
When a MapOutput is reserved it increments usedMemory, but when it is 
unreserved it decrements usedMemory _and_ commitMemory.  If enough sufficiently 
large shuffle fetches fail, commitMemory may never reach the merge threshold 
even after all outstanding transfers have committed, and the shuffle hangs.
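
A stripped-down model of the accounting described above (not the real merge-manager code) that shows why decrementing commitMemory for a fetch that never committed can leave the merge threshold unreachable:
{noformat}
// Simplified model of the shuffle memory accounting; the real code lives in the merge manager.
final class ShuffleMemoryAccounting {
  private long usedMemory;    // bytes reserved for in-flight fetches
  private long commitMemory;  // bytes from fetches that completed successfully

  synchronized void reserve(long bytes) {
    usedMemory += bytes;                 // reservation only touches usedMemory
  }

  synchronized void commitSuccessfulFetch(long bytes) {
    commitMemory += bytes;               // only committed data should count toward the merge threshold
  }

  synchronized void unreserveFailedFetch(long bytes) {
    usedMemory -= bytes;
    // Subtracting from commitMemory here as well -- for data that was never
    // committed -- is the bug described above: enough large failed fetches can
    // keep commitMemory below the merge threshold forever, hanging the shuffle.
  }
}
{noformat}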





[jira] [Created] (TEZ-3260) Ability to disable IFile checksum verification during shuffle transfers

2016-05-16 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3260:
---

 Summary: Ability to disable IFile checksum verification during 
shuffle transfers
 Key: TEZ-3260
 URL: https://issues.apache.org/jira/browse/TEZ-3260
 Project: Apache Tez
  Issue Type: Improvement
Reporter: Jason Lowe


In TEZ-3237 [~rajesh.balamohan] requested the ability to avoid the 
computational expense of verifying IFile checksums during shuffle transfers for 
cases where the user is not concerned about data corruption and would like the 
additional performance.





[jira] [Created] (TEZ-3246) Improve diagnostics when DAG killed by user

2016-05-06 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3246:
---

 Summary: Improve diagnostics when DAG killed by user
 Key: TEZ-3246
 URL: https://issues.apache.org/jira/browse/TEZ-3246
 Project: Apache Tez
  Issue Type: Improvement
Reporter: Jason Lowe


It would be nice if the DAG diagnostics included the user and host that 
originated the kill request for a DAG.





[jira] [Created] (TEZ-3244) Allow overlap of input and output memory when they are not concurrent

2016-05-06 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3244:
---

 Summary: Allow overlap of input and output memory when they are 
not concurrent
 Key: TEZ-3244
 URL: https://issues.apache.org/jira/browse/TEZ-3244
 Project: Apache Tez
  Issue Type: Bug
Reporter: Jason Lowe


For cases where memory for inputs and outputs is not needed simultaneously it 
would be more efficient to allow inputs to use the memory normally set aside 
for outputs, and vice versa.





[jira] [Created] (TEZ-3237) Corrupted shuffle transfers to disk are not detected during transfer

2016-04-29 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3237:
---

 Summary: Corrupted shuffle transfers to disk are not detected 
during transfer
 Key: TEZ-3237
 URL: https://issues.apache.org/jira/browse/TEZ-3237
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Jason Lowe


When a shuffle transfer is larger than the single transfer limit it gets 
written straight to disk during the transfer.  Unfortunately there are no 
checksum validations performed during that transfer, so if the data is 
corrupted at the source or during transmit it goes undetected.  Only later when 
the task tries to consume the transferred data is the error detected, but at 
that point it's too late to blame the source task for the error.
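
For illustration only (Tez shuffle data uses IFile checksums, not the plain CRC32 shown here): verifying while copying to disk would let corruption be detected during the transfer, when the source task can still be blamed.
{noformat}
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.CRC32;
import java.util.zip.CheckedInputStream;

// Illustrative copy loop that computes a checksum while spilling the fetch to disk.
final class VerifyingCopy {
  static long copyWithChecksum(InputStream in, OutputStream diskOut) throws IOException {
    CheckedInputStream checked = new CheckedInputStream(in, new CRC32());
    byte[] buf = new byte[64 * 1024];
    int n;
    while ((n = checked.read(buf)) != -1) {
      diskOut.write(buf, 0, n);
    }
    return checked.getChecksum().getValue();  // caller compares this with the sender-side checksum
  }
}
{noformat}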





[jira] [Created] (TEZ-3213) Uncaught exception during vertex recovery leads to invalid state transition loop

2016-04-13 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3213:
---

 Summary: Uncaught exception during vertex recovery leads to 
invalid state transition loop
 Key: TEZ-3213
 URL: https://issues.apache.org/jira/browse/TEZ-3213
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Jason Lowe


If an uncaught exception occurs during a state transition while a vertex is in 
the RECOVERING state, a V_INTERNAL_ERROR event is delivered to the state 
machine, but that event is not handled in the RECOVERING state.  The resulting 
invalid transition triggers another V_INTERNAL_ERROR event, and the state 
machine loops, logging invalid transitions indefinitely.





[jira] [Created] (TEZ-3203) DAG hangs when one of the upstream vertices has zero tasks

2016-04-07 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3203:
---

 Summary: DAG hangs when one of the upstream vertices has zero tasks
 Key: TEZ-3203
 URL: https://issues.apache.org/jira/browse/TEZ-3203
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Jason Lowe
Priority: Critical


A DAG hangs during execution if it has a vertex with multiple inputs and one of 
those upstream vertices has zero tasks and is using ShuffleVertexManager.





[jira] [Created] (TEZ-3193) Deadlock in AM during task commit request

2016-03-31 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3193:
---

 Summary: Deadlock in AM during task commit request
 Key: TEZ-3193
 URL: https://issues.apache.org/jira/browse/TEZ-3193
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.8.2, 0.7.1
Reporter: Jason Lowe
Priority: Blocker


The AM can deadlock between TaskImpl and TaskAttemptImpl.  Stacktrace and 
details in a followup comment.






[jira] [Created] (TEZ-3191) NM container diagnostics for excess resource usage can be lost if task fails while being killed

2016-03-30 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3191:
---

 Summary: NM container diagnostics for excess resource usage can be 
lost if task fails while being killed
 Key: TEZ-3191
 URL: https://issues.apache.org/jira/browse/TEZ-3191
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Jason Lowe


This is the Tez version of MAPREDUCE-4955.  I saw a misconfigured Tez job 
report a task attempt as failed due to a filesystem closed error because the NM 
killed the container due to excess memory usage.  Unfortunately the SIGTERM 
sent by the NM caused the filesystem shutdown hook to close the filesystems, 
and that triggered a failure in the main thread.  If the failure is reported to 
the AM via the umbilical before the NM container status is received via the RM 
then the useful container diagnostics from the NM are lost in the job history.





[jira] [Created] (TEZ-3167) TestRecovery occasionally times out

2016-03-19 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3167:
---

 Summary: TestRecovery occasionally times out
 Key: TEZ-3167
 URL: https://issues.apache.org/jira/browse/TEZ-3167
 Project: Apache Tez
  Issue Type: Bug
Reporter: Jason Lowe


TestRecovery has been timing out sporadically in precommit builds.





[jira] [Created] (TEZ-3141) mapreduce.task.timeout is not translated to container heartbeat timeout

2016-02-25 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3141:
---

 Summary: mapreduce.task.timeout is not translated to container 
heartbeat timeout
 Key: TEZ-3141
 URL: https://issues.apache.org/jira/browse/TEZ-3141
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.7.1
Reporter: Jason Lowe
Assignee: Jason Lowe


TEZ-2966 added the deprecation to the runtime key map, but the container 
timeout is an AM-level property, so the runtime map translation never applies 
to it.





[jira] [Created] (TEZ-3114) Shuffle OOM due to EventMetaData flood

2016-02-11 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3114:
---

 Summary: Shuffle OOM due to EventMetaData flood
 Key: TEZ-3114
 URL: https://issues.apache.org/jira/browse/TEZ-3114
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Jason Lowe


A task encountered an OOM during shuffle, and investigation of the heap dump 
showed a lot of memory being consumed by almost 3.5 million EventMetaData 
objects.  Auto-parallelism had reduced the number of tasks in the vertex to 1 
and there were 2000 upstream tasks to shuffle.





[jira] [Created] (TEZ-3115) Shuffle string handling adds significant memory overhead

2016-02-11 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3115:
---

 Summary: Shuffle string handling adds significant memory overhead
 Key: TEZ-3115
 URL: https://issues.apache.org/jira/browse/TEZ-3115
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Jason Lowe


While investigating the OOM heap dump from TEZ-3114 I noticed that the 
ShuffleManager and other shuffle-related objects were holding onto many strings 
that added up to over a hundred megabytes of memory.
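
One common remedy, sketched here with a hypothetical helper (not code from Tez), is to canonicalize the repeated host/path strings so identical values share one instance:
{noformat}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical deduplication helper for strings that repeat across many shuffle inputs.
final class StringCanonicalizer {
  private final ConcurrentMap<String, String> pool = new ConcurrentHashMap<String, String>();

  String canonicalize(String s) {
    if (s == null) {
      return null;
    }
    String prior = pool.putIfAbsent(s, s);
    return prior != null ? prior : s;   // return the pooled instance when one already exists
  }
}
{noformat}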





[jira] [Created] (TEZ-3102) Fetch failure of a speculated task causes job hang

2016-02-08 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3102:
---

 Summary: Fetch failure of a speculated task causes job hang
 Key: TEZ-3102
 URL: https://issues.apache.org/jira/browse/TEZ-3102
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Jason Lowe
Assignee: Jason Lowe
Priority: Critical


If a task speculates then succeeds, one task will be marked successful and the 
other killed. Then if the task retroactively fails due to fetch failures the 
Tez AM will fail to reschedule another task. This results in a hung job.





[jira] [Created] (TEZ-3066) TaskAttemptFinishedEvent ConcurrentModificationException if processed by RecoveryService and history logging simultaneously

2016-01-20 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3066:
---

 Summary: TaskAttemptFinishedEvent ConcurrentModificationException 
if processed by RecoveryService and history logging simultaneously
 Key: TEZ-3066
 URL: https://issues.apache.org/jira/browse/TEZ-3066
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Jason Lowe


A ConcurrentModificationException can occur if a TaskAttemptFinishedEvent is 
processed simultaneously by the recovery service and another history logging 
service.  Sample stacktraces to follow.





[jira] [Created] (TEZ-3051) Vertex failed with invalid event DAG_VERTEX_RERUNNING at SUCCEEDED

2016-01-19 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3051:
---

 Summary: Vertex failed with invalid event DAG_VERTEX_RERUNNING at 
SUCCEEDED
 Key: TEZ-3051
 URL: https://issues.apache.org/jira/browse/TEZ-3051
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Jason Lowe


I saw a job fail due to an internal error on a vertex: 
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: 
DAG_VERTEX_RERUNNING at SUCCEEDED

Stacktrace to follow.





[jira] [Created] (TEZ-3009) Errors that occur during container task acquisition are not logged

2015-12-17 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3009:
---

 Summary: Errors that occur during container task acquisition are 
not logged
 Key: TEZ-3009
 URL: https://issues.apache.org/jira/browse/TEZ-3009
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Jason Lowe


If TezChild encounters an error while trying to obtain a task, the error is 
silently handled.  This results in a mysterious shutdown of containers with 
no indication of the cause.





[jira] [Created] (TEZ-3010) Container task acquisition has no retries for errors

2015-12-17 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3010:
---

 Summary: Container task acquisition has no retries for errors
 Key: TEZ-3010
 URL: https://issues.apache.org/jira/browse/TEZ-3010
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Jason Lowe


There are no retries for errors that occur during task acquisition.  If any 
error occurs the container just shuts down, resulting in task attempt failures 
if a task attempt happened to be assigned to the container by the AM.  The 
container should try harder to obtain the task before giving up.
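
A rough sketch of the requested behavior, with a made-up fetcher interface and retry policy (nothing here is from a patch):
{noformat}
import java.io.IOException;

// Illustrative retry wrapper around task acquisition; assumes maxAttempts >= 1.
final class TaskAcquisition {
  interface TaskFetcher {
    Object getTask() throws IOException;   // stand-in for the umbilical getTask call
  }

  static Object getTaskWithRetries(TaskFetcher fetcher, int maxAttempts)
      throws IOException, InterruptedException {
    IOException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return fetcher.getTask();
      } catch (IOException e) {
        last = e;                         // would also be logged (see TEZ-3009) instead of swallowed
        Thread.sleep(1000L * attempt);    // simple linear backoff between attempts
      }
    }
    throw last;
  }
}
{noformat}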


