Re: [VOTE] Release Apache Tez-0.10.1 RC0

2021-06-23 Thread Jason Lowe
Thanks for organizing this release!

Note that I could not find the signing key at
https://dist.apache.org/repos/dist/release/tez/KEYS.

Also I noticed the copyright in the NOTICE.txt file is out of date. Other
Apache projects have the clause “and onwards” after the initial copyright
date to cover this.

Otherwise +1 for the release. I verified the signatures and digests, built
from source from git tag and successfully ran simple tests.

Jason

On Sun, Jun 20, 2021 at 3:04 AM László Bodor 
wrote:

> Hi Team!
>
> I have created a tez-0.10.1 release candidate rc0.
> GIT source tag (release-0.10.1-rc0)
>
>
> https://gitbox.apache.org/repos/asf?p=tez.git;a=commit;h=refs/tags/release-0.10.1-rc0
> (355fbc14caeaefab08cb2045f2d9d83435c5be70)
>
> Staging site:
> https://dist.apache.org/repos/dist/dev/tez/apache-tez-0.10.1-rc0/ (svn
> revision: 48404)
>
> PGP release keys (signed using 0x4ECA5CA5E303605A)
> http://pgp.mit.edu:11371/pks/lookup?op=vindex&search=0x4ECA5CA5E303605A
>
> KEYS file available at https://dist.apache.org/repos/dist/release/tez/KEYS
>
> One can look into the issues fixed in this release at:
>
> https://issues.apache.org/jira/browse/TEZ-4309?jql=project%20%3D%20%22Apache%20Tez%22%20%20and%20fixVersion%20%3D%20%220.10.1%22
>
> Vote will be open for at least 72 hours.
> [ ] +1 approve
> [ ] +0 no opinion
> [ ] -1 disapprove (and reason why)
>
> Regards,
> Laszlo Bodor
>


Re: [VOTE] Release Apache Tez-0.10.0 RC1

2020-10-16 Thread Jason Lowe
Thanks for putting together the release, László!  Note that your signing
key was not listed in the tez/KEYS file. That should be corrected.

Otherwise +1

- verified signatures and digests
- successfully built from source
- examined LICENSE and NOTICE

Jason

On Mon, Oct 12, 2020 at 2:54 AM László Bodor 
wrote:

> Hi Team!
>
> This is a kind reminder about RC1; let me proceed with that, and I would
> appreciate +1s.
> Just remember, if RC0 was fine, RC1 will be perfect. :)
>
> Changes since RC0:
> https://issues.apache.org/jira/browse/TEZ-4228
> https://issues.apache.org/jira/browse/TEZ-4230
> https://issues.apache.org/jira/browse/TEZ-4234
> https://issues.apache.org/jira/browse/TEZ-4238
>
> Regards,
> Laszlo Bodor
>
> On Thu, 8 Oct 2020 at 18:57, László Bodor 
> wrote:
>
> > Hi Team!
> >
> > I have created a tez-0.10.0 release candidate rc1.
> > GIT source tag (release-0.10.0-rc1)
> >
> >
> >
> https://gitbox.apache.org/repos/asf?p=tez.git;a=commit;h=refs/tags/release-0.10.0-rc1
> >  (22fec6c0ecc7ebe6f6f28800935cc6f69794dad5)
> >
> > Staging site:
> > https://dist.apache.org/repos/dist/dev/tez/apache-tez-0.10.0-rc1/ (svn
> > revision: 41851)
> >
> > PGP release keys (signed using 0x4ECA5CA5E303605A)
> > http://pgp.mit.edu:11371/pks/lookup?op=vindex&search=0x4ECA5CA5E303605A
> >
> > KEYS file available at
> https://dist.apache.org/repos/dist/release/tez/KEYS
> >
> > One can look into the issues fixed in this release at
> >
> https://issues.apache.org/jira/browse/TEZ-4230?jql=project%20%3D%20%22Apache%20Tez%22%20%20and%20fixVersion%20%3D%20%220.10.0%22
> >
> > Vote will be open for at least 72 hours.
> > [ ] +1 approve
> > [ ] +0 no opinion
> > [ ] -1 disapprove (and reason why)
> >
> > Regards,
> > Laszlo Bodor
> >
>


Re: [DISCUSS] Early Move to gitbox

2019-01-07 Thread Jason Lowe
Sorry for the late reply.  I'm +1 for getting this moved early.  It sounds
like the mandatory move could be inconvenient and surprising.  Better to do
this on our own terms.

Jason

On Fri, Dec 14, 2018 at 4:54 PM Jonathan Eagles  wrote:

> Apache Tez git repository is in git-wip-us server and it will be
> decommissioned.
> Please discuss issues and a preferred timeline, and I'll file a JIRA ticket
> with INFRA to
> migrate to https://gitbox.apache.org/ and update the documentation.
>
> According to ASF infra team, the timeframe is as follows:
>
> > - December 9th 2018 -> January 9th 2019: Voluntary (coordinated)
> relocation
> > - January 9th -> February 6th: Mandated (coordinated) relocation
> > - February 7th: All remaining repositories are mass migrated.
> > This timeline may change to accommodate various scenarios.
>
> If we get consensus by January 9th, I can file a ticket with INFRA and
> migrate it.
> Even if we cannot reach consensus, the repository will be migrated by
> February 7th.
>
> Regards,
> jeagles
>
> 
> ORIGINAL NOTICE
> 
> Hello Apache projects,
>
> I am writing to you because you may have git repositories on the
> git-wip-us server, which is slated to be decommissioned in the coming
> months. All repositories will be moved to the new gitbox service which
> includes direct write access on github as well as the standard ASF
> commit access via gitbox.apache.org.
>
> ## Why this move? ##
> The move comes as a result of retiring the git-wip service, as the
> hardware it runs on is longing for retirement. In lieu of this, we
> have decided to consolidate the two services (git-wip and gitbox), to
> ease the management of our repository systems and future-proof the
> underlying hardware. The move is fully automated, and ideally, nothing
> will change in your workflow other than added features and access to
> GitHub.
>
> ## Timeframe for relocation ##
> Initially, we are asking that projects voluntarily request to move
> their repositories to gitbox, hence this email. The voluntary
> timeframe is between now and January 9th 2019, during which projects
> are free to either move over to gitbox or stay put on git-wip. After
> this phase, we will be requiring the remaining projects to move within
> one month, after which we will move the remaining projects over.
>
> To have your project moved in this initial phase, you will need:
>
> - Consensus in the project (documented via the mailing list)
> - File a JIRA ticket with INFRA to voluntarily move your project repos
>over to gitbox (as stated, this is highly automated and will take
>between a minute and an hour, depending on the size and number of
>your repositories)
>
> To sum up the preliminary timeline;
>
> - December 9th 2018 -> January 9th 2019: Voluntary (coordinated)
>relocation
> - January 9th -> February 6th: Mandated (coordinated) relocation
> - February 7th: All remaining repositories are mass migrated.
>
> This timeline may change to accommodate various scenarios.
>
> ## Using GitHub with ASF repositories ##
> When your project has moved, you are free to use either the ASF
> repository system (gitbox.apache.org) OR GitHub for your development
> and code pushes. To be able to use GitHub, please follow the primer
> at: https://reference.apache.org/committer/github
>
>
> We appreciate your understanding of this issue, and hope that your
> project can coordinate voluntarily moving your repositories in a
> timely manner.
>
> All settings, such as commit mail targets, issue linking, PR
> notification schemes etc will automatically be migrated to gitbox as
> well.
>
> With regards, Daniel on behalf of ASF Infra.
>
> PS: For inquiries, please reply to us...@infra.apache.org, not your
> project's dev list :-).
>


[jira] [Resolved] (TEZ-3982) DAGAppMaster and tasks should not report negative or invalid progress

2018-09-21 Thread Jason Lowe (JIRA)


 [ 
https://issues.apache.org/jira/browse/TEZ-3982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe resolved TEZ-3982.
-
   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: 0.9.2

Thanks, [~kshukla]!  I committed this to branch-0.9.

> DAGAppMaster and tasks should not report negative or invalid progress
> -
>
> Key: TEZ-3982
> URL: https://issues.apache.org/jira/browse/TEZ-3982
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.9.1, 0.10.0
>Reporter: Kuhu Shukla
>Assignee: Kuhu Shukla
>Priority: Major
> Fix For: 0.9.2, 0.10.1
>
> Attachments: TEZ-3982.001.patch, TEZ-3982.002.patch, 
> TEZ-3982.003.patch, TEZ-3982.004.patch, TEZ-3982.005.branch-0.9.patch
>
>
> The AM fails (AMRMClient expects non-negative progress) if any component 
> reports invalid or negative progress; DAGAppMaster/tasks should check and 
> report accordingly to allow the AM to execute.
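For illustration only, here is a minimal sketch of the kind of sanity check the report asks for: clamping a reported progress value into the [0, 1] range AMRMClient accepts. This is not the committed patch; the class and method names are invented.
{noformat}
// Illustrative sketch, not the committed TEZ-3982 change: clamp a
// possibly-invalid progress value into the [0.0, 1.0] range AMRMClient expects.
public final class ProgressUtil {
  private ProgressUtil() {}

  public static float sanitize(float progress) {
    if (Float.isNaN(progress) || progress < 0.0f) {
      return 0.0f;                      // NaN or negative progress is reported as 0
    }
    return Math.min(progress, 1.0f);    // cap runaway values at 100%
  }
}
{noformat}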



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Reopened] (TEZ-3982) DAGAppMaster and tasks should not report negative or invalid progress

2018-09-21 Thread Jason Lowe (JIRA)


 [ 
https://issues.apache.org/jira/browse/TEZ-3982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe reopened TEZ-3982:
-

This broke the branch-0.9 build.  It looks like MonotonicClock isn't in the 
version of Hadoop that branch-0.9 depends upon:
{noformat}
[ERROR] 
/tez/tez-dag/src/test/java/org/apache/tez/dag/app/TestDAGAppMaster.java:[17,35] 
cannot find symbol
  symbol:   class MonotonicClock
  location: package org.apache.hadoop.yarn.util
[ERROR] 
/tez/tez-dag/src/test/java/org/apache/tez/dag/app/TestDAGAppMaster.java:[431,32]
 cannot find symbol
  symbol:   class MonotonicClock
  location: class org.apache.tez.dag.app.TestDAGAppMaster
[INFO] 2 errors 
{noformat}

I reverted this from branch-0.9 to fix the build.
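For context, one way to get a monotonic clock without MonotonicClock (which only appeared in later Hadoop 2.x releases) is to implement the long-standing org.apache.hadoop.yarn.util.Clock interface on top of Time.monotonicNow(). This is only a hedged sketch, not the change that went into the branch, and it assumes Time.monotonicNow() exists in the Hadoop version branch-0.9 builds against.
{noformat}
import org.apache.hadoop.util.Time;
import org.apache.hadoop.yarn.util.Clock;

// Hypothetical stand-in, not the branch-0.9 fix: a monotonic Clock built only
// from APIs that predate org.apache.hadoop.yarn.util.MonotonicClock.
public class CompatMonotonicClock implements Clock {
  @Override
  public long getTime() {
    return Time.monotonicNow();   // monotonic milliseconds from hadoop-common
  }
}
{noformat}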

> DAGAppMaster and tasks should not report negative or invalid progress
> -
>
> Key: TEZ-3982
> URL: https://issues.apache.org/jira/browse/TEZ-3982
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.9.1, 0.10.0
>Reporter: Kuhu Shukla
>Assignee: Kuhu Shukla
>Priority: Major
> Fix For: 0.10.1
>
> Attachments: TEZ-3982.001.patch, TEZ-3982.002.patch, 
> TEZ-3982.003.patch, TEZ-3982.004.patch
>
>
> The AM fails (AMRMClient expects non-negative progress) if any component 
> reports invalid or negative progress; DAGAppMaster/tasks should check and 
> report accordingly to allow the AM to execute.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (TEZ-3989) Fix by-laws related to emeritus clause

2018-09-13 Thread Jason Lowe (JIRA)


 [ 
https://issues.apache.org/jira/browse/TEZ-3989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe resolved TEZ-3989.
-
   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: 0.10.1

Thanks, [~hitesh]! I committed this to master.

> Fix by-laws related to emeritus clause 
> ---
>
> Key: TEZ-3989
> URL: https://issues.apache.org/jira/browse/TEZ-3989
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Hitesh Shah
>Assignee: Hitesh Shah
>Priority: Major
> Fix For: 0.10.1
>
>
> The emeritus clause is not valid and needs to be updated.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: 0.10.next release version

2018-07-12 Thread Jason Lowe
I've created a 0.10.1 release in JIRA.

Jason


On Wed, Jul 11, 2018 at 2:31 PM, Eric Wohlstadter 
wrote:

> In preparation for 0.10.0 release planning, can someone help to add a
> 0.10.next Release Version to the JIRA?
>
> Thanks!
>


[jira] [Created] (TEZ-3935) DAG aware scheduler should release unassigned new containers rather than hold them

2018-05-14 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3935:
---

 Summary: DAG aware scheduler should release unassigned new 
containers rather than hold them
 Key: TEZ-3935
 URL: https://issues.apache.org/jira/browse/TEZ-3935
 Project: Apache Tez
  Issue Type: Bug
Reporter: Jason Lowe
Assignee: Jason Lowe


I saw a case for a very large job with many containers where the DAG aware 
scheduler was getting behind on assigning containers.  Newly assigned 
containers were not finding any matching request, so they were queued for reuse 
processing.  However it took so long to get through all of the task and 
container events that the container allocations expired before the container 
was finally assigned and attempted to be launched.

Newly assigned containers are assigned to their matching requests, even if that 
violates the DAG priorities, so it should be safe to simply release these if no 
tasks could be found to use them.  The matching request has either been removed 
or already satisfied with a reused container.  Besides, if we can't find any 
tasks to take the newly assigned container then it is very likely we have 
plenty of reusable containers already, and keeping more containers just makes 
the job a resource hog on the cluster.
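To make the proposed behavior concrete, here is a hypothetical sketch (not the DAG-aware scheduler's real code; the class and the matching decision are stand-ins) showing a new container being released back to YARN when no pending request matches it:
{noformat}
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.client.api.async.AMRMClientAsync;

// Hypothetical illustration only: release a newly allocated container whose
// matching request no longer exists instead of queuing it for reuse.
public class NewContainerHandler {
  private final AMRMClientAsync<?> amRmClient;

  public NewContainerHandler(AMRMClientAsync<?> amRmClient) {
    this.amRmClient = amRmClient;
  }

  /** Returns true if the container was released because no pending task matched it. */
  public boolean releaseIfUnmatched(Container container, boolean hasMatchingRequest) {
    if (hasMatchingRequest) {
      return false;   // normal assignment path
    }
    // The original request was cancelled or satisfied by a reused container;
    // holding the allocation only risks it expiring unused.
    amRmClient.releaseAssignedContainer(container.getId());
    return true;
  }
}
{noformat}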



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [VOTE] Move master to Hadoop 3+ and create separate 0.9.x line

2018-04-11 Thread Jason Lowe
There was a discussion thread that was started two weeks before the
vote thread, see
http://mail-archives.apache.org/mod_mbox/tez-dev/201803.mbox/browser.
Granted there weren't many comments, but there was a discussion thread
with no voiced objections well in advance of the vote thread.

Jason


On Tue, Apr 10, 2018 at 10:18 AM, Jonathan Eagles <jeag...@gmail.com> wrote:
> Thoughts/Inputs/Discussion from Pig/Hive/Flink/Scalding/Scope communities?
>
> I wish we had used a discussion thread to gather more input from
> Pig/Hive/Flink/Scalding/Scope community before starting this vote whose
> outcome affects them. Without discussion or votes from those communities
> I'm not sure the community support for this decision. Should we consider
> canceling this vote to gather input first?
>
> On Mon, Apr 9, 2018 at 10:09 AM, Kuhu Shukla <kshu...@oath.com.invalid>
> wrote:
>
>> +1.
>>
>> Thank you Eric for floating the proposal.
>>
>> Regards,
>> Kuhu
>>
>> On Mon, Apr 9, 2018 at 9:56 AM, Jason Lowe <jl...@oath.com.invalid> wrote:
>>
>> > +1
>> >
>> > Jason
>> >
>> > On Fri, Apr 6, 2018 at 4:45 PM, Eric Wohlstadter <wohls...@cs.ubc.ca>
>> > wrote:
>> > > Please vote (binding or non-binding) on the following proposal. The vote
>> > will
>> > > be open until 3pm (Pacific) April 13th.
>> > >
>> > >
>> > > Proposal: Move master to support minimum Hadoop 3+ (0.10.x line) and
>> > create
>> > > separate branch for Hadoop 2 (0.9.x line)
>> > >
>> > >
>> > > Details:
>> > >
>> > >
>> > >
>> > >- Tez master branch would support only Hadoop 3+ moving forward
>> > >
>> > >
>> > >- As a general policy, Maven dependencies on master are required not
>> > to
>> > >have conflicts with the dependencies of the corresponding minimum
>> > >supported Hadoop (the dependency versions can vary between Tez
>> master
>> > and
>> > >Hadoop if the versions are advertised as compatible by the
>> dependency
>> > >provider).
>> > >
>> > >- As a general policy, dependency conflicts between Tez and Hadoop
>> > >should be resolved by using compatible jars. Shims/Shading could be
>> > used on
>> > >a case-by-case basis, but not as a general policy.
>> > >
>> > >
>> > >- A separate branch and distribution (e.g. on Maven Central) will be
>> > >created to maintain the 0.9.x line with minimum support for Hadoop
>> > 2.7.x
>> > >
>> > >
>> > >
>> > >- Bug fixes would be required to be pushed both to master and the
>> > >0.9.x line (unless they are specific to one of them)
>> > >
>> > >
>> > >
>> > >- Major feature or performance improvements would be required to be
>> > >pushed to both master and the 0.9.x line (unless they require Hadoop
>> > 3+ or
>> > >have dependent library conflicts with Hadoop 2.x, in which case they
>> > may be
>> > >pushed only to master)
>> > >
>> > >
>> > >
>> > >- Minor feature or performance improvements can be pushed only to
>> > master
>> >
>>


[jira] [Reopened] (TEZ-3913) Precommit build fails to post to JIRA

2018-04-09 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe reopened TEZ-3913:
-

> Precommit build fails to post to JIRA
> -
>
> Key: TEZ-3913
> URL: https://issues.apache.org/jira/browse/TEZ-3913
> Project: Apache Tez
>  Issue Type: Bug
>    Reporter: Jason Lowe
>    Assignee: Jason Lowe
>Priority: Major
> Fix For: 0.9.2
>
> Attachments: TEZ-3913.001.patch
>
>
> The precommit build is failing to post comments to Jira due to a 404 error:
> {noformat}
> Unable to log in to server: 
> https://issues.apache.org/jira/rpc/soap/jirasoapservice-v2 with user: tezqa.
>  Cause: (404)404
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (TEZ-3913) Precommit build fails to post to JIRA

2018-04-09 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3913:
---

 Summary: Precommit build fails to post to JIRA
 Key: TEZ-3913
 URL: https://issues.apache.org/jira/browse/TEZ-3913
 Project: Apache Tez
  Issue Type: Bug
Reporter: Jason Lowe
Assignee: Jason Lowe


The precommit build is failing to post comments to Jira due to a 404 error:
{noformat}
Unable to log in to server: 
https://issues.apache.org/jira/rpc/soap/jirasoapservice-v2 with user: tezqa.
 Cause: (404)404
{noformat}




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [DISCUSS] Aligning master with Hadoop 3 and create separate 0.9.x line

2018-03-26 Thread Jason Lowe
+1 for incrementing the required Hadoop version from 2.7 as long as we
continue to push bugfixes to the 0.9 line for a while.  We currently
have a "hadoop28" profile in Tez which is mostly compatible with
Hadoop 3.x, but it does not get much testing.  There is no release
vehicle for it, and it does not even get tested from the precommit
build.  Promoting this or a 3.x profile to the main build is the most
straightforward way to get it tested and released in an
easy-to-consume form.

This does mean we would need to maintain two release lines for a
while, at least until users and downstream projects migrate away from
Hadoop 2.7.  We've done two lines before (even three, if we consider
the days of 0.9.x, 0.8.x, and 0.7.x all co-existing), and in this case
I think the cost of maintaining those two lines is worth it to move
the project forward as the stack migrates to Hadoop 3.x.

Jason



On Thu, Mar 22, 2018 at 7:16 PM, Eric Wohlstadter
 wrote:
> Hi all,
> I’d like to propose that we move towards aligning the Tez master branch with 
> support for Hadoop 3+ only.
>  A separate branch and distribution (e.g. on Maven Central) would be created 
> to maintain the 0.9.x line with support for Hadoop 2.7+.
>
> This will help ensure that Tez can continue to move forward with other 
> progress in the greater Hadoop community.
> Since Hadoop 3 is not backward compatible with Hadoop 2, my opinion is that 
> it is too difficult for Tez to maintain such backward compatibility.
>
>
>   *   Tez master branch would support only Hadoop 3+ moving forward
>   *   Bug fixes would be required to be pushed both to master and the 
> 0.9.x line
>   *   Major feature or performance improvements would be required to be 
> pushed to both master and the 0.9.x line (unless they require Hadoop 3+)
>   *   Minor feature or performance improvements can be pushed only to master
>   *   A new release with Hadoop 3+ only support would be placed on high 
> priority (possibly 0.10?)
>  *   At a minimum the issues under TEZ-3903 would be required
>
> Please help to provide any feedback or comments about this unofficial 
> proposal.
> This is not an official vote but it would help to get people’s 
> thoughts/questions or unofficial (+1, -1).
>


[jira] [Created] (TEZ-3898) TestTezCommonUtils fails when compiled against hadoop version >= 2.8

2018-02-16 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3898:
---

 Summary: TestTezCommonUtils fails when compiled against hadoop 
version >= 2.8
 Key: TEZ-3898
 URL: https://issues.apache.org/jira/browse/TEZ-3898
 Project: Apache Tez
  Issue Type: Bug
Reporter: Jason Lowe
Assignee: Jason Lowe


TestTezCommonUtils fails when compiled against hadoop 2.8 or later:
{noformat}
$ cd tez-api
$ mvn test -Phadoop28 -P-hadoop27 -Dhadoop.version=2.8.3 -Dtest=TestTezCommonUtils
Running org.apache.tez.common.TestTezCommonUtils
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.266 sec <<< 
FAILURE!
org.apache.tez.common.TestTezCommonUtils  Time elapsed: 0.265 sec  <<< ERROR!
java.lang.NoClassDefFoundError: 
org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetFactory
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at 
org.apache.hadoop.hdfs.server.datanode.FsDatasetTestUtils$Factory.getFactory(FsDatasetTestUtils.java:47)
at 
org.apache.hadoop.hdfs.MiniDFSCluster$Builder.<init>(MiniDFSCluster.java:199)
at 
org.apache.tez.common.TestTezCommonUtils.setup(TestTezCommonUtils.java:60)
{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (TEZ-3896) TestATSV15HistoryLoggingService#testNonSessionDomains is failing

2018-02-16 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3896:
---

 Summary: TestATSV15HistoryLoggingService#testNonSessionDomains is 
failing
 Key: TEZ-3896
 URL: https://issues.apache.org/jira/browse/TEZ-3896
 Project: Apache Tez
  Issue Type: Bug
Reporter: Jason Lowe
Assignee: Jason Lowe


TestATSV15HistoryLoggingService always fails:
{noformat}
Running org.apache.tez.dag.history.logging.ats.TestATSV15HistoryLoggingService
Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.789 sec <<< 
FAILURE!
testNonSessionDomains(org.apache.tez.dag.history.logging.ats.TestATSV15HistoryLoggingService)
  Time elapsed: 0.477 sec  <<< FAILURE!
org.mockito.exceptions.verification.TooManyActualInvocations: 
historyACLPolicyManager.updateTimelineEntityDomain(
,
"session-id"
);
Wanted 5 times:
-> at 
org.apache.tez.dag.history.logging.ats.TestATSV15HistoryLoggingService.testNonSessionDomains(TestATSV15HistoryLoggingService.java:231)
But was 6 times. Undesired invocation:
-> at 
org.apache.tez.dag.history.logging.ats.ATSV15HistoryLoggingService.logEntity(ATSV15HistoryLoggingService.java:389)

at 
org.apache.tez.dag.history.logging.ats.TestATSV15HistoryLoggingService.testNonSessionDomains(TestATSV15HistoryLoggingService.java:231)
{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (TEZ-3821) Ability to fail fast tasks that write too much to local disk

2017-08-21 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3821:
---

 Summary: Ability to fail fast tasks that write too much to local 
disk
 Key: TEZ-3821
 URL: https://issues.apache.org/jira/browse/TEZ-3821
 Project: Apache Tez
  Issue Type: Bug
Reporter: Jason Lowe


It would be nice to have a configurable limit such that any task that wrote 
data to the local filesystem beyond that limit would fail quickly rather than 
waiting for the disk to fill much later, impacting other jobs on the cluster.

This is essentially asking for the Tez version of MAPREDUCE-6489.
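As a rough illustration of the fail-fast idea (not the Tez implementation; the wrapper class and limit are invented), a stream wrapper that throws once a configured local-write budget is exceeded:
{noformat}
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Invented example: fail fast once a task writes more than limitBytes locally.
public class LimitedLocalWriteStream extends FilterOutputStream {
  private final long limitBytes;
  private long written;

  public LimitedLocalWriteStream(OutputStream out, long limitBytes) {
    super(out);
    this.limitBytes = limitBytes;
  }

  @Override
  public void write(byte[] b, int off, int len) throws IOException {
    written += len;
    if (written > limitBytes) {
      throw new IOException("Local write limit of " + limitBytes + " bytes exceeded");
    }
    out.write(b, off, len);
  }

  @Override
  public void write(int b) throws IOException {
    write(new byte[] { (byte) b }, 0, 1);
  }
}
{noformat}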



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


Re: [VOTE] Release Apache Tez-0.9.0 RC1

2017-07-20 Thread Jason Lowe
+1
- Verified signatures and digests- Verified rat check was clean- Built from 
source successfully- Deployed to a cluster and ran some sample jobs
Jason
 

On Tuesday, July 18, 2017 1:09 AM, zhiyuan yang  wrote:
 

 I have created a tez-0.9.0 release candidate (rc1). 

GIT source tag:
https://git-wip-us.apache.org/repos/asf/tez/repo?p=tez.git;a=log;h=refs/tags/release-0.9.0-rc1

Staging site:
https://dist.apache.org/repos/dist/dev/tez/apache-tez-0.9.0-rc1/ 


Nexus Staging URL:
https://repository.apache.org/content/repositories/orgapachetez-1060


PGP release keys:
https://pgp.mit.edu/pks/lookup?op=get&search=0x9388FB144BC5CC4F

KEYS file available at
https://dist.apache.org/repos/dist/release/tez/KEYS 


List of issues fixed in the release:
https://issues.apache.org/jira/browse/TEZ/fixforversion/12334632 

Also available in CHANGES.txt within the release tarball.

Vote will be open for at least 72 hours ( until the required number of PMC
votes are obtained).

[ ] +1 approve
[ ] +0 no opinion
[ ] -1 disapprove (and reason why)

Here’s my +1

Thanks!
Zhiyuan

   

[jira] [Created] (TEZ-3770) DAG-aware YARN task scheduler

2017-06-22 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3770:
---

 Summary: DAG-aware YARN task scheduler
 Key: TEZ-3770
 URL: https://issues.apache.org/jira/browse/TEZ-3770
 Project: Apache Tez
  Issue Type: New Feature
Reporter: Jason Lowe
Assignee: Jason Lowe


There are cases where priority alone does not convey the relationship between 
tasks, and this can cause problems when scheduling or preempting tasks.  If the 
YARN task scheduler was aware of the relationship between tasks then it could 
make smarter decisions when trying to assign tasks to containers or preempt 
running tasks to schedule pending tasks.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (TEZ-3744) Findbug warnings after TEZ-3334 merge

2017-05-25 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3744:
---

 Summary: Findbug warnings after TEZ-3334 merge
 Key: TEZ-3744
 URL: https://issues.apache.org/jira/browse/TEZ-3744
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.9.0
Reporter: Jason Lowe


There are findbug warnings in precommit builds that appear to be caused by the 
recent TEZ-3334 merge.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (TEZ-3741) Tez outputs should free memory when closed

2017-05-25 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3741:
---

 Summary: Tez outputs should free memory when closed
 Key: TEZ-3741
 URL: https://issues.apache.org/jira/browse/TEZ-3741
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.7.1, 0.9.0
Reporter: Jason Lowe
Assignee: Jason Lowe


Memory buffers aren't being released as quickly as they could be, e.g.: 
DefaultSorter is holding onto the very large kvbuffer byte array even after 
close() is called, and Ordered and Unordered outputs should remove references 
to sorter and kvWriter in their close.
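A minimal sketch of the pattern being asked for (class and field names are hypothetical, not DefaultSorter's actual internals): dropping the reference in close() lets the buffer be collected even while the output object itself remains reachable from the framework.
{noformat}
// Hypothetical example, not Tez code: release the big sort buffer on close().
public class BufferingOutput implements AutoCloseable {
  private byte[] kvbuffer = new byte[64 << 20];   // large in-memory sort buffer

  @Override
  public void close() {
    kvbuffer = null;   // let the GC reclaim the buffer before the task exits
  }
}
{noformat}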




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (TEZ-3738) TestUnorderedPartitionedKVWriter fails due to RejectedExecutionException

2017-05-25 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe resolved TEZ-3738.
-
Resolution: Duplicate

> TestUnorderedPartitionedKVWriter fails due to RejectedExecutionException
> 
>
> Key: TEZ-3738
> URL: https://issues.apache.org/jira/browse/TEZ-3738
> Project: Apache Tez
>  Issue Type: Bug
>    Reporter: Jason Lowe
>
> TestUnorderedPartitionedKVWriter is failing in recent precommit builds.  
> Stacktrace to follow.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


Re: [VOTE] Merge TEZ-3334 into master

2017-05-22 Thread Jason Lowe
+1, looking forward to this functionality being available in an official 
release!
Jason
 

On Friday, May 19, 2017 3:16 PM, Jonathan Eagles  wrote:
 

 This vote is to merge the Tez Shuffle Handler feature branch (TEZ-3334)
into the master branch. Having done extensive testing and thanks to all the
feedback given, the Tez Shuffle Handler delivers on its promise to reduce
shuffle times for the auto-reduce case and pave the way for more shuffle
types in the future.

Please view a patch of the proposed merge attached to the TEZ-3334 jira
https://issues.apache.org/jira/browse/TEZ-3334

To be most useful and time efficient, please test and vote early if
you have any misgivings or questions about the readiness of the branch, so
as not to delay the merge process and increase the amount of work on my
part to continue maintaining this branch against master.

This vote will be open until Wednesday around the same time.

Here is my +1.

Regards,
jeagles


   

[jira] [Resolved] (TEZ-3702) Tez shuffle jar includes service loader entry for ClientProtocolProvider but not the corresponding class

2017-04-27 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe resolved TEZ-3702.
-
   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: TEZ-3334

Thanks for the reviews!  I committed this to the TEZ-3334 branch.

> Tez shuffle jar includes service loader entry for ClientProtocolProvider but 
> not the corresponding class
> 
>
> Key: TEZ-3702
> URL: https://issues.apache.org/jira/browse/TEZ-3702
> Project: Apache Tez
>  Issue Type: Sub-task
>Affects Versions: TEZ-3334
>    Reporter: Jason Lowe
>Assignee: Jason Lowe
> Fix For: TEZ-3334
>
> Attachments: TEZ-3702.001.patch
>
>
> The tez-aux-shuffle jar is shading the tez-mapreduce dependency but that 
> causes the service loader entry for 
> org.apache.hadoop.mapreduce.protocol.ClientProtocolProvider to be included 
> without including the referenced 
> org.apache.tez.mapreduce.client.YarnTezClientProtocolProvider class.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (TEZ-3695) TestTezSharedExecutor fails sporadically

2017-04-24 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3695:
---

 Summary: TestTezSharedExecutor fails sporadically
 Key: TEZ-3695
 URL: https://issues.apache.org/jira/browse/TEZ-3695
 Project: Apache Tez
  Issue Type: Bug
Reporter: Jason Lowe


TestTezSharedExecutor#testSerialExecution is timing out more often than not for 
me when running the full TestTezSharedExecutor test suite.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (TEZ-3693) ControlledClock is not used

2017-04-21 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3693:
---

 Summary: ControlledClock is not used
 Key: TEZ-3693
 URL: https://issues.apache.org/jira/browse/TEZ-3693
 Project: Apache Tez
  Issue Type: Bug
Reporter: Jason Lowe
Priority: Trivial


The org.apache.tez.dag.app.ControlledClock class is not referenced in the 
source.  Oddly this is not a test class, like MockClock, as I would have 
expected.  If this is not part of the Tez API then it can be removed.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (TEZ-3535) YarnTaskScheduler can hold onto low priority containers until they expire

2016-11-10 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3535:
---

 Summary: YarnTaskScheduler can hold onto low priority containers 
until they expire
 Key: TEZ-3535
 URL: https://issues.apache.org/jira/browse/TEZ-3535
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.8.4, 0.7.1
Reporter: Jason Lowe
Assignee: Jason Lowe


With container reuse enabled, YarnTaskScheduler will retain but not schedule 
any container allocations that are lower priority than the highest priority 
task requests.  This can lead to poor performance as these lower priority 
containers clog up resources needed for high priority allocations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TEZ-3508) TestTaskScheduler cleanup

2016-11-02 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3508:
---

 Summary: TestTaskScheduler cleanup
 Key: TEZ-3508
 URL: https://issues.apache.org/jira/browse/TEZ-3508
 Project: Apache Tez
  Issue Type: Test
Reporter: Jason Lowe
Assignee: Jason Lowe


TestTaskScheduler is very fragile, since it builds mocks of the AMRM client 
that are tied very specifically to the particulars of the way the 
YarnTaskScheduler is coded.  Any variance in that often leads to test failures 
because the mocks no longer accurately reflect what the real AMRM client does.

It would be much simpler and more robust to leverage the AMRMClientForTest and 
AMRMAsyncClientForTest classes in TestTaskSchedulerHelpers rather than maintain 
fragile mocks attempting to emulate the behaviors of those classes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TEZ-3491) Tez job can hang due to container priority inversion

2016-10-25 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3491:
---

 Summary: Tez job can hang due to container priority inversion
 Key: TEZ-3491
 URL: https://issues.apache.org/jira/browse/TEZ-3491
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.7.1
Reporter: Jason Lowe
Priority: Critical


If the Tez AM receives containers at a lower priority than the highest priority 
task being requested then it fails to assign the container to any task.  In 
addition if the container is new then it refuses to release it if there are any 
pending tasks.  If it takes too long for the higher priority requests to be 
fulfilled (e.g.: the lower priority containers are filling the queue) then 
eventually YARN will expire the unused lower priority containers since they 
were never launched.  The Tez AM then never re-requests these lower priority 
containers and the job hangs because the AM is waiting for containers from the 
RM that the RM already sent and expired.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TEZ-3462) Task attempt failure during container shutdown loses useful container diagnostics

2016-10-06 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3462:
---

 Summary: Task attempt failure during container shutdown loses 
useful container diagnostics
 Key: TEZ-3462
 URL: https://issues.apache.org/jira/browse/TEZ-3462
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.7.1
Reporter: Jason Lowe


When a nodemanager kills a task attempt due to excessive memory usage it will 
send a SIGTERM followed by a SIGKILL.  It also sends a useful diagnostic 
message with the container completion event to the RM which will eventually 
make it to the AM on a subsequent heartbeat.

However if the JVM shutdown processing causes an error in the task (e.g.: 
filesystem being closed by shutdown hook) then the task attempt can report a 
failure before the useful NM diagnostic makes it to the AM.  The AM then 
records some other error as the task failure reason, and by the time the 
container completion status makes it to the AM it does not associate that error 
with the task attempt and the useful information is lost.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TEZ-3444) Handling of fetch-failures should consider time spent producing output

2016-09-22 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3444:
---

 Summary: Handling of fetch-failures should consider time spent 
producing output
 Key: TEZ-3444
 URL: https://issues.apache.org/jira/browse/TEZ-3444
 Project: Apache Tez
  Issue Type: Improvement
Reporter: Jason Lowe


When handling fetch failures and deciding whether the upstream task should be 
re-run, we should consider the duration of the upstream task that generated the 
data being fetched.  If the upstream task ran for a long time then we may want 
to retry a bit harder before deciding to re-run.  If the upstream task executed 
in a few seconds then we should probably re-run the upstream task more 
aggressively, since that may be cheaper than multiple retries that time out.
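A toy version of such a heuristic might look like the following; the threshold, cap, and class name are invented for illustration and are not what Tez does:
{noformat}
// Invented heuristic: cheap upstream tasks are re-run after a single failed
// fetch, while long-running producers earn proportionally more retries.
public final class FetchRetryHeuristic {
  private static final long CHEAP_TASK_MILLIS = 30_000L;
  private static final int MAX_RETRIES = 10;

  public static int retriesBeforeRerun(long upstreamRuntimeMillis) {
    if (upstreamRuntimeMillis < CHEAP_TASK_MILLIS) {
      return 1;
    }
    return (int) Math.min(MAX_RETRIES, upstreamRuntimeMillis / CHEAP_TASK_MILLIS);
  }
}
{noformat}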




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TEZ-3415) Ability to configure shuffle server listen queue length

2016-08-19 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3415:
---

 Summary: Ability to configure shuffle server listen queue length
 Key: TEZ-3415
 URL: https://issues.apache.org/jira/browse/TEZ-3415
 Project: Apache Tez
  Issue Type: Sub-task
Reporter: Jason Lowe






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (TEZ-3336) Hive map-side join job sometimes fails with ROOT_INPUT_INIT_FAILURE

2016-08-09 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe resolved TEZ-3336.
-
Resolution: Invalid

Closing this as invalid since it seems like a problem with Hive's use of Tez 
rather than Tez itself.  [~mithun] please reopen with details if you find 
otherwise.

> Hive map-side join job sometimes fails with ROOT_INPUT_INIT_FAILURE
> ---
>
> Key: TEZ-3336
> URL: https://issues.apache.org/jira/browse/TEZ-3336
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.7.1
>    Reporter: Jason Lowe
>
> When Hive does a map-side join it can generate a DAG where a vertex has two 
> inputs, one from an upstream task and another using MRInputAMSplitGenerator.  
> If it takes a while for MRInputAMSplitGenerator to compute the splits and one 
> of the tasks for the other upstream vertex completes then the job can fail 
> with an error since MRInputAMSplitGenerator does not expect to receive any 
> events.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TEZ-3368) NPE in DelayedContainerManager

2016-07-20 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3368:
---

 Summary: NPE in DelayedContainerManager
 Key: TEZ-3368
 URL: https://issues.apache.org/jira/browse/TEZ-3368
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.7.1
Reporter: Jason Lowe


Saw a Tez AM hang due to an NPE in the DelayedContainerManager:
{noformat}
2016-07-17 01:53:23,157 [ERROR] [DelayedContainerManager] 
|yarn.YarnUncaughtExceptionHandler|: Thread 
Thread[DelayedContainerManager,5,main] threw an Exception.
java.lang.NullPointerException
at 
org.apache.tez.dag.app.rm.TezAMRMClientAsync.getMatchingRequestsForTopPriority(TezAMRMClientAsync.java:142)
at 
org.apache.tez.dag.app.rm.YarnTaskSchedulerService.getMatchingRequestWithoutPriority(YarnTaskSchedulerService.java:1474)
at 
org.apache.tez.dag.app.rm.YarnTaskSchedulerService.access$500(YarnTaskSchedulerService.java:84)
at 
org.apache.tez.dag.app.rm.YarnTaskSchedulerService$NodeLocalContainerAssigner.assignReUsedContainer(YarnTaskSchedulerService.java:1869)
at 
org.apache.tez.dag.app.rm.YarnTaskSchedulerService.assignReUsedContainerWithLocation(YarnTaskSchedulerService.java:1753)
at 
org.apache.tez.dag.app.rm.YarnTaskSchedulerService.assignDelayedContainer(YarnTaskSchedulerService.java:733)
at 
org.apache.tez.dag.app.rm.YarnTaskSchedulerService.access$600(YarnTaskSchedulerService.java:84)
at 
org.apache.tez.dag.app.rm.YarnTaskSchedulerService$DelayedContainerManager.run(YarnTaskSchedulerService.java:2030)
{noformat}

After the DelayedContainerManager thread exited the AM proceeded to receive 
requested containers that would go unused until the container allocations 
expired.  Then they would be re-requested, and the cycle repeated indefinitely.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TEZ-3350) Shuffle spills are not spilled to a container-specific directory

2016-07-14 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3350:
---

 Summary: Shuffle spills are not spilled to a container-specific 
directory
 Key: TEZ-3350
 URL: https://issues.apache.org/jira/browse/TEZ-3350
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.7.1
Reporter: Jason Lowe


If a Tez task receives too much input data and needs to spill the inputs to 
disk it ends up using a path that is not container-specific.  Therefore YARN 
will not automatically cleanup these files when the container exits as it 
should, and instead the files linger until the entire application completes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TEZ-3306) Improve container priority assignments for vertices

2016-06-16 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3306:
---

 Summary: Improve container priority assignments for vertices
 Key: TEZ-3306
 URL: https://issues.apache.org/jira/browse/TEZ-3306
 Project: Apache Tez
  Issue Type: Improvement
Reporter: Jason Lowe


After TEZ-3296 the priority space is sparsely used.  We should consider doing a 
breadth-first traversal of the DAG or reusing the client-side topological 
sorting to allow a more efficient use of the priority space.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TEZ-3296) Tez job can hang if two vertices at the same root distance have different task requirements

2016-06-09 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3296:
---

 Summary: Tez job can hang if two vertices at the same root 
distance have different task requirements
 Key: TEZ-3296
 URL: https://issues.apache.org/jira/browse/TEZ-3296
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.7.1
Reporter: Jason Lowe
Priority: Critical


When two vertices have the same distance from the root Tez will schedule 
containers with the same priority.  However those vertices could have different 
task requirements and therefore different capabilities.  As documented in 
YARN-314, YARN currently doesn't support requests for multiple sizes at the 
same priority.  In practice this leads to one vertex's allocation requests 
clobbering the other's, and that can result in a situation where the Tez AM is 
waiting on containers it will never receive from the RM.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TEZ-3293) Fetch failures can cause a shuffle hang waiting for memory merge that never starts

2016-06-08 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3293:
---

 Summary: Fetch failures can cause a shuffle hang waiting for 
memory merge that never starts
 Key: TEZ-3293
 URL: https://issues.apache.org/jira/browse/TEZ-3293
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.8.3, 0.7.1
Reporter: Jason Lowe
Assignee: Jason Lowe


Tez jobs can hang in shuffle waiting for a memory merge that never starts.  
When a MapOutput is reserved it increments usedMemory but when it is unreserved 
it decrements usedMemory _and_ commitMemory.  If enough shuffle failures of 
sufficient size occur then commitMemory may never reach the merge threshold even 
after all outstanding transfers have committed, and thus the shuffle hangs.
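To make the accounting asymmetry concrete, here is a hypothetical sketch (names invented, not the real merge manager code) of the symmetric bookkeeping one would expect: reserve/unreserve only touch usedMemory, and commitMemory only moves when data is actually committed or merged.
{noformat}
// Hypothetical illustration of symmetric shuffle-memory accounting.
class MergeAccounting {
  private long usedMemory;     // bytes reserved for in-flight fetches
  private long commitMemory;   // bytes of successfully fetched, mergeable data

  synchronized void reserve(long size)    { usedMemory += size; }
  synchronized void unreserve(long size)  { usedMemory -= size; }    // fetch failed
  synchronized void commit(long size)     { commitMemory += size; }  // fetch succeeded
  synchronized void startMerge(long size) { usedMemory -= size; commitMemory -= size; }
}
{noformat}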



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TEZ-3260) Ability to disable IFile checksum verification during shuffle transfers

2016-05-16 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3260:
---

 Summary: Ability to disable IFile checksum verification during 
shuffle transfers
 Key: TEZ-3260
 URL: https://issues.apache.org/jira/browse/TEZ-3260
 Project: Apache Tez
  Issue Type: Improvement
Reporter: Jason Lowe


In TEZ-3237 [~rajesh.balamohan] requested the ability to avoid the 
computational expense of verifying IFile checksums during shuffle transfers for 
cases where the user is not concerned about data corruption and would like the 
additional performance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TEZ-3246) Improve diagnostics when DAG killed by user

2016-05-06 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3246:
---

 Summary: Improve diagnostics when DAG killed by user
 Key: TEZ-3246
 URL: https://issues.apache.org/jira/browse/TEZ-3246
 Project: Apache Tez
  Issue Type: Improvement
Reporter: Jason Lowe


It would be nice if the DAG diagnostics included the user and host that 
originated the kill request for a DAG.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TEZ-3244) Allow overlap of input and output memory when they are not concurrent

2016-05-06 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3244:
---

 Summary: Allow overlap of input and output memory when they are 
not concurrent
 Key: TEZ-3244
 URL: https://issues.apache.org/jira/browse/TEZ-3244
 Project: Apache Tez
  Issue Type: Bug
Reporter: Jason Lowe


For cases when memory for inputs and outputs are not needed simultaneously it 
would be more efficient to allow inputs to use the memory normally set aside 
for outputs and vice-versa.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TEZ-3237) Corrupted shuffle transfers to disk are not detected during transfer

2016-04-29 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3237:
---

 Summary: Corrupted shuffle transfers to disk are not detected 
during transfer
 Key: TEZ-3237
 URL: https://issues.apache.org/jira/browse/TEZ-3237
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Jason Lowe


When a shuffle transfer is larger than the single transfer limit it gets 
written straight to disk during the transfer.  Unfortunately there are no 
checksum validations performed during that transfer, so if the data is 
corrupted at the source or in transit it goes undetected.  Only later, when 
the task tries to consume the transferred data, is the error detected, but at 
that point it's too late to blame the source task for the error.
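A generic illustration of checksumming during the transfer rather than at consumption time (plain JDK code, not Tez's IFile checksum scheme):
{noformat}
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.CRC32;
import java.util.zip.CheckedInputStream;

// Generic example: compute a checksum while streaming a fetch straight to disk
// so corruption is caught during the transfer, not when the data is consumed.
public final class ChecksummedCopy {
  public static long copyWithChecksum(InputStream in, OutputStream diskOut)
      throws IOException {
    CheckedInputStream checked = new CheckedInputStream(in, new CRC32());
    byte[] buf = new byte[64 * 1024];
    int n;
    while ((n = checked.read(buf)) != -1) {
      diskOut.write(buf, 0, n);
    }
    // The caller compares this value against the checksum advertised by the source.
    return checked.getChecksum().getValue();
  }
}
{noformat}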



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TEZ-3213) Uncaught exception during vertex recovery leads to invalid state transition loop

2016-04-13 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3213:
---

 Summary: Uncaught exception during vertex recovery leads to 
invalid state transition loop
 Key: TEZ-3213
 URL: https://issues.apache.org/jira/browse/TEZ-3213
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Jason Lowe


If an uncaught exception occurs during a state transition from the RECOVERING 
vertex then V_INTERNAL_ERROR will be delivered to the state machine, but that 
event is not handled in the RECOVERING state.  That in turn causes a 
V_INTERNAL_ERROR event to be delivered to the state machine, and it loops 
logging the invalid transitions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TEZ-3203) DAG hangs when one of the upstream vertices has zero tasks

2016-04-07 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3203:
---

 Summary: DAG hangs when one of the upstream vertices has zero tasks
 Key: TEZ-3203
 URL: https://issues.apache.org/jira/browse/TEZ-3203
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Jason Lowe
Priority: Critical


A DAG hangs during execution if it has a vertex with multiple inputs and one of 
those upstream vertices has zero tasks and is using ShuffleVertexManager.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TEZ-3193) Deadlock in AM during task commit request

2016-03-31 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3193:
---

 Summary: Deadlock in AM during task commit request
 Key: TEZ-3193
 URL: https://issues.apache.org/jira/browse/TEZ-3193
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.8.2, 0.7.1
Reporter: Jason Lowe
Priority: Blocker


The AM can deadlock between TaskImpl and TaskAttemptImpl.  Stacktrace and 
details in a followup comment.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TEZ-3191) NM container diagnostics for excess resource usage can be lost if task fails while being killed

2016-03-30 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3191:
---

 Summary: NM container diagnostics for excess resource usage can be 
lost if task fails while being killed
 Key: TEZ-3191
 URL: https://issues.apache.org/jira/browse/TEZ-3191
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Jason Lowe


This is the Tez version of MAPREDUCE-4955.  I saw a misconfigured Tez job 
report a task attempt as failed due to a filesystem closed error because the NM 
killed the container due to excess memory usage.  Unfortunately the SIGTERM 
sent by the NM caused the filesystem shutdown hook to close the filesystems, 
and that triggered a failure in the main thread.  If the failure is reported to 
the AM via the umbilical before the NM container status is received via the RM 
then the useful container diagnostics from the NM are lost in the job history.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TEZ-3167) TestRecovery occasionally times out

2016-03-19 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3167:
---

 Summary: TestRecovery occasionally times out
 Key: TEZ-3167
 URL: https://issues.apache.org/jira/browse/TEZ-3167
 Project: Apache Tez
  Issue Type: Bug
Reporter: Jason Lowe


TestRecovery has been timing out sporadically in precommit builds.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TEZ-3141) mapreduce.task.timeout is not translated to container heartbeat timeout

2016-02-25 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3141:
---

 Summary: mapreduce.task.timeout is not translated to container 
heartbeat timeout
 Key: TEZ-3141
 URL: https://issues.apache.org/jira/browse/TEZ-3141
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.7.1
Reporter: Jason Lowe
Assignee: Jason Lowe


TEZ-2966 added the deprecation to the runtime key map, but the container  
timeout is an AM-level property and therefore the runtime map translation is 
missed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TEZ-3114) Shuffle OOM due to EventMetaData flood

2016-02-11 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3114:
---

 Summary: Shuffle OOM due to EventMetaData flood
 Key: TEZ-3114
 URL: https://issues.apache.org/jira/browse/TEZ-3114
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Jason Lowe


A task encountered an OOM during shuffle, and investigation of the heap dump 
showed a lot of memory being consumed by almost 3.5 million EventMetaData 
objects.  Auto-parallelism had reduced the number of tasks in the vertex to 1 
and there were 2000 upstream tasks to shuffle.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TEZ-3115) Shuffle string handling adds significant memory overhead

2016-02-11 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3115:
---

 Summary: Shuffle string handling adds significant memory overhead
 Key: TEZ-3115
 URL: https://issues.apache.org/jira/browse/TEZ-3115
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Jason Lowe


While investigating the OOM heap dump from TEZ-3114 I noticed that the 
ShuffleManager and other shuffle-related objects were holding onto many strings 
that added up to over a hundred megabytes of memory.
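One common way to shrink that kind of footprint, shown here only as a generic illustration (a Guava interner, not the change made for this issue), is to deduplicate repeated strings such as host names so identical values share a single instance:
{noformat}
import com.google.common.collect.Interner;
import com.google.common.collect.Interners;

// Generic illustration: intern frequently repeated strings (e.g. host names)
// held by shuffle bookkeeping so duplicates share one instance.
public class HostStrings {
  private static final Interner<String> INTERNER = Interners.newWeakInterner();

  public static String canonical(String host) {
    return INTERNER.intern(host);
  }
}
{noformat}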



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TEZ-3102) Fetch failure of a speculated task causes job hang

2016-02-08 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3102:
---

 Summary: Fetch failure of a speculated task causes job hang
 Key: TEZ-3102
 URL: https://issues.apache.org/jira/browse/TEZ-3102
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Jason Lowe
Assignee: Jason Lowe
Priority: Critical


If a task speculates and then succeeds, one attempt will be marked successful and 
the other killed.  Then, if the task retroactively fails due to fetch failures, the 
Tez AM will fail to reschedule another attempt.  This results in a hung job.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TEZ-3066) TaskAttemptFinishedEvent ConcurrentModificationException if processed by RecoveryService and history logging simultaneously

2016-01-20 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3066:
---

 Summary: TaskAttemptFinishedEvent ConcurrentModificationException 
if processed by RecoveryService and history logging simultaneously
 Key: TEZ-3066
 URL: https://issues.apache.org/jira/browse/TEZ-3066
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Jason Lowe


A ConcurrentModificationException can occur if a TaskAttemptFinishedEvent is 
processed simultaneously by the recovery service and another history logging 
service.  Sample stacktraces to follow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TEZ-3051) Vertex failed with invalid event DAG_VERTEX_RERUNNING at SUCCEEDED

2016-01-19 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3051:
---

 Summary: Vertex failed with invalid event DAG_VERTEX_RERUNNING at 
SUCCEEDED
 Key: TEZ-3051
 URL: https://issues.apache.org/jira/browse/TEZ-3051
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Jason Lowe


I saw a job fail due to an internal error on a vertex: 
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: 
DAG_VERTEX_RERUNNING at SUCCEEDED

Stacktrace to follow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TEZ-3009) Errors that occur during container task acquisition are not logged

2015-12-17 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3009:
---

 Summary: Errors that occur during container task acquisition are 
not logged
 Key: TEZ-3009
 URL: https://issues.apache.org/jira/browse/TEZ-3009
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Jason Lowe


If TezChild encounters an error while trying to obtain a task, the error will be 
silently handled.  This results in a mysterious shutdown of containers with no 
logged cause.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TEZ-3010) Container task acquisition has no retries for errors

2015-12-17 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3010:
---

 Summary: Container task acquisition has no retries for errors
 Key: TEZ-3010
 URL: https://issues.apache.org/jira/browse/TEZ-3010
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Jason Lowe


There are no retries for errors that occur during task acquisition.  If any error 
occurs the container will just shut down, resulting in task attempt failures if 
a task attempt happened to be assigned to the container by the AM.  The 
container should try harder to obtain the task before giving up.
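A generic sketch of the retry-with-logging behavior being requested (the helper below is invented and is not TezChild's actual code):
{noformat}
import java.util.concurrent.Callable;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Invented helper: log each failed attempt and retry a few times before giving
// up, instead of shutting the container down silently.
public final class RetryingTaskAcquisition {
  private static final Logger LOG = LoggerFactory.getLogger(RetryingTaskAcquisition.class);

  public static <T> T acquire(Callable<T> attempt, int maxAttempts, long sleepMillis)
      throws Exception {
    for (int i = 1; ; i++) {
      try {
        return attempt.call();
      } catch (Exception e) {
        LOG.warn("Task acquisition attempt {} of {} failed", i, maxAttempts, e);
        if (i >= maxAttempts) {
          throw e;
        }
        Thread.sleep(sleepMillis);
      }
    }
  }
}
{noformat}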



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)