Re: detecting oversized bucket in mapreduce

2018-11-27 Thread Daniel Templeton
There are no per-key metrics provided by MapReduce, but you should be 
able to run your job with an identity reducer to see what the bucket 
sizes were.
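For readers trying this: the per-key counts that an identity reducer would let you inspect amount to a group-by-key count over the map output. A stdlib-only Java sketch of that computation (plain Java collections standing in for Hadoop's shuffle; the class and data are illustrative, not Hadoop API):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class BucketSizes {
    // Given the (key, value) pairs emitted by the map phase, compute how many
    // records each reduce-side "bucket" (key group) would receive. This is the
    // same information an identity reducer's output lets you inspect.
    static Map<String, Long> bucketSizes(List<Map.Entry<String, String>> mapOutput) {
        return mapOutput.stream()
            .collect(Collectors.groupingBy(Map.Entry::getKey, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> records = List.of(
            Map.entry("a", "r1"), Map.entry("a", "r2"),
            Map.entry("b", "r3"), Map.entry("a", "r4"));
        Map<String, Long> sizes = bucketSizes(records);
        System.out.println(sizes.get("a"));  // 3
        System.out.println(sizes.get("b"));  // 1
    }
}
```

In a real job the same counts could also be gathered cheaply with per-key MapReduce counters, as long as the key cardinality is small enough for the counter limit.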


If you're talking about doing it on the fly, there's no way to do that 
today.  The job is submitted with a fixed number of reducers, which also 
fixes the number of buckets.  YARN supports adding resources to an 
existing job, e.g. adding more reducers, but MapReduce doesn't make use 
of those capabilities.
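The reason the bucket count is fixed is that the default partitioner maps each key to a reducer index using only the key's hash and the configured reducer count. The formula below is the one used by Hadoop's default HashPartitioner; the sketch is plain Java rather than the Hadoop API:

```java
public class Partitioning {
    // Hadoop's default HashPartitioner assigns a record to reducer
    // (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks.
    // The bitmask clears the sign bit so the index is never negative.
    static int partition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // With a fixed reducer count, every record with a given key always
        // lands in the same bucket, so an oversized bucket cannot be split
        // mid-job without changing the key itself (e.g. by salting).
        System.out.println(partition("some-key", 4));
        System.out.println(partition("some-key", 4));  // identical, deterministic
    }
}
```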


Daniel

On 11/26/18 9:10 PM, Tianxiang Li wrote:

Dear Hadoop community,

I'm new to the Hadoop MapReduce code, and I'd like to know how I can get the 
number of records under a specific key value after the map process. I'd like to 
detect oversized buckets and perform further key division to split the records.

Thanks,
Peter




-
To unsubscribe, e-mail: mapreduce-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-dev-h...@hadoop.apache.org



Re: Hadoop 3.1.0 release discussion

2018-01-31 Thread Daniel Templeton
I added my comments on that JIRA.  Looks like YARN-7292 is marked as a 
blocker for 3.1, and I would tend to agree with that.  Let's see what we 
can do to get profiles nailed down so that 3.1 can go forward.


Daniel

On 1/18/18 10:25 AM, Wangda Tan wrote:

Thanks Daniel,

We need to make a decision about this: 
https://issues.apache.org/jira/browse/YARN-7292 - I believe this is the 
JIRA you mentioned, correct? Please let me know if there's anything 
else, and let's move the discussion to the JIRA.


The good news is that resource profiles are already merged to trunk, so 
we can finish that before the code freeze date (Feb 08).


+ Sunil as well.

Thanks,
Wangda



On Wed, Jan 17, 2018 at 4:31 PM, Daniel Templeton <dan...@cloudera.com 
<mailto:dan...@cloudera.com>> wrote:


What's the status on resource profiles?  I believe there are still
a couple of open JIRAs to rethink some of the design choices.

Daniel


On 1/17/18 11:33 AM, Wangda Tan wrote:

Hi All,

We're fast approaching the previously proposed feature
freeze date (Jan 30, about 13 days from today). If you
have any features which live in a branch and are targeted
to 3.1.0, please reply to this email thread. Ideally, we
should finish branch merging before the feature freeze date.

Here's an updated 3.1.0 feature status:

1. Merged & Completed features:
* (Sunil) YARN-5881: Support absolute value in CapacityScheduler.
* (Wangda) YARN-6223: GPU support on YARN. Features in trunk
and works
end-to-end.
* (Jian) YARN-5079,YARN-4793,YARN-4757,YARN-6419 YARN native
services.
* (Steve Loughran): HADOOP-13786: S3Guard committer for
zero-rename commits.
* (Suma): YARN-7117: Capacity Scheduler: Support Auto Creation
of Leaf
Queues While Doing Queue Mapping.
* (Chris Douglas) HDFS-9806: HDFS Tiered Storage.

2. Features close to finish:
* (Zhankun) YARN-5983: FPGA support. Majority implementations
completed and
merged to trunk. Except for UI/documentation.
* (Uma) HDFS-10285: HDFS SPS. Majority implementations are
done, some
discussions going on about implementation.
* (Arun Suresh / Kostas / Wangda). YARN-6592: New
SchedulingRequest and
anti-affinity support. Close to finish, on track to be merged
before Jan 30.

3. Tentative features:
* (Arun Suresh). YARN-5972: Support pausing/freezing opportunistic
containers. Only one pending patch. Plan to finish before Jan 7th.
* (Haibo Chen). YARN-1011: Resource overcommitment. Looks
challenging to be
done before Jan 2018.
* (Anu): HDFS-7240: Ozone. Given the discussion on HDFS-7240.
Looks
challenging to be done before Jan 2018.
* (Varun V) YARN-5673: container-executor write. Given
security refactoring
of c-e (YARN-6623) is already landed, IMHO other stuff may be
moved to 3.2.

Thanks,
Wangda




On Fri, Dec 15, 2017 at 1:20 PM, Wangda Tan
<wheele...@gmail.com <mailto:wheele...@gmail.com>> wrote:

Hi all,

Congratulations on the 3.0.0-GA release!

As we discussed in the previous email thread [1], I'd like
to restart
3.1.0 release plans.

a) Quick summary:
a.1 Release status
We started 3.1 release discussion on Sep 6, 2017 [1]. As
of today,
there’re 232 patches loaded on 3.1.0 alone [2], besides 6
open blockers and
22 open critical issues.

a.2 Release date update
Considering the month-and-a-half delay of the 3.0-GA
release, I propose moving the dates as follows:
  - feature freeze date from Dec 15, 2017, to Jan 30, 2018
- last date for
any branches to get merged too;
  - code freeze (blockers & critical only) date to Feb 08,
2018;
  - release voting start by Feb 18, 2018, leaving time for
at least two RCx
  - release date from Jan 15, 2018, to Feb 28, 2018;

Unlike before, I added an additional milestone for
release-vote-start so
that we can account for voting time-period also.

Overall this is still a 5 1/2-month release timeline, not
the faster cadence we hoped for, but in my opinion it is
the best updated timeline given the delays of the final
3.0-GA release.

b) Individual feature status:
I spoke to several feature owners and checked the status
of unfinished features; following is the status of the
features planned for 3.1.0:

b.1 Merged & Completed featu

Re: Hadoop 3.1.0 release discussion

2018-01-17 Thread Daniel Templeton
What's the status on resource profiles?  I believe there are still a 
couple of open JIRAs to rethink some of the design choices.


Daniel

On 1/17/18 11:33 AM, Wangda Tan wrote:

Hi All,

We're fast approaching the previously proposed feature freeze date (Jan
30, about 13 days from today). If you have any features which live in a
branch and are targeted to 3.1.0, please reply to this email thread.
Ideally, we should finish branch merging before the feature freeze date.

Here's an updated 3.1.0 feature status:

1. Merged & Completed features:
* (Sunil) YARN-5881: Support absolute value in CapacityScheduler.
* (Wangda) YARN-6223: GPU support on YARN. Features in trunk and works
end-to-end.
* (Jian) YARN-5079,YARN-4793,YARN-4757,YARN-6419 YARN native services.
* (Steve Loughran): HADOOP-13786: S3Guard committer for zero-rename commits.
* (Suma): YARN-7117: Capacity Scheduler: Support Auto Creation of Leaf
Queues While Doing Queue Mapping.
* (Chris Douglas) HDFS-9806: HDFS Tiered Storage.

2. Features close to finish:
* (Zhankun) YARN-5983: FPGA support. Majority implementations completed and
merged to trunk. Except for UI/documentation.
* (Uma) HDFS-10285: HDFS SPS. Majority implementations are done, some
discussions going on about implementation.
* (Arun Suresh / Kostas / Wangda). YARN-6592: New SchedulingRequest and
anti-affinity support. Close to finish, on track to be merged before Jan 30.

3. Tentative features:
* (Arun Suresh). YARN-5972: Support pausing/freezing opportunistic
containers. Only one pending patch. Plan to finish before Jan 7th.
* (Haibo Chen). YARN-1011: Resource overcommitment. Looks challenging to be
done before Jan 2018.
* (Anu): HDFS-7240: Ozone. Given the discussion on HDFS-7240. Looks
challenging to be done before Jan 2018.
* (Varun V) YARN-5673: container-executor write. Given security refactoring
of c-e (YARN-6623) is already landed, IMHO other stuff may be moved to 3.2.

Thanks,
Wangda




On Fri, Dec 15, 2017 at 1:20 PM, Wangda Tan  wrote:


Hi all,

Congratulations on the 3.0.0-GA release!

As we discussed in the previous email thread [1], I'd like to restart
3.1.0 release plans.

a) Quick summary:
a.1 Release status
We started 3.1 release discussion on Sep 6, 2017 [1]. As of today,
there’re 232 patches loaded on 3.1.0 alone [2], besides 6 open blockers and
22 open critical issues.

a.2 Release date update
Considering the month-and-a-half delay of the 3.0-GA release, I propose
moving the dates as follows:
  - feature freeze date from Dec 15, 2017, to Jan 30, 2018 - last date for
any branches to get merged too;
  - code freeze (blockers & critical only) date to Feb 08, 2018;
  - release voting start by Feb 18, 2018, leaving time for at least two RCx
  - release date from Jan 15, 2018, to Feb 28, 2018;

Unlike before, I added an additional milestone for release-vote-start so
that we can account for voting time-period also.

Overall this is still a 5 1/2-month release timeline, not the faster
cadence we hoped for, but in my opinion it is the best updated timeline
given the delays of the final 3.0-GA release.

b) Individual feature status:
I spoke to several feature owners and checked the status of unfinished
features; following is the status of the features planned for 3.1.0:

b.1 Merged & Completed features:
* (Sunil) YARN-5881: Support absolute value in CapacityScheduler.
* (Wangda) YARN-6223: GPU support on YARN. Features in trunk and works
end-to-end.
* (Jian) YARN-5079,YARN-4793,YARN-4757,YARN-6419 YARN native services.
* (Steve Loughran): HADOOP-13786: S3Guard committer for zero-rename
commits.
* (Suma): YARN-7117: Capacity Scheduler: Support Auto Creation of Leaf
Queues While Doing Queue Mapping.

b.2 Features close to finish:
* (Chris Douglas) HDFS-9806: HDFS Tiered Storage. Voting is in progress now.
* (Zhankun) YARN-5983: FPGA support. Majority implementations completed
and merged to trunk. Except for UI/documentation.
* (Uma) HDFS-10285: HDFS SPS. Majority implementations are done, some
discussions going on about implementation.

b.3 Tentative features:
* (Arun Suresh). YARN-5972: Support pausing/freezing opportunistic
containers. Only one pending patch. Plan to finish before Jan 7th.
* (Haibo Chen). YARN-1011: Resource overcommitment. Looks challenging to
be done before Jan 2018.
* (Arun Suresh / Kostas / Wangda). YARN-6592: New SchedulingRequest and
anti-affinity support. Tentative; we will figure this out by Jan 1st.
* (Anu): HDFS-7240: Ozone. Given the discussion on HDFS-7240. Looks
challenging to be done before Jan 2018.
* (Varun V) YARN-5673: container-executor write. Given security
refactoring of c-e (YARN-6623) is already landed, IMHO other stuff may be
moved to 3.2.

b.4 Additional release drivers
* More exhaustive upgrade testing from 2.x to 3.x.

c) Regarding branch cut:

We will keep trunk pointing to 3.1 and cut branch-3.1 when either (A) some
feature planned for 3.2 has to land on trunk, or (B) the feature freeze
date passes, whichever comes first.

I've also talked 

Re: Apache Hadoop 2.8.3 Release Plan

2017-11-21 Thread Daniel Templeton
Doh.  Mailer dropped some of the lists.  Replying again to avoid 
fragmenting the discussion...


Still +1 to Andrew's comments.

Daniel

On 11/21/17 7:53 AM, Daniel Templeton wrote:

+1

Daniel

On 11/20/17 10:22 PM, Andrew Wang wrote:
I'm against including new features in maintenance releases, since 
they're

meant to be bug-fix only.

If we're struggling with being able to deliver new features in a safe 
and

timely fashion, let's try to address that, not overload the meaning of
"maintenance release".

Best,
Andrew

On Mon, Nov 20, 2017 at 5:20 PM, Zheng, Kai <kai.zh...@intel.com> wrote:


Hi Junping,

Thank you for making 2.8.2 happen and now planning the 2.8.3 release.

I have an ask: is it convenient to include the backport work for the OSS
connector module? We have some Hadoop users who wish to have it available
by default, though in the past they backported it themselves. I have
raised this and got thoughts from Chris and Steve; it looks like this is
more wanted for 2.9, but I wanted to ask again here for broader feedback
by this chance. The backport patch is available for 2.8, and the one for
branch-2 is already in. IMO, 2.8.x is promising, as we can see some shift
from 2.7.x, hence it's worth more important features and efforts. What do
you think? Thanks!

https://issues.apache.org/jira/browse/HADOOP-14964

Regards,
Kai

-Original Message-
From: Junping Du [mailto:j...@hortonworks.com]
Sent: Tuesday, November 14, 2017 9:02 AM
To: common-...@hadoop.apache.org; hdfs-...@hadoop.apache.org;
mapreduce-dev@hadoop.apache.org; yarn-...@hadoop.apache.org
Subject: Apache Hadoop 2.8.3 Release Plan

Hi,
 We have several important fixes landed on branch-2.8, and I would
like to cut branch-2.8.3 now to start the 2.8.3 release work.
 So far, I don't see any pending blockers on 2.8.3, so my current plan
is to cut the first RC of 2.8.3 in the next several days:
  - For all coming commits landing on branch-2.8, please mark the
fix version as 2.8.4.
  - If there is a really important fix for 2.8.3 that is close to
being resolved, please notify me before landing it on branch-2.8.3.
 Please let me know if you have any thoughts or comments on the plan.


Thanks,

Junping

From: dujunp...@gmail.com <dujunp...@gmail.com> on behalf of 俊平堵 <
junping...@apache.org>
Sent: Friday, October 27, 2017 3:33 PM
To: gene...@hadoop.apache.org
Subject: [ANNOUNCE] Apache Hadoop 2.8.2 Release.

Hi all,

 It gives me great pleasure to announce that the Apache Hadoop
community has voted to release Apache Hadoop 2.8.2, which is now 
available
for download from Apache mirrors[1]. For download instructions 
please refer

to the Apache Hadoop Release page [2].

Apache Hadoop 2.8.2 is the first GA release of the Apache Hadoop 2.8
line and our newest stable release for the entire Apache Hadoop project.
For major changes included in the Hadoop 2.8 line, please refer to the
Hadoop 2.8.2 main page[3].


This release has 315 resolved issues since the previous 2.8.1 release,
with the following breakdown:
    - 91 in Hadoop Common
    - 99 in HDFS
    - 105 in YARN
    - 20 in MapReduce
Please read the log of CHANGES[4] and RELEASENOTES[5] for more details.

The release news is posted on the Hadoop website too; you can go to the
downloads section directly [6].

Thank you all for contributing to the Apache Hadoop release!


Cheers,

Junping


[1] http://www.apache.org/dyn/closer.cgi/hadoop/common

[2] http://hadoop.apache.org/releases.html

[3] http://hadoop.apache.org/docs/r2.8.2/index.html

[4]
http://hadoop.apache.org/docs/r2.8.2/hadoop-project-dist/
hadoop-common/release/2.8.2/CHANGES.2.8.2.html

[5]
http://hadoop.apache.org/docs/r2.8.2/hadoop-project-dist/
hadoop-common/release/2.8.2/RELEASENOTES.2.8.2.html

[6] http://hadoop.apache.org/releases.html#Download


-
To unsubscribe, e-mail: common-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-dev-h...@hadoop.apache.org


-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org










Re: Apache Hadoop 2.8.3 Release Plan

2017-11-21 Thread Daniel Templeton

+1

Daniel

On 11/20/17 10:22 PM, Andrew Wang wrote:

I'm against including new features in maintenance releases, since they're
meant to be bug-fix only.

If we're struggling with being able to deliver new features in a safe and
timely fashion, let's try to address that, not overload the meaning of
"maintenance release".

Best,
Andrew

On Mon, Nov 20, 2017 at 5:20 PM, Zheng, Kai  wrote:


Hi Junping,

Thank you for making 2.8.2 happen and now planning the 2.8.3 release.

I have an ask: is it convenient to include the backport work for the OSS
connector module? We have some Hadoop users who wish to have it available by
default, though in the past they backported it themselves. I have raised
this and got thoughts from Chris and Steve; it looks like this is more
wanted for 2.9, but I wanted to ask again here for broader feedback by this
chance. The backport patch is available for 2.8, and the one for branch-2 is
already in. IMO, 2.8.x is promising, as we can see some shift from 2.7.x,
hence it's worth more important features and efforts. What do you think?
Thanks!

https://issues.apache.org/jira/browse/HADOOP-14964

Regards,
Kai

-Original Message-
From: Junping Du [mailto:j...@hortonworks.com]
Sent: Tuesday, November 14, 2017 9:02 AM
To: common-...@hadoop.apache.org; hdfs-...@hadoop.apache.org;
mapreduce-dev@hadoop.apache.org; yarn-...@hadoop.apache.org
Subject: Apache Hadoop 2.8.3 Release Plan

Hi,
 We have several important fixes landed on branch-2.8, and I would
like to cut branch-2.8.3 now to start the 2.8.3 release work.
 So far, I don't see any pending blockers on 2.8.3, so my current plan
is to cut the first RC of 2.8.3 in the next several days:
  - For all coming commits landing on branch-2.8, please mark the
fix version as 2.8.4.
  - If there is a really important fix for 2.8.3 that is close to
being resolved, please notify me before landing it on branch-2.8.3.
 Please let me know if you have any thoughts or comments on the plan.

Thanks,

Junping

From: dujunp...@gmail.com  on behalf of 俊平堵 <
junping...@apache.org>
Sent: Friday, October 27, 2017 3:33 PM
To: gene...@hadoop.apache.org
Subject: [ANNOUNCE] Apache Hadoop 2.8.2 Release.

Hi all,

 It gives me great pleasure to announce that the Apache Hadoop
community has voted to release Apache Hadoop 2.8.2, which is now available
for download from Apache mirrors[1]. For download instructions please refer
to the Apache Hadoop Release page [2].

Apache Hadoop 2.8.2 is the first GA release of the Apache Hadoop 2.8 line
and our newest stable release for the entire Apache Hadoop project. For
major changes included in the Hadoop 2.8 line, please refer to the Hadoop
2.8.2 main page[3].

This release has 315 resolved issues since the previous 2.8.1 release,
with the following breakdown:
- 91 in Hadoop Common
- 99 in HDFS
- 105 in YARN
- 20 in MapReduce
Please read the log of CHANGES[4] and RELEASENOTES[5] for more details.

The release news is posted on the Hadoop website too; you can go to the
downloads section directly [6].

Thank you all for contributing to the Apache Hadoop release!


Cheers,

Junping


[1] http://www.apache.org/dyn/closer.cgi/hadoop/common

[2] http://hadoop.apache.org/releases.html

[3] http://hadoop.apache.org/docs/r2.8.2/index.html

[4]
http://hadoop.apache.org/docs/r2.8.2/hadoop-project-dist/
hadoop-common/release/2.8.2/CHANGES.2.8.2.html

[5]
http://hadoop.apache.org/docs/r2.8.2/hadoop-project-dist/
hadoop-common/release/2.8.2/RELEASENOTES.2.8.2.html

[6] http://hadoop.apache.org/releases.html#Download










Re: Hadoop Compatibility Guide, Part Deux: Developer Docs

2017-10-30 Thread Daniel Templeton
We've now gone a couple of rounds of reviews on HADOOP-14876, and a 
patch is posted for HADOOP-14875.  Feedback is very welcome.  Please 
take a look.


Daniel

On 10/14/17 8:46 AM, Daniel Templeton wrote:
I just posted a first patch for HADOOP-14876 that adds downstream 
developer docs based on the overhauled compatibility guide from 
HADOOP-13714.  I would really appreciate some critical review of the 
doc, as it's much more likely to be read by downstream developers than 
the compatibility spec itself.


There's another doc coming in HADOOP-14875 that will add what amounts 
to upgrade docs for admins.  When that's complete, I will send another 
email here to solicit reviews.


Daniel






Re: [VOTE] Release Apache Hadoop 2.8.2 (RC0)

2017-09-11 Thread Daniel Templeton
YARN-6622 is now committed to 2.9.  We could backport YARN-5258 and 
YARN-6622 for 2.8, but it'll take some editing.  We'll have to check to 
see what features are unsupported in 2.8 and remove those from the 
docs.  Not a huge effort overall, though.  Probably an hour's work.  I 
may have time to try to do it later this week.  Anyone else want to volunteer?


Daniel

On 9/11/17 3:01 PM, Chris Douglas wrote:

On Mon, Sep 11, 2017 at 2:52 PM, Junping Du  wrote:

I don't think this -1 is reasonable, because:
- If you look at YARN-6622 closely, it aims to fix problematic
documentation from YARN-5258, which was checked into the 2.9 and 3.0
branches only. That means it targets a problem that never existed in 2.8.2.

...we're not going to document security implications- which include
escalations to root- because we don't have _any_ documentation? Why
don't we backport the documentation?


- The new Docker container support (a replacement for the old
DockerContainerExecutor) is still an alpha feature and isn't highlighted
among the 2.8 major features/improvements
(http://hadoop.apache.org/docs/r2.8.0/index.html). So adding documentation
here is also not a blocker.

YARN-6622 is *documenting* the fact that this is an alpha feature and
that it shouldn't be enabled in secure environments. How are users
supposed to make this determination without it?


Vote still continue until a real blocker comes.

Soright. I remain -1. -C



From: Chris Douglas 
Sent: Monday, September 11, 2017 12:00 PM
To: Junping Du
Cc: Miklos Szegedi; Mingliang Liu; Hadoop Common; Hdfs-dev; 
mapreduce-dev@hadoop.apache.org; yarn-...@hadoop.apache.org; junping_du
Subject: Re: [VOTE] Release Apache Hadoop 2.8.2 (RC0)

-1 (binding)

I don't think we should release this without YARN-6622.

Since this doesn't happen often: a -1 in this case is NOT a veto.
Releases are approved by majority vote of the PMC. -C

On Mon, Sep 11, 2017 at 11:45 AM, Junping Du  wrote:

Thanks Miklos for notifying on this. I think Docker support is generally
known as an alpha feature, so documenting it as experimental is nice to have
but not a blocker for 2.8.2. I also noticed that our 2.7.x document
(https://hadoop.apache.org/docs/r2.7.4/hadoop-yarn/hadoop-yarn-site/DockerContainerExecutor.html)
doesn't mention that Docker support is experimental. We may need to fix that
as well in following releases.

I can also add a note that the Docker container support feature is
experimental to the release message on the public website, just as in
previous releases we called 2.7.0/2.8.0 non-production releases.

I think vote should continue until we could find a real blocker.


Thanks,


Junping



From: Miklos Szegedi 
Sent: Monday, September 11, 2017 10:07 AM
To: Mingliang Liu
Cc: Hadoop Common; Hdfs-dev; mapreduce-dev@hadoop.apache.org; 
yarn-...@hadoop.apache.org; junping_du; Junping Du
Subject: Re: [VOTE] Release Apache Hadoop 2.8.2 (RC0)

Hello Junping,

Thank you for working on this. Should not YARN-6622 be addressed first? "Summary: 
Document Docker work as experimental".

Thank you,
Miklos


On Sun, Sep 10, 2017 at 6:39 PM, Mingliang Liu wrote:
Thanks Junping for doing this!

+1 (non-binding)

- Download the hadoop-2.8.2-src.tar.gz file and checked the md5 value
- Build package using maven (skipping tests) with Java 8
- Spin up a test cluster in Docker containers having 1 master node (NN/RM) and 
3 slave nodes (DN/NM)
- Operate the basic HDFS/YARN operations from command line, both client and 
admin
- Check NN/RM Web UI
- Run distcp to copy files from/to local and HDFS
- Run hadoop mapreduce examples: grep and wordcount
- Check the HDFS service logs

All looked good to me.

Mingliang


On Sep 10, 2017, at 5:00 PM, Junping Du wrote:

Hi folks,
 With fix of HADOOP-14842 get in, I've created our first release candidate 
(RC0) for Apache Hadoop 2.8.2.

 Apache Hadoop 2.8.2 is the first stable release of Hadoop 2.8 line and 
will be the latest stable/production release for Apache Hadoop - it includes 
305 new fixed issues since 2.8.1 and 63 fixes are marked as blocker/critical 
issues.

  More information about the 2.8.2 release plan can be found here: 
https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+2.8+Release

  New RC is available at: 
http://home.apache.org/~junping_du/hadoop-2.8.2-RC0

  The RC tag in git is: release-2.8.2-RC0, and the latest commit id is: 
e6597fe3000b06847d2bf55f2bab81770f4b2505

  The maven artifacts are available via 
repository.apache.org at: 
https://repository.apache.org/content/repositories/orgapachehadoop-1062

  Please try the release and vote; the vote will run for the usual 5 days, 
ending on 09/15/2017 5pm PST time.

Thanks,

Junping




Re: DISCUSS: Hadoop Compatibility Guidelines

2017-09-07 Thread Daniel Templeton
Good point.  I think it would be valuable to enumerate the policies 
around the versioned state stores.  We have the three you listed. We 
should probably include the HDFS fsimage in that list.  Any others?


I also want to add a section that clarifies when it's OK to change the 
visibility or audience of an API.


Daniel

On 9/5/17 11:04 AM, Arun Suresh wrote:

Thanks for starting this Daniel.

I think we should also add a section for store compatibility (all state
stores including RM, NM, Federation etc.). Essentially an explicit policy
detailing when is it ok to change the major and minor versions and how it
should relate to the hadoop release version.
Thoughts ?

Cheers
-Arun


On Tue, Sep 5, 2017 at 10:38 AM, Daniel Templeton <dan...@cloudera.com>
wrote:


Good idea.  I should have thought of that. :)  Done.

Daniel


On 9/5/17 10:33 AM, Anu Engineer wrote:


Could you please attach the PDFs to the JIRA. I think the mailer is
stripping them off from the mail.

Thanks
Anu





On 9/5/17, 9:44 AM, "Daniel Templeton" <dan...@cloudera.com> wrote:

Resending with a broader audience, and reattaching the PDFs.

Daniel

On 9/4/17 9:01 AM, Daniel Templeton wrote:


All, in prep for Hadoop 3 beta 1 I've been working on updating the
compatibility guidelines on HADOOP-13714.  I think the initial doc is
more or less complete, so I'd like to open the discussion up to the
broader Hadoop community.

In the new guidelines, I have drawn some lines in the sand regarding
compatibility between releases.  In some cases these lines are more
restrictive than the current practices.  The intent with the new
guidelines is not to limit progress by restricting what goes into a
release, but rather to drive release numbering to keep in line with
the reality of the code.

Please have a read and provide feedback on the JIRA.  I'm sure there
are more than a couple of areas that could be improved.  If you'd
rather not read markdown from a diff patch, I've attached PDFs of the
two modified docs.

Thanks!
Daniel




-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org








Re: DISCUSS: Hadoop Compatibility Guidelines

2017-09-05 Thread Daniel Templeton

Good idea.  I should have thought of that. :)  Done.

Daniel

On 9/5/17 10:33 AM, Anu Engineer wrote:

Could you please attach the PDFs to the JIRA. I think the mailer is stripping 
them off from the mail.

Thanks
Anu





On 9/5/17, 9:44 AM, "Daniel Templeton" <dan...@cloudera.com> wrote:


Resending with a broader audience, and reattaching the PDFs.

Daniel

On 9/4/17 9:01 AM, Daniel Templeton wrote:

All, in prep for Hadoop 3 beta 1 I've been working on updating the
compatibility guidelines on HADOOP-13714.  I think the initial doc is
more or less complete, so I'd like to open the discussion up to the
broader Hadoop community.

In the new guidelines, I have drawn some lines in the sand regarding
compatibility between releases.  In some cases these lines are more
restrictive than the current practices.  The intent with the new
guidelines is not to limit progress by restricting what goes into a
release, but rather to drive release numbering to keep in line with
the reality of the code.

Please have a read and provide feedback on the JIRA.  I'm sure there
are more than a couple of areas that could be improved.  If you'd
rather not read markdown from a diff patch, I've attached PDFs of the
two modified docs.

Thanks!
Daniel








Re: DISCUSS: Hadoop Compatibility Guidelines

2017-09-05 Thread Daniel Templeton

Resending with a broader audience, and reattaching the PDFs.

Daniel

On 9/4/17 9:01 AM, Daniel Templeton wrote:
All, in prep for Hadoop 3 beta 1 I've been working on updating the 
compatibility guidelines on HADOOP-13714.  I think the initial doc is 
more or less complete, so I'd like to open the discussion up to the 
broader Hadoop community.


In the new guidelines, I have drawn some lines in the sand regarding 
compatibility between releases.  In some cases these lines are more 
restrictive than the current practices.  The intent with the new 
guidelines is not to limit progress by restricting what goes into a 
release, but rather to drive release numbering to keep in line with 
the reality of the code.


Please have a read and provide feedback on the JIRA.  I'm sure there 
are more than a couple of areas that could be improved.  If you'd 
rather not read markdown from a diff patch, I've attached PDFs of the 
two modified docs.


Thanks!
Daniel





Re: [VOTE] Merge YARN-3926 (resource profile) to trunk

2017-08-26 Thread Daniel Templeton
Quick question, Wangda.  When you say that the feature can be turned 
off, do you mean resource types or resource profiles?  I know there's an 
off-by-default property that governs resource profiles, but I didn't see 
any way to turn off resource types.  Even if only CPU and memory are 
configured, i.e. no additional resource types, the code path is 
different than it was.  Specifically, where CPU and memory were 
primitives before, they're now entries in an array whose indexes have to 
be looked up through the ResourceUtils class.  Did I miss something?


For those who haven't followed the feature closely, there are really two 
features here.  Resource types allows for declarative extension of the 
resource system in YARN.  Resource profiles builds on top of resource 
types to allow a user to request a group of resources as a profile, much 
like EC2 instance types, e.g. "fast-compute" might mean 32GB RAM, 8 
vcores, and 2 GPUs.


Daniel

On 8/23/17 11:49 AM, Wangda Tan wrote:

  Hi folks,

Per earlier discussion [1], I'd like to start a formal vote to merge
feature branch YARN-3926 (Resource profile) to trunk. The vote will run for
7 days and will end August 30 10:00 AM PDT.

Briefly, YARN-3926 extends YARN's resource model to support resource
types other than CPU and memory, so it will be a cornerstone of features
like GPU support (YARN-6223), disk scheduling/isolation (YARN-2139), FPGA
support (YARN-5983), and network IO scheduling/isolation (YARN-2140). In
addition, YARN-3926 allows admins to preconfigure resource profiles in the
cluster; for example, m3.large means <2 vcores, 8 GB memory, 64 GB disk>,
so applications can request the "m3.large" profile instead of specifying
all resource types' values.
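For context on what preconfiguring resource profiles looks like on disk: profiles are defined in a JSON file read by the ResourceManager (resource-profiles.json, per the YARN-7056 documentation). The sketch below is illustrative only - the profile names and values are made up, and the exact schema should be checked against the merged docs:

```json
{
  "minimum":  { "memory-mb": 1024, "vcores": 1 },
  "default":  { "memory-mb": 2048, "vcores": 2 },
  "maximum":  { "memory-mb": 8192, "vcores": 8 },
  "m3.large": { "memory-mb": 8192, "vcores": 2 }
}
```

An application would then request a profile by name ("m3.large") rather than enumerating each resource type's value.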

There are 32 subtasks that were completed as part of this effort.

This feature needs to be explicitly turned on before use. We paid close
attention to the compatibility, performance, and scalability of this
feature. As mentioned in [1], we didn't see observable performance
regression in large-scale SLS (scheduler load simulator) executions and
saw less than 5% regression in the micro benchmark added by YARN-6775.

This feature works end-to-end (including UI/CLI/application/server). We
have set up a cluster with this feature turned on, running for several
weeks, and we haven't seen any issues so far.

Merge JIRA: YARN-7013 (Jenkins gave +1 already).
Documentation: YARN-7056

Special thanks to a team of folks who worked hard and contributed towards
this effort including design discussion/development/reviews, etc.: Varun
Vasudev, Sunil Govind, Daniel Templeton, Vinod Vavilapalli, Yufei Gu,
Karthik Kambatla, Jason Lowe, Arun Suresh.

Regards,
Wangda Tan

[1]
http://mail-archives.apache.org/mod_mbox/hadoop-yarn-dev/201708.mbox/%3CCAD%2B%2BeCnjEHU%3D-M33QdjnND0ZL73eKwxRua4%3DBbp4G8inQZmaMg%40mail.gmail.com%3E




-
To unsubscribe, e-mail: mapreduce-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-dev-h...@hadoop.apache.org



Re: Map reduce sample program

2017-08-22 Thread Daniel Templeton

On 8/19/17 3:28 AM, Remil Mohanan wrote:

I am trying to pass multiple non key values from mapper to reducer.


The only way to pass data from the mapper to the reducer is through 
passing key-values.  One common trick is to designate a special key as 
the out-of-band information key and then use a custom sorting comparator 
to make sure that key comes first in the sort order.  I'm sure you can 
find examples online.
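A rough simulation of that trick (the sentinel key and comparator below are made up for illustration, not an actual MR API): a custom sort comparator forces the special key to the front, so the reducer sees the out-of-band record before any data records.

```python
import functools

SPECIAL_KEY = "\x00META"  # sentinel key carrying out-of-band info

def compare_keys(a, b):
    # Custom sort order: the special key always sorts first,
    # everything else in natural order.
    if a == b:
        return 0
    if a == SPECIAL_KEY:
        return -1
    if b == SPECIAL_KEY:
        return 1
    return -1 if a < b else 1

mapper_output = [("banana", 1), (SPECIAL_KEY, "side-info"), ("apple", 1)]
shuffled = sorted(
    mapper_output,
    key=functools.cmp_to_key(lambda x, y: compare_keys(x[0], y[0])))
# The out-of-band record now arrives at the reducer before any data.
```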



Similarly for reading and writing a file inside the hdfs system other than 
normal read and write.



I don't understand.  Reading and writing a file in HDFS from an MR task 
works exactly the same as doing it from a stand-alone program. You 
probably want to do it in the setup() method, though.


Daniel




Re: Map reduce sample program

2017-08-16 Thread Daniel Templeton

Can you clarify what you mean for #1?  For #2, try this:

https://tutorials.techmytalk.com/2014/08/16/hadoop-hdfs-java-api/

Daniel

On 8/16/17 8:17 AM, Remil Mohanan wrote:







Hi there,

Please help me to get a sample program for each scenario.

1) need a Java map reducer sample program where multiple parameters
are passed from mapper to reducer.
2) need a Java map reducer program where there is a write to a file inside
hdfs filesystem as well as a read from a file inside hdfs other than
the normal input file and output file mentioned in the mapper and reducer.

Have a nice day

Thanks

Remil








[jira] [Created] (MAPREDUCE-6914) Tests use assertTrue(....equals(...)) instead of assertEquals()

2017-07-17 Thread Daniel Templeton (JIRA)
Daniel Templeton created MAPREDUCE-6914:
---

 Summary: Tests use assertTrue(....equals(...)) instead of 
assertEquals()
 Key: MAPREDUCE-6914
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6914
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: test
Affects Versions: 3.0.0-alpha4, 2.8.1
Reporter: Daniel Templeton
Assignee: Daniel Templeton
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)




[jira] [Created] (MAPREDUCE-6883) AuditLogger and TestAuditLogger are dead code

2017-05-03 Thread Daniel Templeton (JIRA)
Daniel Templeton created MAPREDUCE-6883:
---

 Summary: AuditLogger and TestAuditLogger are dead code
 Key: MAPREDUCE-6883
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6883
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: client
Affects Versions: 2.8.0
Reporter: Daniel Templeton
Priority: Minor


The {{AuditLogger}} and {{TestAuditLogger}} classes appear to be dead code.  I 
can't find anything that uses or references {{AuditLogger}}.  No one has 
touched the code since 2011.  I think it's safe to remove.







Re: Skip bad records when streaming supported?

2017-04-13 Thread Daniel Templeton

To quote the docs:

---
This feature can be used when map/reduce tasks crashes deterministically 
on certain input. This happens due to bugs in the map/reduce function. 
The usual course would be to fix these bugs. But sometimes this is not 
possible; perhaps the bug is in third party libraries for which the 
source code is not available. Due to this, the task never reaches to 
completion even with multiple attempts and complete data for that task 
is lost.


With this feature, only a small portion of data is lost surrounding the 
bad record, which may be acceptable for some user applications. see 
setMapperMaxSkipRecords(Configuration, long)

---

Basically, it's a heavy-handed approach that you should only use as a 
last resort.


Daniel


On 4/13/17 3:24 PM, Pillis W wrote:

Thanks Daniel.

Please correct me if I have understood this incorrectly, but according 
to the documentation at 
http://hadoop.apache.org/docs/r2.7.3/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Skipping_Bad_Records 
, it seemed like the sole purpose of this functionality is to tolerate 
unknown failures/exceptions in mappers/reducers. If I was able to 
catch all failures, I do not need to even use this ability - is that 
not true?


If I have understood it incorrectly, when would one use the feature to 
skip bad records?


Regards,
PW




On Thu, Apr 13, 2017 at 2:49 PM, Daniel Templeton <dan...@cloudera.com> wrote:


You have to modify wordcount-mapper-t1.py to just ignore the bad
line.  In the worst case, you should be able to do something like:

for line in sys.stdin:
  try:
    pass  # Insert processing code here
  except:
    pass  # Error processing record, ignore it

Daniel


On 4/13/17 1:33 PM, Pillis W wrote:

Hello,
I am using 'hadoop-streaming.jar' to do a simple word count,
and want to
skip records that fail execution. Below is the actual command
I run, and
the mapper always fails on one record, and hence fails the
job. The input
file is 3 lines with 1 bad line.

hadoop jar /usr/lib/hadoop/hadoop-streaming.jar -D
mapred.job.name=SkipTest
-Dmapreduce.task.skip.start.attempts=1
-Dmapreduce.map.skip.maxrecords=1
-Dmapreduce.reduce.skip.maxgroups=1
-Dmapreduce.map.skip.proc.count.autoincr=false
-Dmapreduce.reduce.skip.proc.count.autoincr=false -D
mapred.reduce.tasks=1
-D mapred.map.tasks=1 -files

/home/hadoop/wc/wordcount-mapper-t1.py,/home/hadoop/wc/wordcount-reducer-t1.py
-input /user/hadoop/data/test1 -output
/user/hadoop/data/output-test-5
-mapper "python wordcount-mapper-t1.py" -reducer "python
wordcount-reducer-t1.py"


I was wondering if skipping of records is supported when
MapReduce is used
in streaming mode?

Thanks in advance.
PW











Re: Skip bad records when streaming supported?

2017-04-13 Thread Daniel Templeton
You have to modify wordcount-mapper-t1.py to just ignore the bad line.  
In the worst case, you should be able to do something like:


for line in sys.stdin:
  try:
    pass  # Insert processing code here
  except:
    pass  # Error processing record, ignore it

Daniel

On 4/13/17 1:33 PM, Pillis W wrote:

Hello,
I am using 'hadoop-streaming.jar' to do a simple word count, and want to
skip records that fail execution. Below is the actual command I run, and
the mapper always fails on one record, and hence fails the job. The input
file is 3 lines with 1 bad line.

hadoop jar /usr/lib/hadoop/hadoop-streaming.jar -D mapred.job.name=SkipTest
-Dmapreduce.task.skip.start.attempts=1 -Dmapreduce.map.skip.maxrecords=1
-Dmapreduce.reduce.skip.maxgroups=1
-Dmapreduce.map.skip.proc.count.autoincr=false
-Dmapreduce.reduce.skip.proc.count.autoincr=false -D mapred.reduce.tasks=1
-D mapred.map.tasks=1 -files
/home/hadoop/wc/wordcount-mapper-t1.py,/home/hadoop/wc/wordcount-reducer-t1.py
-input /user/hadoop/data/test1 -output /user/hadoop/data/output-test-5
-mapper "python wordcount-mapper-t1.py" -reducer "python
wordcount-reducer-t1.py"


I was wondering if skipping of records is supported when MapReduce is used
in streaming mode?

Thanks in advance.
PW







Re: [VOTE] Release Apache Hadoop 2.8.0 (RC3)

2017-03-17 Thread Daniel Templeton
Thanks for the new RC, Junping.  I built from source and tried it out on 
a 2-node cluster with HA enabled.  I ran a pi job and some streaming 
jobs.  I tested that localization and failover work correctly, and I 
played a little with the YARN and HDFS web UIs.


I did encounter an old friend of mine, which is that if you submit a 
streaming job with input that is only 1 block, you will nonetheless get 
2 mappers that both process the same split. What's new this time is that 
the second mapper was consistently failing on certain input sizes.  I 
(re)verified that the issue also exists in 2.7.3, so it's not a 
regression.  I'm pretty sure it's been there since at least 2.6.0.  I 
filed MAPREDUCE-6864 for it.


Given that my issue was not a regression, I'm +1 on the RC.

Daniel

On 3/17/17 2:18 AM, Junping Du wrote:

Hi all,
  With fix of HDFS-11431 get in, I've created a new release candidate (RC3) 
for Apache Hadoop 2.8.0.

  This is the next minor release to follow up 2.7.0 which has been released 
for more than 1 year. It comprises 2,900+ fixes, improvements, and new 
features. Most of these commits are released for the first time in branch-2.

   More information about the 2.8.0 release plan can be found here: 
https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+2.8+Release

   New RC is available at: 
http://home.apache.org/~junping_du/hadoop-2.8.0-RC3

   The RC tag in git is: release-2.8.0-RC3, and the latest commit id is: 
91f2b7a13d1e97be65db92ddabc627cc29ac0009

   The maven artifacts are available via repository.apache.org at: 
https://repository.apache.org/content/repositories/orgapachehadoop-1057

   Please try the release and vote; the vote will run for the usual 5 days, 
ending on 03/22/2017 PDT time.

Thanks,

Junping







[jira] [Created] (MAPREDUCE-6864) Hadoop streaming creates 2 mappers when the input has only one block

2017-03-17 Thread Daniel Templeton (JIRA)
Daniel Templeton created MAPREDUCE-6864:
---

 Summary: Hadoop streaming creates 2 mappers when the input has 
only one block
 Key: MAPREDUCE-6864
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6864
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: mrv2
Affects Versions: 2.7.3
Reporter: Daniel Templeton


If a streaming job is run against input that is less than 2 blocks, 2 mappers 
will be created, both operating on the same split, both producing (duplicate) 
output.  In some cases the second mapper will consistently fail.  I've not seen 
the failure with input less than 10 bytes or more than a couple MB.  I have 
seen it with a 4kB input.







[jira] [Created] (MAPREDUCE-6848) MRApps.setMRFrameworkClasspath() unnecessarily declares that it throws IOException

2017-02-15 Thread Daniel Templeton (JIRA)
Daniel Templeton created MAPREDUCE-6848:
---

 Summary: MRApps.setMRFrameworkClasspath() unnecessarily declares 
that it throws IOException
 Key: MAPREDUCE-6848
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6848
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: mrv2
Affects Versions: 2.8.0
Reporter: Daniel Templeton
Priority: Trivial










[jira] [Created] (MAPREDUCE-6837) Add an equivalent to Crunch's Pair class

2017-01-26 Thread Daniel Templeton (JIRA)
Daniel Templeton created MAPREDUCE-6837:
---

 Summary: Add an equivalent to Crunch's Pair class
 Key: MAPREDUCE-6837
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6837
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: mrv2
Reporter: Daniel Templeton


Crunch has this great {{Pair}} class 
(https://crunch.apache.org/apidocs/0.14.0/org/apache/crunch/Pair.html) that 
save you from constantly implementing composite writables.  It seems silly that 
we still don't have an equivalent in MR.

I would like to see a new class with the following API:

{code}
package org.apache.hadoop.io;

public class CompositeWritable<P, S> implements WritableComparable<CompositeWritable<P, S>> {
  public CompositeWritable(P primary, S secondary);
  public P getPrimary();
  public void setPrimary(P primary);
  public S getSecondary();
  public void setSecondary(S secondary);

  // Return true if both primaries and both secondaries are equal
  public boolean equals(CompositeWritable<P, S> o);

  // Return the primary's hash code
  public int hashCode();

  // Sort first by primary and then by secondary
  public int compareTo(CompositeWritable<P, S> o);

  public void readFields(DataInput in);
  public void write(DataOutput out);
}
{code}

With such a class, implementing a secondary sort would mean just implementing a 
custom grouping comparator.  That comparator could be implemented as part of 
this JIRA:

{code}
package org.apache.hadoop.io;

public class CompositeGroupingComparator extends WritableComparator {
  ...
}
{code}

Or some such.

Crunch also provides {{Tuple3}}, {{Tuple4}}, and {{TupleN}} classes, but I 
don't think we need to add equivalents.  If someone really wants that 
capability, they can nest composite keys.

Don't forget to add unit tests!
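As a toy model of what the pair-plus-grouping-comparator pattern buys you (illustrative Python only, not the proposed Java API): the framework sorts map output by the full (primary, secondary) composite key, but a grouping comparator that looks only at the primary produces one reduce() call per primary, with values arriving already ordered by the secondary key.

```python
from itertools import groupby

# Map output keyed by a (primary, secondary) composite key.
records = [(("user2", 3), "c"), (("user1", 2), "b"), (("user1", 1), "a")]

# Shuffle sorts by the full composite key...
records.sort(key=lambda kv: kv[0])

# ...but the grouping comparator groups reducer input by primary only,
# so each group's values are already in secondary-key order.
grouped = {}
for primary, group in groupby(records, key=lambda kv: kv[0][0]):
    grouped[primary] = [value for _, value in group]
```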







Re: [VOTE] Release cadence and EOL

2017-01-17 Thread Daniel Templeton
Thanks for driving this, Sangjin. Quick question, though: the subject 
line is "Release cadence and EOL," but I don't see anything about 
cadence in the proposal.  Did I miss something?


Daniel

On 1/17/17 8:35 AM, Sangjin Lee wrote:

Following up on the discussion thread on this topic (
https://s.apache.org/eFOf), I'd like to put the proposal for a vote for the
release cadence and EOL. The proposal is as follows:

"A minor release line is end-of-lifed 2 years after it is released or there
are 2 newer minor releases, whichever is sooner. The community reserves the
right to extend or shorten the life of a release line if there is a good
reason to do so."

This also entails that we the Hadoop community commit to following this
practice and solving challenges to make it possible. Andrew Wang laid out
some of those challenges and what can be done in the discussion thread
mentioned above.

I'll set the voting period to 7 days. I understand a majority rule would
apply in this case. Your vote is greatly appreciated, and so are
suggestions!

Thanks,
Sangjin







[jira] [Resolved] (MAPREDUCE-6827) Failed to traverse Iterable values the second time in reduce() method

2017-01-03 Thread Daniel Templeton (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Templeton resolved MAPREDUCE-6827.
-
Resolution: Not A Problem

That is known, documented, and intended behavior.  The {{ValueIterator}}'s 
{{hasNext()}} and {{next()}} methods defer to the {{ReduceContextImpl}}'s 
{{BackupStore}} instance, so creating a new iterator won't help.  The reason we 
only go through the values once is to allow the data to be efficiently streamed.
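The semantics are analogous to a Python generator: one pass exhausts the stream, a second loop silently sees nothing, and a caller who needs two passes must cache the values on the first pass (a sketch of the concept, not the MR implementation):

```python
# The reducer's values behave like a single-pass generator.
values = (x for x in [1, 2, 3])

first_pass = list(values)   # consumes the stream
second_pass = list(values)  # already exhausted: yields nothing

# If two traversals are needed, materialize on the first pass
# and reuse the cached list:
total = sum(first_pass)
```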

> Failed to traverse Iterable values the second time in reduce() method
> -
>
> Key: MAPREDUCE-6827
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6827
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: task
>Affects Versions: 3.0.0-alpha1
> Environment: hadoop2.7.3
>Reporter: javaloveme
>
> Failed to traverse Iterable values the second time in reduce() method
> The following code is a reduce() method (of WordCount):
> {code:title=WordCount.java|borderStyle=solid}
>   public static class WcReducer extends Reducer<Text, IntWritable, Text, 
> IntWritable> {
>   @Override
>   protected void reduce(Text key, Iterable<IntWritable> values, 
> Context context)
>   throws IOException, InterruptedException {
>   // print some logs
>   List<String> vals = new LinkedList<>();
>   for(IntWritable i : values) {
>   vals.add(i.toString());
>   }
>   System.out.println(String.format(">>>> reduce(%s, 
> [%s])",
>   key, String.join(", ", vals)));
>   // sum of values
>   int sum = 0;
>   for(IntWritable i : values) {
>   sum += i.get();
>   }
>   System.out.println(String.format(">>>> reduced(%s, %s)",
>   key, sum));
>   
>   context.write(key, new IntWritable(sum));
>   }   
>   }
> {code}
> After running it, we got the result that all sums were zero!
> After debugging, it was found that the second foreach-loop was not executed, 
> and the root cause was the returned value of Iterable.iterator(), it returned 
> the same instance in the two calls called by foreach-loop. In general, 
> Iterable.iterator() should return a new instance in each call, such as 
> ArrayList.iterator().







[jira] [Created] (MAPREDUCE-6776) yarn.app.mapreduce.client.job.max-retries should have a more useful default

2016-09-12 Thread Daniel Templeton (JIRA)
Daniel Templeton created MAPREDUCE-6776:
---

 Summary: yarn.app.mapreduce.client.job.max-retries should have a 
more useful default
 Key: MAPREDUCE-6776
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6776
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: client
Affects Versions: 2.8.0
Reporter: Daniel Templeton
Assignee: Daniel Templeton


The default is 0, so any communication failure results in a client failure.  Oozie 
doesn't like that.  If the RM is failing over and Oozie gets a communication 
failure, it assumes the target job has failed.  I propose raising the default 
to something modest like 3 or 5.  The default retry interval is 2s.
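The client behavior in question can be sketched as a plain retry loop (illustrative only; `get_job_status` and its parameters are made up for this sketch, not the actual ClientServiceDelegate code):

```python
import time

def get_job_status(fetch, max_retries=3, retry_interval=2.0):
    """Retry `fetch` up to max_retries extra times before giving up.
    With max_retries=0 (the current default) a single communication
    failure is fatal, which is the behavior this JIRA proposes to
    change by raising the default."""
    attempt = 0
    while True:
        try:
            return fetch()
        except IOError:
            if attempt >= max_retries:
                raise  # out of retries: surface the failure
            attempt += 1
            time.sleep(retry_interval)
```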







[jira] [Resolved] (MAPREDUCE-6560) ClientServiceDelegate doesn't handle retries during AM restart as intended

2016-08-31 Thread Daniel Templeton (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Templeton resolved MAPREDUCE-6560.
-
Resolution: Invalid

Looks like I was just wrong.

> ClientServiceDelegate doesn't handle retries during AM restart as intended
> --
>
> Key: MAPREDUCE-6560
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6560
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>    Reporter: Daniel Templeton
>    Assignee: Daniel Templeton
>
> In the {{invoke()}} method, I found the following code:
> {code}
>   private AtomicBoolean usingAMProxy = new AtomicBoolean(false);
> ...
> // if it's AM shut down, do not decrement maxClientRetry as we wait 
> for
> // AM to be restarted.
> if (!usingAMProxy.get()) {
>   maxClientRetry--;
> }
> usingAMProxy.set(false);
> {code}
> When we create the AM proxy, we set the flag to true.  If we fail to connect, 
> the impact of the flag being true is that the code will try one extra time, 
> giving it 400ms instead of just 300ms.  I can't imagine that's the intended 
> behavior.  After any failure, the flag will forever more be false, but 
> fortunately (?!?) the flag is otherwise unused.
> Looks like I need to do some archeology to figure out how we ended up here.







Re: [VOTE] Release Apache Hadoop 2.7.3 RC0

2016-07-26 Thread Daniel Templeton
I just downloaded the build tarball and deployed it on a 2-node 
cluster.  It looks to me like it's compiled for the wrong platform:


# file /usr/lib/hadoop/bin/container-executor
/usr/lib/hadoop/bin/container-executor: setuid setgid Mach-O 64-bit 
executable


I'm also seeing the no-native-libraries warnings.

Daniel

On 7/26/16 6:12 PM, Rushabh Shah wrote:

Thanks Vinod for all the release work !
+1 (non-binding).
* Downloaded from source and built it.
* Deployed a pseudo-distributed cluster.
* Ran some sample jobs: sleep, pi.
* Ran some dfs commands.
* Everything works fine.


On Friday, July 22, 2016 9:16 PM, Vinod Kumar Vavilapalli wrote:


  Hi all,

I've created a release candidate RC0 for Apache Hadoop 2.7.3.

As discussed before, this is the next maintenance release to follow up 2.7.2.

The RC is available for validation at: 
http://home.apache.org/~vinodkv/hadoop-2.7.3-RC0/ 


The RC tag in git is: release-2.7.3-RC0

The maven artifacts are available via repository.apache.org at 
https://repository.apache.org/content/repositories/orgapachehadoop-1040/ 


The release-notes are inside the tar-balls at location 
hadoop-common-project/hadoop-common/src/main/docs/releasenotes.html. I hosted this at 
http://home.apache.org/~vinodkv/hadoop-2.7.3-RC0/releasenotes.html for your 
quick perusal.

As you may have noted, a very long fix-cycle for the License & Notice issues 
(HADOOP-12893) caused 2.7.3 (along with every other Hadoop release) to slip by 
quite a bit. This release's related discussion thread is linked below: [1].

Please try the release and vote; the vote will run for the usual 5 days.

Thanks,
Vinod

[1]: 2.7.3 release plan: 
https://www.mail-archive.com/hdfs-dev%40hadoop.apache.org/msg24439.html 










[jira] [Created] (MAPREDUCE-6719) -libjars should use wildcards to reduce the application footprint in the state store

2016-06-20 Thread Daniel Templeton (JIRA)
Daniel Templeton created MAPREDUCE-6719:
---

 Summary: -libjars should use wildcards to reduce the application 
footprint in the state store
 Key: MAPREDUCE-6719
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6719
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: distributed-cache
Affects Versions: 2.8.0
Reporter: Daniel Templeton
Assignee: Daniel Templeton
Priority: Critical










[jira] [Created] (MAPREDUCE-6714) Refactor UncompressedSplitLineReader.fillBuffer()

2016-06-09 Thread Daniel Templeton (JIRA)
Daniel Templeton created MAPREDUCE-6714:
---

 Summary: Refactor UncompressedSplitLineReader.fillBuffer()
 Key: MAPREDUCE-6714
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6714
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Affects Versions: 2.8.0
Reporter: Daniel Templeton


MAPREDUCE-6635 made this change:

{code}
-  maxBytesToRead = Math.min(maxBytesToRead,
-(int)(splitLength - totalBytesRead));
+  long leftBytesForSplit = splitLength - totalBytesRead;
+  // check if leftBytesForSplit exceed Integer.MAX_VALUE
+  if (leftBytesForSplit <= Integer.MAX_VALUE) {
+maxBytesToRead = Math.min(maxBytesToRead, (int)leftBytesForSplit);
+  }
{code}

The result is one more comparison than necessary and code that's a little 
convoluted.  The code can be simplified as:

{code}
  long leftBytesForSplit = splitLength - totalBytesRead;

  if (leftBytesForSplit < maxBytesToRead) {
maxBytesToRead = (int)leftBytesForSplit;
  }
{code}

The comparison will auto promote {{maxBytesToRead}}, making it safe.







[jira] [Created] (MAPREDUCE-6702) TestMiniMRChildTask.testTaskEnv and TestMiniMRChildTask.testTaskOldEnv are failing

2016-05-17 Thread Daniel Templeton (JIRA)
Daniel Templeton created MAPREDUCE-6702:
---

 Summary: TestMiniMRChildTask.testTaskEnv and 
TestMiniMRChildTask.testTaskOldEnv are failing
 Key: MAPREDUCE-6702
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6702
 Project: Hadoop Map/Reduce
  Issue Type: Test
  Components: client
Affects Versions: 3.0.0-alpha1
Reporter: Daniel Templeton










[jira] [Created] (MAPREDUCE-6632) Master.getMasterAddress() should be updated to use YARN-4629

2016-02-10 Thread Daniel Templeton (JIRA)
Daniel Templeton created MAPREDUCE-6632:
---

 Summary: Master.getMasterAddress() should be updated to use 
YARN-4629
 Key: MAPREDUCE-6632
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6632
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: applicationmaster
Reporter: Daniel Templeton
Assignee: Daniel Templeton
Priority: Minor


The new {{YarnClientUtil.getRmPrincipal()}} method can replace most of the 
{{Master.getMasterAddress()}} method and should, to reduce redundancy and 
improve serviceability.





Re: aspire to contribute

2016-02-02 Thread Daniel Templeton
I think most of what you want to know can be found on the elusively 
named "How to Contribute to Hadoop" wiki page:


https://wiki.apache.org/hadoop/HowToContribute

Daniel

On 2/2/16 7:26 AM, aditya singh wrote:

Hi,
I am a third-year student at BITS (India). I know Java (including
multithreading and design patterns), C/C++, and basic programming. I am new
to open source and wish to contribute. Could someone kindly help as to how
and where to start.





[jira] [Created] (MAPREDUCE-6620) Jobs that did not start are shown as starting in 1969 in the JHS web UI

2016-01-28 Thread Daniel Templeton (JIRA)
Daniel Templeton created MAPREDUCE-6620:
---

 Summary: Jobs that did not start are shown as starting in 1969 in 
the JHS web UI
 Key: MAPREDUCE-6620
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6620
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: jobhistoryserver
Affects Versions: 2.7.2
Reporter: Daniel Templeton
Assignee: Daniel Templeton


If a job fails to start, its start time is stored as -1.  The RM UI correctly handles 
negative start times.  The JHS UI does not, blindly converting it into a date 
in 1969.





[jira] [Created] (MAPREDUCE-6575) TestMRJobs.setup() should use YarnConfiguration properties instead of bare strings

2015-12-16 Thread Daniel Templeton (JIRA)
Daniel Templeton created MAPREDUCE-6575:
---

 Summary: TestMRJobs.setup() should use YarnConfiguration 
properties instead of bare strings
 Key: MAPREDUCE-6575
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6575
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Reporter: Daniel Templeton
Assignee: Daniel Templeton


YARN-5870 introduced the following line:

{code}
  conf.setInt("yarn.cluster.max-application-priority", 10);
{code}

It should instead be:

{code}
  conf.setInt(YarnConfiguration.MAX_CLUSTER_LEVEL_APPLICATION_PRIORITY, 10);
{code}


