[jira] [Updated] (MAPREDUCE-5728) Check NPE for serializer/deserializer in MapTask
[ https://issues.apache.org/jira/browse/MAPREDUCE-5728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated MAPREDUCE-5728:
Fix Version/s: (was: 2.6.0) 2.7.0

Check NPE for serializer/deserializer in MapTask

Key: MAPREDUCE-5728
URL: https://issues.apache.org/jira/browse/MAPREDUCE-5728
Project: Hadoop Map/Reduce
Issue Type: Improvement
Components: client
Affects Versions: 2.2.0
Reporter: Jerry He
Assignee: Jerry He
Priority: Minor
Fix For: 2.4.0, 2.7.0
Attachments: MAPREDUCE-5728-trunk.patch

Currently we get an NPE if the serializer/deserializer is not configured correctly.

{code}
14/01/14 11:52:35 INFO mapred.JobClient: Task Id : attempt_201401072154_0027_m_02_2, Status : FAILED
java.lang.NullPointerException
	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:944)
	at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:672)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:740)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:368)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
	at java.security.AccessController.doPrivileged(AccessController.java:362)
	at javax.security.auth.Subject.doAs(Subject.java:573)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1502)
	at org.apache.hadoop.mapred.Child.main(Child.java:249)
{code}

serializationFactory.getSerializer and serializationFactory.getDeserializer return null in this case. Let's check for null serializer/deserializer in MapTask so that users don't get a meaningless NPE.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
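The fix the patch aims for is a defensive null check with an actionable message instead of a bare NPE deep inside MapOutputBuffer. A minimal sketch of the pattern in plain Java - the registry map and the message text here are illustrative stand-ins for SerializationFactory and the actual patch:

```java
import java.util.HashMap;
import java.util.Map;

public class SerializerCheckDemo {
    // Hypothetical stand-in for SerializationFactory: returns null when
    // no serializer is registered for the requested class name.
    static final Map<String, Object> REGISTRY = new HashMap<>();

    static Object getSerializerChecked(String keyClass) {
        Object serializer = REGISTRY.get(keyClass);
        if (serializer == null) {
            // Fail fast with a clear hint instead of an NPE later on.
            throw new IllegalStateException(
                "Could not find a serializer for key class '" + keyClass
                + "'. Please ensure the configuration 'io.serializations' "
                + "is properly set if you are using custom serialization.");
        }
        return serializer;
    }

    public static void main(String[] args) {
        REGISTRY.put("org.apache.hadoop.io.Text", new Object());
        // Registered class resolves fine.
        getSerializerChecked("org.apache.hadoop.io.Text");
        // Unregistered class now yields an actionable error, not an NPE.
        try {
            getSerializerChecked("com.example.MyKey");
            throw new AssertionError("expected IllegalStateException");
        } catch (IllegalStateException expected) {
            System.out.println(expected.getMessage());
        }
    }
}
```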
[jira] [Commented] (MAPREDUCE-6156) Fetcher - connect() doesn't handle connection refused correctly
[ https://issues.apache.org/jira/browse/MAPREDUCE-6156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14210145#comment-14210145 ]

Arun C Murthy commented on MAPREDUCE-6156:

[~jlowe] Thanks. I've merged this into branch-2.6.0 also for hadoop-2.6.0-rc1.

Fetcher - connect() doesn't handle connection refused correctly

Key: MAPREDUCE-6156
URL: https://issues.apache.org/jira/browse/MAPREDUCE-6156
Project: Hadoop Map/Reduce
Issue Type: Bug
Reporter: Sidharta Seethana
Assignee: Junping Du
Priority: Blocker
Fix For: 2.6.0
Attachments: MAPREDUCE-6156-v2.patch, MAPREDUCE-6156-v3.patch, MAPREDUCE-6156.patch

The connect() function in the fetcher assumes that whenever an IOException is thrown, the amount of time passed equals connectionTimeout (see code snippet below). This is incorrect. For example, in case the NM is down, a ConnectException is thrown immediately - and the catch block assumes a minute has passed when that is not the case.

{code}
if (connectionTimeout < 0) {
  throw new IOException("Invalid timeout [timeout = " + connectionTimeout + " ms]");
} else if (connectionTimeout > 0) {
  unit = Math.min(UNIT_CONNECT_TIMEOUT, connectionTimeout);
}
// set the connect timeout to the unit-connect-timeout
connection.setConnectTimeout(unit);
while (true) {
  try {
    connection.connect();
    break;
  } catch (IOException ioe) {
    // update the total remaining connect-timeout
    connectionTimeout -= unit;
    // throw an exception if we have waited for timeout amount of time
    // note that the updated value of timeout is used here
    if (connectionTimeout == 0) {
      throw ioe;
    }
    // reset the connect timeout for the last try
    if (connectionTimeout < unit) {
      unit = connectionTimeout;
      // reset the connect time out for the final connect
      connection.setConnectTimeout(unit);
    }
  }
}
{code}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
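The cleaner approach is to charge the retry loop with actual elapsed time rather than assuming each failed attempt consumed a full unit. A self-contained sketch of that idea - the Connector interface and the back-off value are hypothetical, and this is not the committed patch:

```java
import java.io.IOException;

public class ConnectRetryDemo {
    interface Connector { void connect() throws IOException; }

    // Retry connect() until it succeeds or connectionTimeoutMs of *real*
    // time have elapsed. Measuring wall-clock time means an immediately
    // thrown ConnectException no longer burns a whole timeout unit.
    static void connectWithTimeout(Connector c, long connectionTimeoutMs)
            throws IOException, InterruptedException {
        long deadline = System.currentTimeMillis() + connectionTimeoutMs;
        while (true) {
            try {
                c.connect();
                return;
            } catch (IOException ioe) {
                long remaining = deadline - System.currentTimeMillis();
                if (remaining <= 0) {
                    throw ioe;  // budget truly exhausted
                }
                Thread.sleep(Math.min(100, remaining));  // brief back-off
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Fails twice immediately, then succeeds; elapsed time stays
        // far below the configured timeout.
        final int[] attempts = {0};
        long start = System.currentTimeMillis();
        connectWithTimeout(() -> {
            if (++attempts[0] < 3) throw new IOException("Connection refused");
        }, 60_000);
        System.out.println("attempts=" + attempts[0]
            + " elapsedMs=" + (System.currentTimeMillis() - start));
    }
}
```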
[jira] [Commented] (MAPREDUCE-6083) Map/Reduce dangerously adds Guava @Beta class to CryptoUtils
[ https://issues.apache.org/jira/browse/MAPREDUCE-6083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14198360#comment-14198360 ]

Arun C Murthy commented on MAPREDUCE-6083:

[~ctubbsii] - Apologies for the late response. Unfortunately, we can't change Guava versions in 2.6 since it would be incompatible.

Map/Reduce dangerously adds Guava @Beta class to CryptoUtils

Key: MAPREDUCE-6083
URL: https://issues.apache.org/jira/browse/MAPREDUCE-6083
Project: Hadoop Map/Reduce
Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Christopher Tubbs
Priority: Blocker
Labels: beta, deprecated, guava
Attachments: 0001-MAPREDUCE-6083-Avoid-client-use-of-deprecated-LimitI.patch

See HDFS-7040 for more background/details. In recent 2.6.0-SNAPSHOTs, the use of LimitInputStream was added to CryptoUtils. This is part of the API components of Hadoop, which severely impacts users utilizing newer versions of Guava, where the @Beta and @Deprecated class LimitInputStream has been removed (in version 15 and later), beyond the impact already experienced in 2.4.0 as identified in HDFS-7040.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
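Since the Guava version in 2.6 can't change, the way out is to stop depending on the removed class. The behaviour LimitInputStream provided is small enough to carry locally; a minimal sketch (an illustrative stand-in, not the attached patch, which by its title switches to a non-deprecated alternative):

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// A tiny stand-in for Guava's removed LimitInputStream: reads at most
// 'limit' bytes from the underlying stream, then reports end-of-stream.
public class BoundedInputStream extends FilterInputStream {
    private long remaining;

    public BoundedInputStream(InputStream in, long limit) {
        super(in);
        this.remaining = limit;
    }

    @Override
    public int read() throws IOException {
        if (remaining <= 0) return -1;
        int b = super.read();
        if (b >= 0) remaining--;
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (remaining <= 0) return -1;
        int n = super.read(buf, off, (int) Math.min(len, remaining));
        if (n > 0) remaining -= n;
        return n;
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new BoundedInputStream(
            new ByteArrayInputStream("hello world".getBytes()), 5);
        StringBuilder sb = new StringBuilder();
        int b;
        while ((b = in.read()) != -1) sb.append((char) b);
        System.out.println(sb);  // only the first 5 bytes survive the limit
    }
}
```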
[jira] [Updated] (MAPREDUCE-6083) Map/Reduce dangerously adds Guava @Beta class to CryptoUtils
[ https://issues.apache.org/jira/browse/MAPREDUCE-6083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated MAPREDUCE-6083:
Status: Open (was: Patch Available)

Yes, I'm cool with that. Tx!

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAPREDUCE-2841) Task level native optimization
[ https://issues.apache.org/jira/browse/MAPREDUCE-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14041754#comment-14041754 ]

Arun C Murthy commented on MAPREDUCE-2841:

I also noticed that the github has a bunch of code related to Pig, Hive etc. - I think we'd all agree that it needs to be in the respective projects eventually.

Task level native optimization

Key: MAPREDUCE-2841
URL: https://issues.apache.org/jira/browse/MAPREDUCE-2841
Project: Hadoop Map/Reduce
Issue Type: Improvement
Components: task
Environment: x86-64 Linux/Unix
Reporter: Binglin Chang
Assignee: Sean Zhong
Attachments: DESIGN.html, MAPREDUCE-2841.v1.patch, MAPREDUCE-2841.v2.patch, dualpivot-0.patch, dualpivotv20-0.patch, fb-shuffle.patch

I'm recently working on native optimization for MapTask based on JNI. The basic idea is to add a NativeMapOutputCollector to handle k/v pairs emitted by the mapper, so that sort, spill, and IFile serialization can all be done in native code. A preliminary test (on Xeon E5410, jdk6u24) showed promising results:

1. Sort is about 3x-10x as fast as Java (only binary string compare is supported).
2. IFile serialization speed is about 3x that of Java, about 500MB/s; if hardware CRC32C is used, things can get much faster (1G/
3. Merge code is not completed yet, so the test uses enough io.sort.mb to prevent mid-spill.

This leads to a total speedup of 2x~3x for the whole MapTask if IdentityMapper (a mapper that does nothing) is used. There are limitations, of course: currently only Text and BytesWritable are supported, and I have not thought through many things yet, such as how to support map-side combine. I had some discussion with somebody familiar with Hive; it seems these limitations won't be much of a problem for Hive to benefit from the optimizations, at least.

Advice or discussion about improving compatibility is most welcome :) Currently NativeMapOutputCollector has a static method called canEnable(), which checks whether the key/value type, comparator type, and combiner are all compatible; MapTask can then choose to enable NativeMapOutputCollector. This is only a preliminary test; more work needs to be done. I expect better final results, and I believe similar optimizations can be adopted for the reduce task and shuffle too.

--
This message was sent by Atlassian JIRA (v6.2#6252)
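The canEnable() gate described above amounts to a conjunction of compatibility checks with a pure-Java fallback when any of them fails. A hypothetical sketch - class names, the exact checks, and the method shape are illustrative, not the actual NativeMapOutputCollector API:

```java
public class NativeCollectorGate {
    // Only these key/value types have native serialization support in the
    // preliminary implementation described above.
    static boolean isSupportedType(String className) {
        return className.equals("org.apache.hadoop.io.Text")
            || className.equals("org.apache.hadoop.io.BytesWritable");
    }

    // Hypothetical canEnable(): every condition must hold, otherwise the
    // MapTask falls back to the pure-Java MapOutputBuffer.
    static boolean canEnable(String keyClass, String valueClass,
                             boolean defaultComparator, boolean hasCombiner) {
        return isSupportedType(keyClass)
            && isSupportedType(valueClass)
            && defaultComparator   // only binary string compare is supported
            && !hasCombiner;       // map-side combine not yet handled natively
    }

    public static void main(String[] args) {
        System.out.println(canEnable("org.apache.hadoop.io.Text",
            "org.apache.hadoop.io.BytesWritable", true, false));  // true
        System.out.println(canEnable("com.example.MyWritable",
            "org.apache.hadoop.io.Text", true, false));           // false
    }
}
```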
[jira] [Commented] (MAPREDUCE-2841) Task level native optimization
[ https://issues.apache.org/jira/browse/MAPREDUCE-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14041752#comment-14041752 ]

Arun C Murthy commented on MAPREDUCE-2841:

{quote} If the MR developer community generally agrees this belongs in the core, I'd like to start a feature branch for it in order to import the current code, sort out the build/integration issues, and take care of the remaining items that Sean mentioned above. {quote}

[~tlipcon] Thanks for starting this discussion. I have a few thoughts I'd like to run by you.

I think the eventual goal of this (looking at https://github.com/intel-hadoop/nativetask/blob/master/README.md) is a full native runtime for MapReduce including sort, shuffle, merge etc. Hence, it does look like we will end up with a compatible but alternate implementation of the MapReduce runtime, similar to other alternate runtimes such as Apache Tez.

Furthermore, this is implemented in C++ - which is, frankly, a concern given the poor job C++ has done with ABI compatibility. I'm glad to see that it doesn't rely on boost - the worst offender. This is the same reason the native Hadoop client (HADOOP-10388) is being done purely in C. Also, the MR development community is predominantly Java, which is something to keep in mind. This is a big concern for me.

In all, it seems to me we could consider having this not in Apache Hadoop, but as an incubator project to develop a native, MR-compatible runtime. This would allow it to develop a like-minded community (C++ skills etc.) and not be bogged down by *all* of Hadoop's requirements, such as security (how/when will this allow for secure or encrypted shuffle etc.) and compatibility with several OSes (flavours of Linux, MacOSX, Windows). It would also allow them to ship independently and get user feedback more quickly.
Similarly, I am wary of importing a nearly 75K LOC codebase into a stable project, and of the impact of breakage on our releases - particularly given the difference in skills in the community, i.e. Java vs. C++. What do you think, Todd, Sean? I'm more than happy to help with the incubator process if required.

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAPREDUCE-2841) Task level native optimization
[ https://issues.apache.org/jira/browse/MAPREDUCE-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14041757#comment-14041757 ]

Arun C Murthy commented on MAPREDUCE-2841:

On a related note to Pig/Hive etc. - I see Hadoop MapReduce fading away fast, particularly as projects using MR such as Pig, Hive, Cascading etc. re-vector onto other projects like Apache Tez or Apache Spark. For example:

# Hive-on-Tez (https://issues.apache.org/jira/browse/HIVE-4660) - The Hive community has already moved its major investments away from MR to Tez.
# Pig-on-Tez (https://issues.apache.org/jira/browse/PIG-3446) - The Pig community is very close to shipping this in pig-0.14 and again is investing heavily in Tez.

Given that, Sean/Todd, would it be useful to discuss contributing this to Tez instead? That way the work here would continue to stay relevant for the majority of MapReduce users, who use Pig, Hive, Cascading etc. Of course, another option is Apache Spark, but given that Tez is much closer (code-base wise) to MR, it would be much easier to contribute to Tez. Happy to help if that makes sense too. Thanks.

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAPREDUCE-2841) Task level native optimization
[ https://issues.apache.org/jira/browse/MAPREDUCE-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14041816#comment-14041816 ]

Arun C Murthy commented on MAPREDUCE-2841:

Todd,

bq. I agree that building a completely parallel C++ MR runtime is a much larger project that should not be part of Hadoop.

I'm confused. There already exists a large amount of code on github for the full task runtime. Is that abandoned? Are you saying there is no intention to contribute that to Hadoop, ever? Why would that be? Would that be a separate project?

With or without ABI concerns, C++ is still a major problem w.r.t. different compiler versions, the different platforms we support etc. That is precisely why HADOOP-10388 chose to use pure C only. A similar switch would make me *much* more comfortable, aside from the disparity in skills in the Hadoop community. Furthermore, considerably more security issues open up in C++ land, such as buffer overflows etc.

bq. I think the 75k you're counting may include the auto-generated shell scripts.

From the github:

{noformat}
$ find . -name '*.java' | xargs wc -l
11988 total
$ find . -name '*.h' | xargs wc -l
27269 total
$ find . -name '*.cc' | xargs wc -l
26276 total
{noformat}

Whether it's test or non-test code, we are still importing a *lot* of code - code which the Hadoop community will need to maintain.

bq. So, it's not a tiny import by any means, but for 2x improvement on terasort wallclock, my opinion is that the maintenance burden is worth it.

Todd, as we both know, there are many, many ways to get a 2x improvement on terasort... nor is it worth a lot in the real world outside of benchmarks. I'm sure we both would take 2x on Pig/Hive any day... *smile*

bq. As for importing to Tez, I don't think the community has generally agreed to EOL MapReduce

Regardless of whether or not we pull this into MR, it would be useful to pull it into Tez too - if Sean wants to do it. Let's not discourage them. I'm sure we both agree, and want to see, real-world workloads improve, and Hive/Pig/Cascading etc. represent that. IAC, hopefully we can stop this meme that I'm trying to *preclude* you from doing anything regardless of my religious beliefs. IAC, we both realize MR is reasonably stable and won't get a lot of investment, and so do our employers:

http://vision.cloudera.com/mapreduce-spark/
http://hortonworks.com/hadoop/tez/

So, I'd appreciate it if we don't misinterpret each other's technical opinions and concerns during this discussion. Thanks.

FTR: I'll restate my concerns about C++, the roadmap for the C++ runtime, maintainability, and support for all of Hadoop's requirements (security, platforms etc.). Furthermore, this jira was opened nearly 3 years ago and has only had sporadic bursts of activity - not a good sign for long-term maintainability. I've stated my concerns; let's try to get through them by focusing on those aspects.

Finally, what is the concern you see with starting this as an incubator project and allowing folks to develop a community around it? We can certainly help on our end by making it easy for them to plug in via interfaces etc. Thanks.

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (MAPREDUCE-2841) Task level native optimization
[ https://issues.apache.org/jira/browse/MAPREDUCE-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14041816#comment-14041816 ]

Arun C Murthy edited comment on MAPREDUCE-2841 at 6/24/14 8:30 AM:

Todd,

bq. I agree that building a completely parallel C++ MR runtime is a much larger project that should not be part of Hadoop.

I'm confused. There already exists a large amount of code on github for the full task runtime. Is that abandoned? Are you saying there is no intention to contribute that to Hadoop, ever? Why would that be? Would that be a separate project?

With or without ABI concerns, C++ is still a major problem w.r.t. different compiler versions, the different platforms we support etc. That is precisely why HADOOP-10388 chose to use pure C only. A similar switch would make me *much* more comfortable, aside from the disparity in skills in the Hadoop community. Furthermore, considerably more security issues open up in C++ land, such as buffer overflows etc.

bq. I think the 75k you're counting may include the auto-generated shell scripts.

From the github:

{noformat}
$ find . -name '*.java' | xargs wc -l
11988 total
$ find . -name '*.h' | xargs wc -l
27269 total
$ find . -name '*.cc' | xargs wc -l
26276 total
{noformat}

Whether it's test or non-test code, we are still importing a *lot* of code - code which the Hadoop community will need to maintain.

bq. So, it's not a tiny import by any means, but for 2x improvement on terasort wallclock, my opinion is that the maintenance burden is worth it.

Todd, as we both know, there are many, many ways to get a 2x improvement on terasort... nor is it worth a lot in the real world outside of benchmarks. I'm sure we both would take 2x on Pig/Hive any day... *smile*

bq. As for importing to Tez, I don't think the community has generally agreed to EOL MapReduce

Regardless of whether or not we pull this into MR, it would be useful to pull it into Tez too - if Sean wants to do it. Let's not discourage them. I'm sure we both agree, and want to see, real-world workloads improve, and Hive/Pig/Cascading etc. represent that. IAC, hopefully we can stop this meme that I'm trying to *preclude* you from doing anything regardless of my beliefs. IAC, we both realize MR is reasonably stable and won't get a lot of investment, and so do our employers:

http://vision.cloudera.com/mapreduce-spark/
http://hortonworks.com/hadoop/tez/

Essentially, you asked for feedback from the MapReduce community, and this is my honest feedback - as someone who has actively helped maintain this codebase for more than 8 years now. So, I'd appreciate it if we don't misinterpret each other's technical opinions and concerns during this discussion. Thanks in advance.

FTR: I'll restate my concerns about C++, the roadmap for the C++ runtime, maintainability, and support for all of Hadoop's requirements (new security bugs, future security features, platforms etc.). Furthermore, this jira was opened nearly 3 years ago and has only had sporadic bursts of activity - not a good sign for long-term maintainability. I've stated my concerns; let's try to get through them by focusing on those aspects.

Finally, what is the concern you see with starting this as an incubator project and allowing folks to develop a community around it? We can certainly help on our end by making it easy for them to plug in via interfaces etc. Thanks.

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAPREDUCE-5831) Old MR client is not compatible with new MR application
[ https://issues.apache.org/jira/browse/MAPREDUCE-5831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039136#comment-14039136 ]

Arun C Murthy commented on MAPREDUCE-5831:

I think the real way, going forward, to fix this is to ensure the MR client and the application are on the same version - MAPREDUCE-4421 should solve that, and be the default deployment model.

Old MR client is not compatible with new MR application

Key: MAPREDUCE-5831
URL: https://issues.apache.org/jira/browse/MAPREDUCE-5831
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: client, mr-am
Affects Versions: 2.2.0, 2.3.0
Reporter: Zhijie Shen
Assignee: Tan, Wangda
Priority: Critical

Recently, we saw the following scenario:

1. The user set up a cluster of Hadoop 2.3, which contains YARN 2.3 and MR 2.3.
2. The user ran a client on a machine where MR 2.2 is installed and on the classpath.

Then, when the user submitted a simple wordcount job, he saw the following message:

{code}
16:00:41,027 INFO main mapreduce.Job:1345 - map 100% reduce 100%
16:00:41,036 INFO main mapreduce.Job:1356 - Job job_1396468045458_0006 completed successfully
16:02:20,535 WARN main mapreduce.JobRunner:212 - Cannot start job [wordcountJob]
java.lang.IllegalArgumentException: No enum constant org.apache.hadoop.mapreduce.JobCounter.MB_MILLIS_REDUCES
	at java.lang.Enum.valueOf(Enum.java:236)
	at org.apache.hadoop.mapreduce.counters.FrameworkCounterGroup.valueOf(FrameworkCounterGroup.java:148)
	at org.apache.hadoop.mapreduce.counters.FrameworkCounterGroup.findCounter(FrameworkCounterGroup.java:182)
	at org.apache.hadoop.mapreduce.counters.AbstractCounters.findCounter(AbstractCounters.java:154)
	at org.apache.hadoop.mapreduce.TypeConverter.fromYarn(TypeConverter.java:240)
	at org.apache.hadoop.mapred.ClientServiceDelegate.getJobCounters(ClientServiceDelegate.java:370)
	at org.apache.hadoop.mapred.YARNRunner.getJobCounters(YARNRunner.java:511)
	at org.apache.hadoop.mapreduce.Job$7.run(Job.java:756)
	at org.apache.hadoop.mapreduce.Job$7.run(Job.java:753)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
	at org.apache.hadoop.mapreduce.Job.getCounters(Job.java:753)
	at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1361)
	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1289)
	. . .
{code}

The problem is that the wordcount job was running on one or more nodes of the YARN cluster, where MR 2.3 libs were installed and JobCounter.MB_MILLIS_REDUCES is available in the counters. On the other side, due to the classpath setting, the client was likely running with MR 2.2 libs. After the client retrieved the counters from the MR AM, it tried to construct the Counter object with the received counter name. Unfortunately, the enum didn't exist in the client's classpath, so the "No enum constant" exception is thrown. JobCounter.MB_MILLIS_REDUCES was brought to MR2 via MAPREDUCE-5464 in Hadoop 2.3.

--
This message was sent by Atlassian JIRA (v6.2#6252)
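The exception comes from Enum.valueOf, which throws IllegalArgumentException for any constant name the older client jar doesn't define. A defensive lookup - a sketch of the general technique, with a cut-down demo enum standing in for JobCounter, not the actual FrameworkCounterGroup fix - lets a client skip counters it doesn't know instead of failing the whole status call:

```java
public class CounterLookupDemo {
    // Older-client view of JobCounter: missing MB_MILLIS_REDUCES,
    // which a 2.3 AM may send back.
    enum JobCounter { NUM_KILLED_MAPS, NUM_KILLED_REDUCES }

    // Returns null for names this client doesn't know, instead of
    // letting IllegalArgumentException abort counter parsing.
    static JobCounter valueOfOrNull(String name) {
        try {
            return JobCounter.valueOf(name);
        } catch (IllegalArgumentException unknownConstant) {
            return null;  // caller can skip or log the unknown counter
        }
    }

    public static void main(String[] args) {
        System.out.println(valueOfOrNull("NUM_KILLED_MAPS"));   // known
        System.out.println(valueOfOrNull("MB_MILLIS_REDUCES")); // null
    }
}
```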
[jira] [Updated] (MAPREDUCE-5728) Check NPE for serializer/deserializer in MapTask
[ https://issues.apache.org/jira/browse/MAPREDUCE-5728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated MAPREDUCE-5728:
Fix Version/s: (was: 2.2.1) 2.5.0

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-5815) Fix NPE in TestMRAppMaster
[ https://issues.apache.org/jira/browse/MAPREDUCE-5815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated MAPREDUCE-5815:
Priority: Blocker (was: Major)

Fix NPE in TestMRAppMaster

Key: MAPREDUCE-5815
URL: https://issues.apache.org/jira/browse/MAPREDUCE-5815
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: client, mrv2
Reporter: Gera Shegalov
Assignee: Gera Shegalov
Priority: Blocker
Attachments: MAPREDUCE-5815.v01.patch

Working on MAPREDUCE-5813, I stumbled on NPEs in TestMRAppMaster. They seem to have been introduced by MAPREDUCE-5805.

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAPREDUCE-5815) Fix NPE in TestMRAppMaster
[ https://issues.apache.org/jira/browse/MAPREDUCE-5815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13954377#comment-13954377 ]

Arun C Murthy commented on MAPREDUCE-5815:

[~jira.shegalov] - I'm not sure we should quietly ignore the queue name... maybe the right fix is to ensure TestMRAppMaster actually sets a proper queue name? [~vinodkv] - Thoughts?

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-5815) Fix NPE in TestMRAppMaster
[ https://issues.apache.org/jira/browse/MAPREDUCE-5815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated MAPREDUCE-5815: - Status: Open (was: Patch Available) Fix NPE in TestMRAppMaster -- Key: MAPREDUCE-5815 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5815 Project: Hadoop Map/Reduce Issue Type: Bug Components: client, mrv2 Reporter: Gera Shegalov Assignee: Gera Shegalov Priority: Blocker Attachments: MAPREDUCE-5815.v01.patch While working on MAPREDUCE-5813, I stumbled on NPEs in TestMRAppMaster. They seem to have been introduced by MAPREDUCE-5805. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-5813) YarnChild does not load job.xml with mapreduce.job.classloader=true
[ https://issues.apache.org/jira/browse/MAPREDUCE-5813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated MAPREDUCE-5813: - Resolution: Fixed Fix Version/s: 2.4.0 Status: Resolved (was: Patch Available) I just committed this. Thanks [~jira.shegalov]! YarnChild does not load job.xml with mapreduce.job.classloader=true Key: MAPREDUCE-5813 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5813 Project: Hadoop Map/Reduce Issue Type: Bug Components: mrv2, task Affects Versions: 2.3.0 Reporter: Gera Shegalov Assignee: Gera Shegalov Priority: Blocker Fix For: 2.4.0 Attachments: MAPREDUCE-5813.v01.patch, MAPREDUCE-5813.v02.patch {{YarnChild.main}} uses {{JobConf.addResource(String)}}, which relies on class loading, to load {{job.xml}}. When {{mapreduce.job.classloader=true}} the job-specific part of the class path is separated from {{CLASSPATH}} into {{APP_CLASSPATH}}. Therefore {{job.xml}} is inaccessible to the default class loader. Later {{writeLocalJobFile}} overwrites the correct localized {{job.xml}} on disk as well. This problem is easily avoided by using {{JobConf.addResource(Path)}} to read the localized {{job.xml}} without relying on class loading. -- This message was sent by Atlassian JIRA (v6.2#6252)
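The distinction the issue turns on — the String overload of addResource resolves through the classloader, while the Path overload reads the file directly from disk — can be demonstrated with plain JDK calls. This standalone sketch only mimics the two lookup styles; it does not use Hadoop's Configuration class:

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Why Configuration.addResource(String) can miss a localized job.xml: the
// String form resolves via the classloader, the Path form reads from disk.
// Here a file that plainly exists on disk is invisible to the classloader
// because its directory is not on the classpath -- the MAPREDUCE-5813 symptom.
public class ResourceLookupDemo {
    static boolean foundOnClasspath(String name) {
        return ResourceLookupDemo.class.getClassLoader().getResource(name) != null;
    }

    static boolean foundOnDisk(Path p) {
        return Files.isReadable(p);
    }

    // Returns true when the localized file is readable directly but not
    // reachable through the classloader, mirroring the bug scenario.
    static boolean missesClasspathButOnDisk() throws Exception {
        Path jobXml = Files.createTempFile("job", ".xml"); // localized copy
        Files.writeString(jobXml, "<configuration/>");
        boolean result = !foundOnClasspath(jobXml.getFileName().toString())
            && foundOnDisk(jobXml);
        Files.delete(jobXml);
        return result;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("missed by classloader, present on disk: "
            + missesClasspathButOnDisk());
    }
}
```

Reading the localized file by path sidesteps the classloader entirely, which is exactly what switching to addResource(Path) achieves in the fix.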
[jira] [Updated] (MAPREDUCE-5813) YarnChild does not load job.xml with mapreduce.job.classloader=true
[ https://issues.apache.org/jira/browse/MAPREDUCE-5813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated MAPREDUCE-5813: - Status: Open (was: Patch Available) Agree w/ [~sjlee0]. [~jira.shegalov] - mind updating the patch? Tx. YarnChild does not load job.xml with mapreduce.job.classloader=true Key: MAPREDUCE-5813 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5813 Project: Hadoop Map/Reduce Issue Type: Bug Components: mrv2, task Affects Versions: 2.3.0 Reporter: Gera Shegalov Assignee: Gera Shegalov Priority: Blocker Attachments: MAPREDUCE-5813.v01.patch {{YarnChild.main}} uses {{JobConf.addResource(String)}}, which relies on class loading, to load {{job.xml}}. When {{mapreduce.job.classloader=true}} the job-specific part of the class path is separated from {{CLASSPATH}} into {{APP_CLASSPATH}}. Therefore {{job.xml}} is inaccessible to the default class loader. Later {{writeLocalJobFile}} overwrites the correct localized {{job.xml}} on disk as well. This problem is easily avoided by using {{JobConf.addResource(Path)}} to read the localized {{job.xml}} without relying on class loading. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-5553) Add task state filters on Application/MRJob page for MR Application master
[ https://issues.apache.org/jira/browse/MAPREDUCE-5553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated MAPREDUCE-5553: - Assignee: Paul Han Add task state filters on Application/MRJob page for MR Application master --- Key: MAPREDUCE-5553 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5553 Project: Hadoop Map/Reduce Issue Type: Improvement Components: applicationmaster Affects Versions: 2.0.5-alpha Reporter: Paul Han Assignee: Paul Han Priority: Minor Attachments: FilteredTaskPage.png, MAPREDUCE-5553.patch, MRJobWithTaskStateFilter.png, MRJobWithoutTaskStateFilters.png On the Job page of the MR application master, the task attempts have a nice breakdown by state: running, failed. But at the map/reduce task level, there is only one link to all tasks, without a breakdown by state: pending, running, complete, etc. With Hadoop 1, similar functionality exists: the user can go through jobtracker -> job -> map/reduce tasks -> attempts. This ticket improves usability by allowing the user to go from the job to total, pending, running, and complete map/reduce tasks. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAPREDUCE-5553) Add task state filters on Application/MRJob page for MR Application master
[ https://issues.apache.org/jira/browse/MAPREDUCE-5553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13932480#comment-13932480 ] Arun C Murthy commented on MAPREDUCE-5553: -- I just committed this. Apologies for taking so long to get back to you - thanks [~paulhan]! Add task state filters on Application/MRJob page for MR Application master --- Key: MAPREDUCE-5553 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5553 Project: Hadoop Map/Reduce Issue Type: Improvement Components: applicationmaster Affects Versions: 2.0.5-alpha Reporter: Paul Han Assignee: Paul Han Priority: Minor Fix For: 2.4.0 Attachments: FilteredTaskPage.png, MAPREDUCE-5553.patch, MRJobWithTaskStateFilter.png, MRJobWithoutTaskStateFilters.png On the Job page of the MR application master, the task attempts have a nice breakdown by state: running, failed. But at the map/reduce task level, there is only one link to all tasks, without a breakdown by state: pending, running, complete, etc. With Hadoop 1, similar functionality exists: the user can go through jobtracker -> job -> map/reduce tasks -> attempts. This ticket improves usability by allowing the user to go from the job to total, pending, running, and complete map/reduce tasks. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-5553) Add task state filters on Application/MRJob page for MR Application master
[ https://issues.apache.org/jira/browse/MAPREDUCE-5553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated MAPREDUCE-5553: - Resolution: Fixed Fix Version/s: 2.4.0 Status: Resolved (was: Patch Available) Add task state filters on Application/MRJob page for MR Application master --- Key: MAPREDUCE-5553 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5553 Project: Hadoop Map/Reduce Issue Type: Improvement Components: applicationmaster Affects Versions: 2.0.5-alpha Reporter: Paul Han Assignee: Paul Han Priority: Minor Fix For: 2.4.0 Attachments: FilteredTaskPage.png, MAPREDUCE-5553.patch, MRJobWithTaskStateFilter.png, MRJobWithoutTaskStateFilters.png On the Job page of the MR application master, the task attempts have a nice breakdown by state: running, failed. But at the map/reduce task level, there is only one link to all tasks, without a breakdown by state: pending, running, complete, etc. With Hadoop 1, similar functionality exists: the user can go through jobtracker -> job -> map/reduce tasks -> attempts. This ticket improves usability by allowing the user to go from the job to total, pending, running, and complete map/reduce tasks. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-4868) Allow multiple iteration for map
[ https://issues.apache.org/jira/browse/MAPREDUCE-4868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated MAPREDUCE-4868: - Fix Version/s: (was: 2.3.0) 2.4.0 Allow multiple iteration for map Key: MAPREDUCE-4868 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4868 Project: Hadoop Map/Reduce Issue Type: Improvement Components: mrv2 Affects Versions: 3.0.0, 2.0.3-alpha Reporter: Jerry Chen Fix For: 3.0.0, 2.4.0 Original Estimate: 168h Remaining Estimate: 168h Currently, the Mapper class allows advanced users to override the public void run(Context context) method for more control over the execution of the mapper, while the Context interface limits the operations over the data, which is the foundation of that control. One use case: when considering a Hive optimization problem, I want to make two passes over the input data instead of using another job or task (which may slow the whole process). Each pass does the same thing but with different parameters. This is a new paradigm of MapReduce usage and can be achieved easily by extending the Context interface a little, with more control over the data, such as resetting the input. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
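The "reset the input" idea can be sketched with a resettable record source driven from an overridden run() loop. All of the interfaces and names below are hypothetical stand-ins for an extended Mapper.Context; nothing like ResettableContext exists in Hadoop:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of multiple passes over map input: a resettable record
// source lets a two-pass algorithm compute a statistic on pass one and use it
// on pass two, without a second job or task.
public class TwoPassMapper {
    interface ResettableContext<T> {
        boolean nextKeyValue();
        T getCurrentValue();
        void resetInput();   // the proposed extension point
    }

    static class ListContext<T> implements ResettableContext<T> {
        private final List<T> records;
        private int pos = -1;
        ListContext(List<T> records) { this.records = records; }
        public boolean nextKeyValue() { return ++pos < records.size(); }
        public T getCurrentValue() { return records.get(pos); }
        public void resetInput() { pos = -1; }
    }

    // Pass 1 finds the maximum; pass 2 sums values normalized by that maximum.
    static double normalizedSum(ResettableContext<Integer> ctx) {
        int max = 0;
        while (ctx.nextKeyValue()) max = Math.max(max, ctx.getCurrentValue());
        ctx.resetInput();
        double sum = 0;
        while (ctx.nextKeyValue()) sum += ctx.getCurrentValue() / (double) max;
        return sum;
    }

    public static void main(String[] args) {
        // 1/4 + 2/4 + 4/4 = 1.75
        System.out.println(normalizedSum(new ListContext<>(Arrays.asList(1, 2, 4))));
    }
}
```

In a real implementation the reset would have to re-open the underlying input split, which is why the proposal asks for support in the Context rather than in user code.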
[jira] [Updated] (MAPREDUCE-5728) Check NPE for serializer/deserializer in MapTask
[ https://issues.apache.org/jira/browse/MAPREDUCE-5728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated MAPREDUCE-5728: - Fix Version/s: (was: 2.3.0) 2.4.0 Check NPE for serializer/deserializer in MapTask Key: MAPREDUCE-5728 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5728 Project: Hadoop Map/Reduce Issue Type: Improvement Components: client Affects Versions: 2.2.0 Reporter: Jerry He Assignee: Jerry He Priority: Minor Fix For: 2.2.1, 2.4.0 Attachments: MAPREDUCE-5728-trunk.patch Currently we get an NPE if the serializer/deserializer is not configured correctly. {code} 14/01/14 11:52:35 INFO mapred.JobClient: Task Id : attempt_201401072154_0027_m_02_2, Status : FAILED java.lang.NullPointerException at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.init(MapTask.java:944) at org.apache.hadoop.mapred.MapTask$NewOutputCollector.init(MapTask.java:672) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:740) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:368) at org.apache.hadoop.mapred.Child$4.run(Child.java:255) at java.security.AccessController.doPrivileged(AccessController.java:362) at javax.security.auth.Subject.doAs(Subject.java:573) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1502) at org.apache.hadoop.mapred.Child.main(Child.java:249) {code} serializationFactory.getSerializer and serializationFactory.getDeserializer return null in this case. Let's check for a null serializer/deserializer in MapTask so that we don't get a meaningless NPE. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (MAPREDUCE-4462) Enhance readability of TestFairScheduler.java
[ https://issues.apache.org/jira/browse/MAPREDUCE-4462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated MAPREDUCE-4462: - Fix Version/s: (was: 2.3.0) 2.4.0 Enhance readability of TestFairScheduler.java - Key: MAPREDUCE-4462 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4462 Project: Hadoop Map/Reduce Issue Type: Bug Components: scheduler, test Reporter: Ryan Hennig Priority: Minor Labels: comments, test Fix For: 2.4.0 Attachments: MAPREDUCE-4462.patch Original Estimate: 2h Remaining Estimate: 2h While reading over the unit tests for the Fair Scheduler introduced by MAPREDUCE-3451, I added comments to make the logic of the test easier to grok quickly. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (MAPREDUCE-5559) Reconsidering the policy of ignoring the blacklist after reaching the threshold
[ https://issues.apache.org/jira/browse/MAPREDUCE-5559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated MAPREDUCE-5559: - Fix Version/s: (was: 2.3.0) 2.4.0 Reconsidering the policy of ignoring the blacklist after reaching the threshold --- Key: MAPREDUCE-5559 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5559 Project: Hadoop Map/Reduce Issue Type: Improvement Affects Versions: 2.1.1-beta Reporter: Zhijie Shen Assignee: Zhijie Shen Fix For: 2.4.0 Currently, when the MR AM finds that the number of blacklisted nodes reaches a threshold, the blacklist is ignored entirely. Newly assigned containers on the blacklisted nodes will be allocated. This may not be the best practice. We need to reconsider it. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (MAPREDUCE-5028) Maps fail when io.sort.mb is set to high value
[ https://issues.apache.org/jira/browse/MAPREDUCE-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated MAPREDUCE-5028: - Fix Version/s: (was: 2.3.0) 2.4.0 Maps fail when io.sort.mb is set to high value -- Key: MAPREDUCE-5028 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5028 Project: Hadoop Map/Reduce Issue Type: Bug Affects Versions: 1.1.1, 2.0.3-alpha, 0.23.5 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Critical Fix For: 1.2.0, 2.4.0 Attachments: MR-5028_testapp.patch, mr-5028-branch1.patch, mr-5028-branch1.patch, mr-5028-branch1.patch, mr-5028-trunk.patch, mr-5028-trunk.patch, mr-5028-trunk.patch, repro-mr-5028.patch Verified the problem exists on branch-1 with the following configuration: Pseudo-dist mode: 2 maps/ 1 reduce, mapred.child.java.opts=-Xmx2048m, io.sort.mb=1280, dfs.block.size=2147483648 Run teragen to generate 4 GB data Maps fail when you run wordcount on this configuration with the following error: {noformat} java.io.IOException: Spill failed at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1031) at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:692) at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80) at org.apache.hadoop.examples.WordCount$TokenizerMapper.map(WordCount.java:45) at org.apache.hadoop.examples.WordCount$TokenizerMapper.map(WordCount.java:34) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:766) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370) at org.apache.hadoop.mapred.Child$4.run(Child.java:255) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149) at org.apache.hadoop.mapred.Child.main(Child.java:249) Caused by: java.io.EOFException at 
java.io.DataInputStream.readInt(DataInputStream.java:375) at org.apache.hadoop.io.IntWritable.readFields(IntWritable.java:38) at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67) at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40) at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:116) at org.apache.hadoop.mapreduce.ReduceContext.nextKey(ReduceContext.java:92) at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:175) at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1505) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1438) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:855) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1346) {noformat} -- This message was sent by Atlassian JIRA (v6.1.5#6160)
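As a general illustration of why buffer-sizing code must be careful with large io.sort.mb values (this is not necessarily the root cause fixed in this issue, which concerns arithmetic inside MapOutputBuffer): converting megabytes to bytes in 32-bit int arithmetic silently overflows at 2048 MB:

```java
// Sort-buffer sizing converts io.sort.mb to bytes. With 32-bit ints the
// conversion overflows at 2048 MB; widening to long before shifting is safe.
// (Illustrative only; the actual MAPREDUCE-5028 fix addresses the buffer's
// internal index arithmetic, not this exact expression.)
public class SortMbOverflow {
    static int bufferBytesInt(int sortMb) {
        return sortMb << 20;               // overflows for sortMb >= 2048
    }

    static long bufferBytesLong(int sortMb) {
        return ((long) sortMb) << 20;      // widen first, then shift
    }

    public static void main(String[] args) {
        System.out.println(bufferBytesInt(1280));   // 1342177280, still fits
        System.out.println(bufferBytesInt(2048));   // -2147483648, overflow
        System.out.println(bufferBytesLong(2048));  // 2147483648
    }
}
```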
[jira] [Updated] (MAPREDUCE-5036) Default shuffle handler port should not be 8080
[ https://issues.apache.org/jira/browse/MAPREDUCE-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated MAPREDUCE-5036: - Fix Version/s: (was: 2.3.0) 2.4.0 Default shuffle handler port should not be 8080 --- Key: MAPREDUCE-5036 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5036 Project: Hadoop Map/Reduce Issue Type: Improvement Affects Versions: 2.0.3-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza Fix For: 2.4.0 Attachments: MAPREDUCE-5036-13562.patch, MAPREDUCE-5036-2.patch, MAPREDUCE-5036.patch The shuffle handler port (mapreduce.shuffle.port) defaults to 8080. This is a pretty common port for web services, and is likely to cause unnecessary port conflicts. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (MAPREDUCE-4253) Tests for mapreduce-client-core are lying under mapreduce-client-jobclient
[ https://issues.apache.org/jira/browse/MAPREDUCE-4253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated MAPREDUCE-4253: - Fix Version/s: (was: 2.3.0) 2.4.0 Tests for mapreduce-client-core are lying under mapreduce-client-jobclient -- Key: MAPREDUCE-4253 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4253 Project: Hadoop Map/Reduce Issue Type: Task Components: client Affects Versions: 2.0.0-alpha Reporter: Harsh J Assignee: Tsuyoshi OZAWA Fix For: 2.4.0 Attachments: MR-4253.1.patch, MR-4253.2.patch, crossing_project_checker.rb, result.txt Many of the tests for client libs from mapreduce-client-core are lying under mapreduce-client-jobclient. We should investigate if this is the right thing to do and if not, move the tests back into client-core. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (MAPREDUCE-4468) Encapsulate FairScheduler preemption logic into helper class
[ https://issues.apache.org/jira/browse/MAPREDUCE-4468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated MAPREDUCE-4468: - Fix Version/s: (was: 2.3.0) 2.4.0 Encapsulate FairScheduler preemption logic into helper class Key: MAPREDUCE-4468 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4468 Project: Hadoop Map/Reduce Issue Type: Improvement Components: scheduler Reporter: Ryan Hennig Priority: Minor Labels: refactoring, scheduler Fix For: 2.4.0 Attachments: MAPREDUCE-4468.patch Original Estimate: 4h Remaining Estimate: 4h I've extracted the preemption logic from the Fair Scheduler into a helper class so that FairScheduler is closer to following the Single Responsibility Principle. This may eventually evolve into a generalized preemption module which could be leveraged by other schedulers. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (MAPREDUCE-5663) Add an interface to Input/Output Formats to obtain delegation tokens
[ https://issues.apache.org/jira/browse/MAPREDUCE-5663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated MAPREDUCE-5663: - Status: Open (was: Patch Available) +1, mostly lgtm. One nit: I'd like to change the API to 'obtainCredentials' rather than 'addCredentials'. That threw me off a little. Last thing - a more descriptive javadoc would help too. Thanks. Add an interface to Input/Output Formats to obtain delegation tokens --- Key: MAPREDUCE-5663 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5663 Project: Hadoop Map/Reduce Issue Type: Improvement Reporter: Siddharth Seth Assignee: Michael Weng Attachments: MAPREDUCE-5663.4.txt, MAPREDUCE-5663.5.txt, MAPREDUCE-5663.patch.txt, MAPREDUCE-5663.patch.txt2, MAPREDUCE-5663.patch.txt3 Currently, delegation tokens are obtained as part of the getSplits / checkOutputSpecs calls to the InputFormat / OutputFormat respectively. This works as long as the splits are generated on a node with kerberos credentials. For split generation elsewhere (AM for example), an explicit interface is required. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
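The shape of the explicit interface under discussion, using the obtainCredentials name Arun suggests, might look like the sketch below. The types are simplified stand-ins, not Hadoop's actual Credentials or InputFormat classes, and the token string is invented for illustration:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of an explicit credential-obtaining hook for Input/Output Formats,
// so delegation tokens can be fetched outside getSplits()/checkOutputSpecs()
// (e.g. when splits are generated in the AM). Simplified stand-in types.
public class CredentialInterfaceSketch {
    static class Credentials {
        final Set<String> tokens = new HashSet<>();
        void addToken(String t) { tokens.add(t); }
    }

    /** Implemented by a format that needs delegation tokens fetched explicitly. */
    interface CredentialsProvider {
        void obtainCredentials(Credentials creds);
    }

    static class ExampleInputFormat implements CredentialsProvider {
        public void obtainCredentials(Credentials creds) {
            creds.addToken("example-delegation-token"); // illustrative only
        }
    }

    static Set<String> demo() {
        Credentials creds = new Credentials();
        new ExampleInputFormat().obtainCredentials(creds);
        return creds.tokens;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The rename matters for readability: the format obtains tokens from an external service and adds them to the passed-in credentials, so "obtain" names the expensive, security-relevant half of the operation.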
[jira] [Issue Comment Deleted] (MAPREDUCE-5541) Improved algorithm for whether need speculative task
[ https://issues.apache.org/jira/browse/MAPREDUCE-5541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated MAPREDUCE-5541: - Comment: was deleted (was: Committed to hadoop-longwing/condor-branch-1.) Improved algorithm for whether need speculative task Key: MAPREDUCE-5541 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5541 Project: Hadoop Map/Reduce Issue Type: Improvement Components: mrv1 Affects Versions: 1.2.1 Reporter: zhaoyunjiong Assignee: zhaoyunjiong Fix For: 1.2.2 Attachments: MAPREDUCE-5541-branch-1.2.patch, MAPREDUCE-5541-branch-1.2.patch Most of the time, tasks won't start running at the same time. In this case, hasSpeculativeTask in TaskInProgress does not work very well. Sometimes, some tasks have just started running and the scheduler has already decided they need a speculative task to run. This wastes a lot of resources. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (MAPREDUCE-5541) Improved algorithm for whether need speculative task
[ https://issues.apache.org/jira/browse/MAPREDUCE-5541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13801360#comment-13801360 ] Arun C Murthy commented on MAPREDUCE-5541: -- [~zhaoyunjiong] The patch looks good. Can you pls share your experience running with this patch? What kind of tests have you done? Tx. Improved algorithm for whether need speculative task Key: MAPREDUCE-5541 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5541 Project: Hadoop Map/Reduce Issue Type: Improvement Components: mrv1 Affects Versions: 1.2.1 Reporter: zhaoyunjiong Assignee: zhaoyunjiong Fix For: 1.2.2 Attachments: MAPREDUCE-5541-branch-1.2.patch, MAPREDUCE-5541-branch-1.2.patch Most of the time, tasks won't start running at the same time. In this case, hasSpeculativeTask in TaskInProgress does not work very well. Sometimes, some tasks have just started running and the scheduler has already decided they need a speculative task to run. This wastes a lot of resources. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (MAPREDUCE-5541) Improved algorithm for whether need speculative task
[ https://issues.apache.org/jira/browse/MAPREDUCE-5541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13801488#comment-13801488 ] Arun C Murthy commented on MAPREDUCE-5541: -- Thanks. Meanwhile, can you please do a full run of 'ant test' and report results? I'm almost sure this will break some unit tests. Thanks. Improved algorithm for whether need speculative task Key: MAPREDUCE-5541 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5541 Project: Hadoop Map/Reduce Issue Type: Improvement Components: mrv1 Affects Versions: 1.2.1 Reporter: zhaoyunjiong Assignee: zhaoyunjiong Fix For: 1.2.2 Attachments: MAPREDUCE-5541-branch-1.2.patch, MAPREDUCE-5541-branch-1.2.patch Most of the time, tasks won't start running at the same time. In this case, hasSpeculativeTask in TaskInProgress does not work very well. Sometimes, some tasks have just started running and the scheduler has already decided they need a speculative task to run. This wastes a lot of resources. -- This message was sent by Atlassian JIRA (v6.1#6144)
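The guard the issue argues for, roughly, is: do not declare a task a speculation candidate until it has run long enough for its progress to be informative. The sketch below is illustrative only; the threshold values and method names are invented, not the patch's actual logic:

```java
// Illustrative guard against speculating on tasks that have only just
// started: a brand-new task's progress rate says nothing yet, so judging it
// against peers wastes a slot. Thresholds here are hypothetical.
public class SpeculationGuard {
    static final long MIN_RUNTIME_MS = 60_000; // don't judge brand-new tasks

    static boolean needsSpeculation(long runtimeMs, double progress,
                                    double avgProgress) {
        if (runtimeMs < MIN_RUNTIME_MS) {
            return false; // too new: progress is not yet meaningful
        }
        return progress < avgProgress * 0.5; // far behind its peers
    }

    public static void main(String[] args) {
        System.out.println(needsSpeculation(5_000, 0.01, 0.4));  // false: too new
        System.out.println(needsSpeculation(120_000, 0.1, 0.4)); // true: lagging
    }
}
```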
[jira] [Commented] (MAPREDUCE-5583) Ability to limit running map and reduce tasks
[ https://issues.apache.org/jira/browse/MAPREDUCE-5583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13795429#comment-13795429 ] Arun C Murthy commented on MAPREDUCE-5583: -- Looks like we are rehashing the discussion on HADOOP-5170. Ability to limit running map and reduce tasks - Key: MAPREDUCE-5583 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5583 Project: Hadoop Map/Reduce Issue Type: Improvement Components: mr-am, mrv2 Affects Versions: 0.23.9, 2.1.1-beta Reporter: Jason Lowe It would be nice if users could specify a limit to the number of map or reduce tasks that are running simultaneously. Occasionally users are performing operations in tasks that can lead to DDoS scenarios if too many tasks run simultaneously (e.g.: accessing a database, web service, etc.). Having the ability to throttle the number of tasks simultaneously running would provide users a way to mitigate issues with too many tasks on a large cluster attempting to access a service at any one time. This is similar to the functionality requested by MAPREDUCE-224 and implemented by HADOOP-3412 but was dropped in mrv2. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (MAPREDUCE-5583) Ability to limit running map and reduce tasks
[ https://issues.apache.org/jira/browse/MAPREDUCE-5583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13795427#comment-13795427 ] Arun C Murthy commented on MAPREDUCE-5583: -- I'm very concerned about allowing this sort of control to the end user without checks/balances. If everyone does the same ("I want a max of n tasks"), all resources in the cluster can deadlock. This is one of the main reasons we haven't supported this. I'm still against it. Ability to limit running map and reduce tasks - Key: MAPREDUCE-5583 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5583 Project: Hadoop Map/Reduce Issue Type: Improvement Components: mr-am, mrv2 Affects Versions: 0.23.9, 2.1.1-beta Reporter: Jason Lowe It would be nice if users could specify a limit to the number of map or reduce tasks that are running simultaneously. Occasionally users are performing operations in tasks that can lead to DDoS scenarios if too many tasks run simultaneously (e.g.: accessing a database, web service, etc.). Having the ability to throttle the number of tasks simultaneously running would provide users a way to mitigate issues with too many tasks on a large cluster attempting to access a service at any one time. This is similar to the functionality requested by MAPREDUCE-224 and implemented by HADOOP-3412 but was dropped in mrv2. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (MAPREDUCE-5583) Ability to limit running map and reduce tasks
[ https://issues.apache.org/jira/browse/MAPREDUCE-5583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13795443#comment-13795443 ] Arun C Murthy commented on MAPREDUCE-5583: -- Cluster with 100,000 containers, 1,000 jobs, each with 10 tasks, and specifies that they can only run 5 tasks. So, you are now only using 5% of the cluster and no one makes progress leading to very poor utilization and peanut-buttering effect. Admittedly it's a contrived example and yes, I agree a user can hack his own AM to do this - but let's not make this trivial for normal users. This leads to all sorts of bad side-effects by supporting it out of the box. Some form of admin control (e.g. queue with a max-cap) for a small number of use-cases where you *actually* need this feature is much safer. Ability to limit running map and reduce tasks - Key: MAPREDUCE-5583 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5583 Project: Hadoop Map/Reduce Issue Type: Improvement Components: mr-am, mrv2 Affects Versions: 0.23.9, 2.1.1-beta Reporter: Jason Lowe It would be nice if users could specify a limit to the number of map or reduce tasks that are running simultaneously. Occasionally users are performing operations in tasks that can lead to DDoS scenarios if too many tasks run simultaneously (e.g.: accessing a database, web service, etc.). Having the ability to throttle the number of tasks simultaneously running would provide users a way to mitigate issues with too many tasks on a large cluster attempting to access a service at any one time. This is similar to the functionality requested by MAPREDUCE-224 and implemented by HADOOP-3412 but was dropped in mrv2. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (MAPREDUCE-5583) Ability to limit running map and reduce tasks
[ https://issues.apache.org/jira/browse/MAPREDUCE-5583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13794782#comment-13794782 ] Arun C Murthy commented on MAPREDUCE-5583: -- [~jlowe] We can already accomplish this with queue/user limits? Ability to limit running map and reduce tasks - Key: MAPREDUCE-5583 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5583 Project: Hadoop Map/Reduce Issue Type: Improvement Components: mr-am, mrv2 Affects Versions: 0.23.9, 2.1.1-beta Reporter: Jason Lowe It would be nice if users could specify a limit to the number of map or reduce tasks that are running simultaneously. Occasionally users are performing operations in tasks that can lead to DDoS scenarios if too many tasks run simultaneously (e.g.: accessing a database, web service, etc.). Having the ability to throttle the number of tasks simultaneously running would provide users a way to mitigate issues with too many tasks on a large cluster attempting to access a service at any one time. This is similar to the functionality requested by MAPREDUCE-224 and implemented by HADOOP-3412 but was dropped in mrv2. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (MAPREDUCE-5565) job clean up fails on secure cluster as the file system is not created in the context of the ugi running the job
[ https://issues.apache.org/jira/browse/MAPREDUCE-5565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated MAPREDUCE-5565: - Resolution: Duplicate Status: Resolved (was: Patch Available) Duplicate of MAPREDUCE-5508. job clean up fails on secure cluster as the file system is not created in the context of the ugi running the job Key: MAPREDUCE-5565 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5565 Project: Hadoop Map/Reduce Issue Type: Bug Components: jobtracker Affects Versions: 1.2.1 Reporter: Arpit Gupta Assignee: Arun C Murthy Priority: Critical Fix For: 1.3.0 Attachments: MAPREDUCE-5565.patch, MAPREDUCE-5565.patch On secure clusters we see the following exceptions in the jt log {code} 2013-10-04 04:52:31,753 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:tt/host@REALM cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)] {code} And after the job finishes the staging dir is not cleaned up. While debugging with [~acmurthy] we determined that the file system object needs to be created in the context of the user who ran the job. The job, however, completes successfully. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (MAPREDUCE-5432) JobHistoryParser does not fetch clockSplits, cpuUsages, vMemKbytes, physMemKbytes from history file
[ https://issues.apache.org/jira/browse/MAPREDUCE-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated MAPREDUCE-5432: - Status: Open (was: Patch Available) [~ozawa] Patch looks good. Can you pls modify existing tests to check for the new values? Tx. JobHistoryParser does not fetch clockSplits, cpuUsages, vMemKbytes, physMemKbytes from history file --- Key: MAPREDUCE-5432 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5432 Project: Hadoop Map/Reduce Issue Type: Bug Components: jobhistoryserver, mrv2 Affects Versions: 2.0.4-alpha Reporter: Vrushali C Assignee: Tsuyoshi OZAWA Attachments: MAPREDUCE-5432.1.patch, MAPREDUCE-5432.2.patch JobHistoryParser's handleMapAttemptFinishedEvent() function does not look at MapAttemptFinishedEvent's int[] clockSplits; int[] cpuUsages; int[] vMemKbytes; int[] physMemKbytes. JobHistoryParser's inner class TaskAttemptInfo also needs to be enhanced to have these as members so that handleMapAttemptFinishedEvent() can get them and store them. -- This message was sent by Atlassian JIRA (v6.1#6144)
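The shape of the fix described above can be sketched in miniature: the attempt-info class gains the four per-attempt arrays, and the map-attempt-finished handler copies them off the event. The classes below are simplified stand-ins, not the actual Hadoop JobHistoryParser types:

```java
import java.util.Arrays;

// Miniature of the MAPREDUCE-5432 fix: TaskAttemptInfo gains the four
// per-attempt arrays and the handler copies them from the event instead of
// dropping them. Simplified stand-in types, not Hadoop's classes.
public class HistoryParserSketch {
    static class MapAttemptFinishedEvent {
        int[] clockSplits, cpuUsages, vMemKbytes, physMemKbytes;
    }

    static class TaskAttemptInfo {
        int[] clockSplits, cpuUsages, vMemKbytes, physMemKbytes; // new members
    }

    static TaskAttemptInfo handleMapAttemptFinished(MapAttemptFinishedEvent e) {
        TaskAttemptInfo info = new TaskAttemptInfo();
        info.clockSplits = e.clockSplits;     // previously ignored by the parser
        info.cpuUsages = e.cpuUsages;
        info.vMemKbytes = e.vMemKbytes;
        info.physMemKbytes = e.physMemKbytes;
        return info;
    }

    static int[] demo() {
        MapAttemptFinishedEvent e = new MapAttemptFinishedEvent();
        e.cpuUsages = new int[] {10, 20, 30};
        return handleMapAttemptFinished(e).cpuUsages;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(demo()));
    }
}
```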
[jira] [Updated] (MAPREDUCE-5565) job clean up fails on secure cluster as the file system is not created in the context of the ugi running the job
[ https://issues.apache.org/jira/browse/MAPREDUCE-5565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated MAPREDUCE-5565: - Fix Version/s: 1.3.0 job clean up fails on secure cluster as the file system is not created in the context of the ugi running the job Key: MAPREDUCE-5565 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5565 Project: Hadoop Map/Reduce Issue Type: Bug Components: jobtracker Affects Versions: 1.2.1 Reporter: Arpit Gupta Assignee: Arun C Murthy Priority: Critical Fix For: 1.3.0 Attachments: MAPREDUCE-5565.patch On secure clusters we see the following exceptions in the jt log {code} 2013-10-04 04:52:31,753 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:tt/host@REALM cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)] {code} And after the job finishes the staging dir is not cleaned up. While debugging with [~acmurthy] we determined that the file system object needs to be created in the context of the user who ran the job. The job, however, completes successfully. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (MAPREDUCE-5565) job clean up fails on secure cluster as the file system is not created in the context of the ugi running the job
[ https://issues.apache.org/jira/browse/MAPREDUCE-5565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated MAPREDUCE-5565: - Status: Patch Available (was: Open) job clean up fails on secure cluster as the file system is not created in the context of the ugi running the job Key: MAPREDUCE-5565 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5565 Project: Hadoop Map/Reduce Issue Type: Bug Components: jobtracker Affects Versions: 1.2.1 Reporter: Arpit Gupta Assignee: Arun C Murthy Priority: Critical Attachments: MAPREDUCE-5565.patch On secure clusters we see the following exceptions in the jt log {code} 2013-10-04 04:52:31,753 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:tt/host@REALM cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)] {code} And after the job finishes the staging dir is not cleaned up. While debugging with [~acmurthy] we determined that the file system object needs to be created in the context of the user who ran the job. The job, however, completes successfully. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (MAPREDUCE-5565) job clean up fails on secure cluster as the file system is not created in the context of the ugi running the job
[ https://issues.apache.org/jira/browse/MAPREDUCE-5565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated MAPREDUCE-5565: - Attachment: MAPREDUCE-5565.patch Thanks for helping to track this down [~arpitgupta]! Here is a straightforward patch which should fix the regression due to MAPREDUCE-5351. job clean up fails on secure cluster as the file system is not created in the context of the ugi running the job Key: MAPREDUCE-5565 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5565 Project: Hadoop Map/Reduce Issue Type: Bug Components: jobtracker Affects Versions: 1.2.1 Reporter: Arpit Gupta Assignee: Arun C Murthy Priority: Critical Attachments: MAPREDUCE-5565.patch On secure clusters we see the following exceptions in the jt log {code} 2013-10-04 04:52:31,753 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:tt/host@REALM cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)] {code} And after the job finishes the staging dir is not cleaned up. While debugging with [~acmurthy] we determined that the file system object needs to be created in the context of the user who ran the job. The job, however, completes successfully. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (MAPREDUCE-5565) job clean up fails on secure cluster as the file system is not created in the context of the ugi running the job
[ https://issues.apache.org/jira/browse/MAPREDUCE-5565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated MAPREDUCE-5565: - Attachment: MAPREDUCE-5565.patch Slightly more conservative patch to reset tempDirFs in catch blocks. job clean up fails on secure cluster as the file system is not created in the context of the ugi running the job Key: MAPREDUCE-5565 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5565 Project: Hadoop Map/Reduce Issue Type: Bug Components: jobtracker Affects Versions: 1.2.1 Reporter: Arpit Gupta Assignee: Arun C Murthy Priority: Critical Fix For: 1.3.0 Attachments: MAPREDUCE-5565.patch, MAPREDUCE-5565.patch On secure clusters we see the following exceptions in the jt log {code} 2013-10-04 04:52:31,753 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:tt/host@REALM cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)] {code} And after the job finishes the staging dir is not cleaned up. While debugging with [~acmurthy] we determined that the file system object needs to be created in the context of the user who ran the job. The job, however, completes successfully. -- This message was sent by Atlassian JIRA (v6.1#6144)
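The cleanup fix described in the MAPREDUCE-5565 entries above hinges on one pattern: the work must run inside a doAs block so that credentials resolve to the job's user rather than the daemon's. A minimal illustration of that shape using only JDK classes (in the real patch the privileged action would obtain the FileSystem and delete the staging dir; the string here is a hypothetical stand-in for that work):

```java
import javax.security.auth.Subject;
import java.security.PrivilegedAction;

// Sketch of the doAs pattern: any filesystem work that must be attributed to
// the job's user is wrapped in Subject.doAs so it executes with that user's
// security context. Hadoop's UserGroupInformation.doAs follows the same shape.
public class DoAsSketch {
    public static String runAs(Subject jobUser) {
        // In the real fix, this action would call FileSystem.get(conf) and
        // clean up the staging directory; a string stands in for that here.
        return Subject.doAs(jobUser,
                (PrivilegedAction<String>) () -> "cleaned staging dir as job user");
    }
}
```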
[jira] [Updated] (MAPREDUCE-5530) Binary and source incompatibility in mapred.lib.CombineFileInputFormat between branch-1 and branch-2
[ https://issues.apache.org/jira/browse/MAPREDUCE-5530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated MAPREDUCE-5530: - Resolution: Fixed Fix Version/s: 2.1.2-beta Status: Resolved (was: Patch Available) I just committed this. Thanks [~rkanter]! Also, thanks to [~sandyr] and [~zjshen] for reviews and feedback. Binary and source incompatibility in mapred.lib.CombineFileInputFormat between branch-1 and branch-2 Key: MAPREDUCE-5530 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5530 Project: Hadoop Map/Reduce Issue Type: Sub-task Components: mrv1, mrv2 Affects Versions: 2.1.1-beta Reporter: Robert Kanter Assignee: Robert Kanter Priority: Blocker Fix For: 2.1.2-beta Attachments: MAPREDUCE-5530.patch, MAPREDUCE-5530.patch, MAPREDUCE-5530.patch, MAPREDUCE-5530.patch {{mapred.lib.CombineFileInputFormat}} in branch-1 has this method: {code:java} protected boolean isSplitable(FileSystem fs, Path file) {code} In branch-2, {{mapred.lib.CombineFileInputFormat}} is now a subclass of {{mapreduce.lib.input.CombineFileInputFormat}}, from which it inherits the similar method: {code:java} protected boolean isSplitable(JobContext context, Path file) {code} This means that any code that subclasses {{mapred.lib.CombineFileInputFormat}} and does not provide its own implementation of {{protected boolean isSplitable(FileSystem fs, Path file)}} will not be binary or source compatible if it tries to call {{isSplitable}} with a {{FileSystem}} argument anywhere (that is, if compiled against branch-1, it will throw a {{NoSuchMethodError}} if run against branch-2; also, it won't even compile against branch-2). -- This message was sent by Atlassian JIRA (v6.1#6144)
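The shape of the compatibility fix for MAPREDUCE-5530 can be sketched with plain stand-in classes: the branch-2 parent only offers the JobContext-style signature, so the subclass re-adds the branch-1 signature as a thin delegate, letting callers compiled against either signature link. FsStub and CtxStub below are hypothetical stand-ins for FileSystem and JobContext; this is illustrative, not the committed patch:

```java
// Stand-in sketch: restoring the old-signature overload on the subclass keeps
// branch-1-compiled callers from hitting NoSuchMethodError on branch-2.
public class SplitableShim {
    static class FsStub {}   // stands in for org.apache.hadoop.fs.FileSystem
    static class CtxStub {}  // stands in for mapreduce JobContext
    static class NewCombineFormat {            // ~ mapreduce.lib.input.CombineFileInputFormat
        protected boolean isSplitable(CtxStub context, String file) { return false; }
    }
    static class OldCombineFormat extends NewCombineFormat {  // ~ mapred.lib.CombineFileInputFormat
        // Restored branch-1-style signature, delegating to the inherited one
        protected boolean isSplitable(FsStub fs, String file) {
            return isSplitable(new CtxStub(), file);
        }
    }
    public static boolean oldSignatureStillLinks() {
        // Calling through the old signature works and agrees with the new one
        return !new OldCombineFormat().isSplitable(new FsStub(), "part-00000");
    }
}
```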
[jira] [Updated] (MAPREDUCE-5459) Update the doc of running MRv1 examples jar on YARN
[ https://issues.apache.org/jira/browse/MAPREDUCE-5459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated MAPREDUCE-5459: - Resolution: Fixed Fix Version/s: 2.1.2-beta Status: Resolved (was: Patch Available) I just committed this. Thanks [~zjshen]! Update the doc of running MRv1 examples jar on YARN --- Key: MAPREDUCE-5459 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5459 Project: Hadoop Map/Reduce Issue Type: Bug Reporter: Zhijie Shen Assignee: Zhijie Shen Fix For: 2.1.2-beta Attachments: MAPREDUCE-5459.1.patch In addition to adding two env vars: HADOOP_USER_CLASSPATH_FIRST and HADOOP_CLASSPATH, we still need to add {code} <property> <name>mapreduce.job.user.classpath.first</name> <value>true</value> </property> {code} in mapred-site.xml to make sure that the MRv1 examples jar runs correctly on YARN. Some examples will use Java reflection to find the classes in the examples jar dynamically when they are running. With this configuration, the MRv1 examples jar will appear before the MRv2 examples jar in CLASSPATH of the processes in YARN containers. Therefore, the classes found via reflection will be picked from MRv1 examples jar instead of MRv2 examples jar as well. MapReduce_Compatibility_Hadoop1_Hadoop2.apt.vm needs to be updated to document this. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (MAPREDUCE-5530) Binary and source incompatibility in mapred.lib.CombineFileInputFormat between branch-1 and branch-2
[ https://issues.apache.org/jira/browse/MAPREDUCE-5530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13783099#comment-13783099 ] Arun C Murthy commented on MAPREDUCE-5530: -- [~rkanter]: did my comments make sense? Is there a chance you can quickly incorporate them so we can resolve this last MR issue? Thanks! Binary and source incompatibility in mapred.lib.CombineFileInputFormat between branch-1 and branch-2 Key: MAPREDUCE-5530 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5530 Project: Hadoop Map/Reduce Issue Type: Sub-task Components: mrv1, mrv2 Affects Versions: 2.1.1-beta Reporter: Robert Kanter Assignee: Robert Kanter Priority: Blocker Attachments: MAPREDUCE-5530.patch, MAPREDUCE-5530.patch {{mapred.lib.CombineFileInputFormat}} in branch-1 has this method: {code:java} protected boolean isSplitable(FileSystem fs, Path file) {code} In branch-2, {{mapred.lib.CombineFileInputFormat}} is now a subclass of {{mapreduce.lib.input.CombineFileInputFormat}}, from which it inherits the similar method: {code:java} protected boolean isSplitable(JobContext context, Path file) {code} This means that any code that subclasses {{mapred.lib.CombineFileInputFormat}} and does not provide its own implementation of {{protected boolean isSplitable(FileSystem fs, Path file)}} will not be binary or source compatible if it tries to call {{isSplitable}} with a {{FileSystem}} argument anywhere (that is, if compiled against branch-1, it will throw a {{NoSuchMethodError}} if run against branch-2; also, it won't even compile against branch-2). -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (MAPREDUCE-5536) mapreduce.jobhistory.webapp.https.address property is not respected
[ https://issues.apache.org/jira/browse/MAPREDUCE-5536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated MAPREDUCE-5536: - Target Version/s: 2.1.2-beta mapreduce.jobhistory.webapp.https.address property is not respected --- Key: MAPREDUCE-5536 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5536 Project: Hadoop Map/Reduce Issue Type: Bug Affects Versions: 2.1.1-beta Reporter: Yesha Vora Assignee: Omkar Vinit Joshi Priority: Blocker Attachments: MAPREDUCE-5536.20131027.1.patch, MAPREDUCE-5536.20131030.1.patch, MAPREDUCE-5536.20131030.1.patch, MAPREDUCE-5536.20131030.3.patch, YARN-1240.20131025.1.patch The jobhistory server starts on port defined by mapreduce.jobhistory.webapp.address property instead mapreduce.jobhistory.webapp.https.address when hadoop.ssl.enabled=true. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (MAPREDUCE-5530) Binary and source incompatibility in mapred.lib.CombineFileInputFormat between branch-1 and branch-2
[ https://issues.apache.org/jira/browse/MAPREDUCE-5530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13783447#comment-13783447 ] Arun C Murthy commented on MAPREDUCE-5530: -- [~rkanter] People *really* should NOT be overriding mapred.lib.CombineFileInputFormat.isSplitable(JobContext, Path) because JobContext does not exist in the mapred api... hence we really should mark it as @Private. Binary and source incompatibility in mapred.lib.CombineFileInputFormat between branch-1 and branch-2 Key: MAPREDUCE-5530 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5530 Project: Hadoop Map/Reduce Issue Type: Sub-task Components: mrv1, mrv2 Affects Versions: 2.1.1-beta Reporter: Robert Kanter Assignee: Robert Kanter Priority: Blocker Attachments: MAPREDUCE-5530.patch, MAPREDUCE-5530.patch, MAPREDUCE-5530.patch {{mapred.lib.CombineFileInputFormat}} in branch-1 has this method: {code:java} protected boolean isSplitable(FileSystem fs, Path file) {code} In branch-2, {{mapred.lib.CombineFileInputFormat}} is now a subclass of {{mapreduce.lib.input.CombineFileInputFormat}}, from which it inherits the similar method: {code:java} protected boolean isSplitable(JobContext context, Path file) {code} This means that any code that subclasses {{mapred.lib.CombineFileInputFormat}} and does not provide its own implementation of {{protected boolean isSplitable(FileSystem fs, Path file)}} will not be binary or source compatible if it tries to call {{isSplitable}} with a {{FileSystem}} argument anywhere (that is, if compiled against branch-1, it will throw a {{NoSuchMethodError}} if run against branch-2; also, it won't even compile against branch-2). -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (MAPREDUCE-5530) Binary and source incompatibility in mapred.lib.CombineFileInputFormat between branch-1 and branch-2
[ https://issues.apache.org/jira/browse/MAPREDUCE-5530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated MAPREDUCE-5530: - Status: Patch Available (was: Open) Thanks [~rkanter]. Submitting patch for hudson on your behalf so that we can commit this asap. Binary and source incompatibility in mapred.lib.CombineFileInputFormat between branch-1 and branch-2 Key: MAPREDUCE-5530 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5530 Project: Hadoop Map/Reduce Issue Type: Sub-task Components: mrv1, mrv2 Affects Versions: 2.1.1-beta Reporter: Robert Kanter Assignee: Robert Kanter Priority: Blocker Attachments: MAPREDUCE-5530.patch, MAPREDUCE-5530.patch, MAPREDUCE-5530.patch, MAPREDUCE-5530.patch {{mapred.lib.CombineFileInputFormat}} in branch-1 has this method: {code:java} protected boolean isSplitable(FileSystem fs, Path file) {code} In branch-2, {{mapred.lib.CombineFileInputFormat}} is now a subclass of {{mapreduce.lib.input.CombineFileInputFormat}}, from which it inherits the similar method: {code:java} protected boolean isSplitable(JobContext context, Path file) {code} This means that any code that subclasses {{mapred.lib.CombineFileInputFormat}} and does not provide its own implementation of {{protected boolean isSplitable(FileSystem fs, Path file)}} will not be binary or source compatible if it tries to call {{isSplitable}} with a {{FileSystem}} argument anywhere (that is, if compiled against branch-1, it will throw a {{NoSuchMethodError}} if run against branch-2; also, it won't even compile against branch-2). -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (MAPREDUCE-5551) Binary Incompatibility of O.A.H.U.mapred.SequenceFileAsBinaryOutputFormat.WritableValueBytes
[ https://issues.apache.org/jira/browse/MAPREDUCE-5551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated MAPREDUCE-5551: - Resolution: Fixed Fix Version/s: 2.1.2-beta Status: Resolved (was: Patch Available) I just committed this. Thanks [~zjshen]! Binary Incompatibility of O.A.H.U.mapred.SequenceFileAsBinaryOutputFormat.WritableValueBytes Key: MAPREDUCE-5551 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5551 Project: Hadoop Map/Reduce Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Blocker Fix For: 2.1.2-beta Attachments: MAPREDUCE-5551.1.patch The non-default constructor is moved to the super class, but it cannot be inherited. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (MAPREDUCE-5530) Binary and source incompatibility in mapred.lib.CombineFileInputFormat between branch-1 and branch-2
[ https://issues.apache.org/jira/browse/MAPREDUCE-5530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated MAPREDUCE-5530: - Status: Open (was: Patch Available) Binary and source incompatibility in mapred.lib.CombineFileInputFormat between branch-1 and branch-2 Key: MAPREDUCE-5530 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5530 Project: Hadoop Map/Reduce Issue Type: Sub-task Components: mrv1, mrv2 Affects Versions: 2.1.1-beta Reporter: Robert Kanter Assignee: Robert Kanter Priority: Blocker Attachments: MAPREDUCE-5530.patch, MAPREDUCE-5530.patch {{mapred.lib.CombineFileInputFormat}} in branch-1 has this method: {code:java} protected boolean isSplitable(FileSystem fs, Path file) {code} In branch-2, {{mapred.lib.CombineFileInputFormat}} is now a subclass of {{mapreduce.lib.input.CombineFileInputFormat}}, from which it inherits the similar method: {code:java} protected boolean isSplitable(JobContext context, Path file) {code} This means that any code that subclasses {{mapred.lib.CombineFileInputFormat}} and does not provide its own implementation of {{protected boolean isSplitable(FileSystem fs, Path file)}} will not be binary or source compatible if it tries to call {{isSplitable}} with a {{FileSystem}} argument anywhere (that is, if compiled against branch-1, it will throw a {{NoSuchMethodError}} if run against branch-2; also, it won't even compile against branch-2). -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (MAPREDUCE-5530) Binary and source incompatibility in mapred.lib.CombineFileInputFormat between branch-1 and branch-2
[ https://issues.apache.org/jira/browse/MAPREDUCE-5530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13782567#comment-13782567 ] Arun C Murthy commented on MAPREDUCE-5530: -- [~rkanter] I agree the patch isn't perfect, but that's the only option we have now... though I wonder if we should just document this and not fix it... thoughts? IAC, a couple of comments: # Please mark org.apache.hadoop.mapred.lib.CombineFileInputFormat.isSplitable(JobContext, Path) as @Private # Also add some documentation to the method with a pointer to this jira explaining why end-users should never override org.apache.hadoop.mapred.lib.CombineFileInputFormat.isSplitable(JobContext, Path). Thanks. Binary and source incompatibility in mapred.lib.CombineFileInputFormat between branch-1 and branch-2 Key: MAPREDUCE-5530 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5530 Project: Hadoop Map/Reduce Issue Type: Sub-task Components: mrv1, mrv2 Affects Versions: 2.1.1-beta Reporter: Robert Kanter Assignee: Robert Kanter Priority: Blocker Attachments: MAPREDUCE-5530.patch, MAPREDUCE-5530.patch {{mapred.lib.CombineFileInputFormat}} in branch-1 has this method: {code:java} protected boolean isSplitable(FileSystem fs, Path file) {code} In branch-2, {{mapred.lib.CombineFileInputFormat}} is now a subclass of {{mapreduce.lib.input.CombineFileInputFormat}}, from which it inherits the similar method: {code:java} protected boolean isSplitable(JobContext context, Path file) {code} This means that any code that subclasses {{mapred.lib.CombineFileInputFormat}} and does not provide its own implementation of {{protected boolean isSplitable(FileSystem fs, Path file)}} will not be binary or source compatible if it tries to call {{isSplitable}} with a {{FileSystem}} argument anywhere (that is, if compiled against branch-1, it will throw a {{NoSuchMethodError}} if run against branch-2; also, it won't even compile against branch-2). 
-- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (MAPREDUCE-5531) Binary and source incompatibility in mapreduce.TaskID and mapreduce.TaskAttemptID between branch-1 and branch-2
[ https://issues.apache.org/jira/browse/MAPREDUCE-5531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated MAPREDUCE-5531: - Resolution: Fixed Fix Version/s: 2.1.2-beta Status: Resolved (was: Patch Available) +1 from me too. I just committed this. Thanks [~rkanter]! Also, [~zjshen] thanks for the review. Binary and source incompatibility in mapreduce.TaskID and mapreduce.TaskAttemptID between branch-1 and branch-2 --- Key: MAPREDUCE-5531 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5531 Project: Hadoop Map/Reduce Issue Type: Sub-task Components: mrv1, mrv2 Affects Versions: 2.1.1-beta Reporter: Robert Kanter Assignee: Robert Kanter Priority: Blocker Fix For: 2.1.2-beta Attachments: MAPREDUCE-5531.patch, MAPREDUCE-5531.patch {{mapreduce.TaskID}} in branch-1 has these two constructors: {code:java} public TaskID(JobID jobId, boolean isMap, int id) public TaskID(String jtIdentifier, int jobId, boolean isMap, int id) {code} In branch-2, {{mapreduce.TaskID}} no longer has either of the above two constructors. Also, {{mapreduce.TaskAttemptID}} in branch-1 has this constructor: {code:java} public TaskAttemptID(String jtIdentifier, int jobId, boolean isMap, int taskId, int id) {code} In branch-2, {{mapreduce.TaskAttemptID}} no longer has this constructor. It looks like these constructors were probably removed because the {{boolean isMap}} was replaced by an enum, {{TaskType}}. This means that any code that tries to use any of those constructors will not be binary or source compatible (in fact, the missing {{TaskAttemptID}} constructor calls one of the missing {{TaskID}} constructors). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
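The constructor compatibility described above boils down to mapping the old boolean flag onto the TaskType enum: the dropped constructors can be restored as thin wrappers that translate isMap into the enum and delegate. A stand-in sketch (names follow mapreduce.TaskID loosely; this is illustrative, not the committed patch):

```java
// Sketch of the restored branch-1-style constructor: the boolean isMap is
// translated to the TaskType enum that replaced it, then delegated onward.
public class TaskIdSketch {
    enum TaskType { MAP, REDUCE }
    static class TaskID {
        final TaskType type;
        final int id;
        TaskID(TaskType type, int id) { this.type = type; this.id = id; }
        // Restored compatibility constructor: boolean flag -> enum
        TaskID(boolean isMap, int id) {
            this(isMap ? TaskType.MAP : TaskType.REDUCE, id);
        }
    }
    public static TaskType typeOf(boolean isMap) {
        return new TaskID(isMap, 0).type;
    }
}
```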
[jira] [Updated] (MAPREDUCE-5529) Binary incompatibilities in mapred.lib.TotalOrderPartitioner between branch-1 and branch-2
[ https://issues.apache.org/jira/browse/MAPREDUCE-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated MAPREDUCE-5529: - Attachment: MAPREDUCE-5529_branch_2.patch Sigh, I needed to modify the branch-2 patch since MAPREDUCE-4594 is only on trunk. [~qwertymaniac]: in future, please don't commit simple enhancements such as MAPREDUCE-4594 only to trunk. Thanks. Binary incompatibilities in mapred.lib.TotalOrderPartitioner between branch-1 and branch-2 -- Key: MAPREDUCE-5529 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5529 Project: Hadoop Map/Reduce Issue Type: Sub-task Components: mrv1, mrv2 Affects Versions: 2.1.1-beta Reporter: Robert Kanter Assignee: Robert Kanter Priority: Blocker Attachments: MAPREDUCE-5529_branch_2.patch, MAPREDUCE-5529.patch, MAPREDUCE-5529.patch {{mapred.lib.TotalOrderPartitioner}} in branch-1 has these two methods: {code:java} public static String getPartitionFile(JobConf job) public static void setPartitionFile(JobConf job, Path p) {code} In branch-2, {{mapred.lib.TotalOrderPartitioner}} is now a subclass of {{mapreduce.lib.partition.TotalOrderPartitioner}}, from which it inherits the similar methods: {code:java} public static String getPartitionFile(Configuration conf) public static void setPartitionFile(Configuration conf, Path p) {code} This means that any code that does either of the following: {code:java} TotalOrderPartitioner.setPartitionFile(new JobConf(), new Path("/")); String str = TotalOrderPartitioner.getPartitionFile(new JobConf()); {code} will not be binary compatible (that is, if compiled against branch-1, it will throw a {{NoSuchMethodError}} if run against branch-2). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-5529) Binary incompatibilities in mapred.lib.TotalOrderPartitioner between branch-1 and branch-2
[ https://issues.apache.org/jira/browse/MAPREDUCE-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated MAPREDUCE-5529: - Resolution: Fixed Fix Version/s: 2.1.2-beta Status: Resolved (was: Patch Available) +1. I just committed this, thanks [~rkanter]! Thanks for the reviews [~zjshen]! Binary incompatibilities in mapred.lib.TotalOrderPartitioner between branch-1 and branch-2 -- Key: MAPREDUCE-5529 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5529 Project: Hadoop Map/Reduce Issue Type: Sub-task Components: mrv1, mrv2 Affects Versions: 2.1.1-beta Reporter: Robert Kanter Assignee: Robert Kanter Priority: Blocker Fix For: 2.1.2-beta Attachments: MAPREDUCE-5529_branch_2.patch, MAPREDUCE-5529.patch, MAPREDUCE-5529.patch {{mapred.lib.TotalOrderPartitioner}} in branch-1 has these two methods: {code:java} public static String getPartitionFile(JobConf job) public static void setPartitionFile(JobConf job, Path p) {code} In branch-2, {{mapred.lib.TotalOrderPartitioner}} is now a subclass of {{mapreduce.lib.partition.TotalOrderPartitioner}}, from which it inherits the similar methods: {code:java} public static String getPartitionFile(Configuration conf) public static void setPartitionFile(Configuration conf, Path p) {code} This means that any code that does either of the following: {code:java} TotalOrderPartitioner.setPartitionFile(new JobConf(), new Path("/")); String str = TotalOrderPartitioner.getPartitionFile(new JobConf()); {code} will not be binary compatible (that is, if compiled against branch-1, it will throw a {{NoSuchMethodError}} if run against branch-2). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
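Why MAPREDUCE-5529 bites specifically on static methods: static calls are resolved at compile time against an exact descriptor, so code compiled against the branch-1 JobConf signature links only to that signature. The compatible fix re-adds JobConf-typed overloads that delegate to the Configuration-typed parent methods. A stand-in sketch with hypothetical ConfStub/JobConfStub in place of Configuration/JobConf:

```java
// Sketch of the static-overload shim: the subclass re-declares the
// branch-1-typed overload and delegates to the inherited, more general one.
public class PartitionerShim {
    static class ConfStub { String partitionFile = "_partition.lst"; }
    static class JobConfStub extends ConfStub {}
    static class NewPartitioner {      // ~ mapreduce.lib.partition.TotalOrderPartitioner
        public static String getPartitionFile(ConfStub conf) {
            return conf.partitionFile;
        }
    }
    static class OldPartitioner extends NewPartitioner {  // ~ mapred.lib.TotalOrderPartitioner
        // Restored branch-1 signature; overload resolution picks this for
        // JobConfStub arguments, and it delegates to the parent method.
        public static String getPartitionFile(JobConfStub job) {
            return NewPartitioner.getPartitionFile((ConfStub) job);
        }
    }
    public static String lookup() {
        return OldPartitioner.getPartitionFile(new JobConfStub());
    }
}
```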
[jira] [Updated] (MAPREDUCE-5514) TestRMContainerAllocator fails on trunk
[ https://issues.apache.org/jira/browse/MAPREDUCE-5514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated MAPREDUCE-5514: - Resolution: Fixed Fix Version/s: 2.3.0 Status: Resolved (was: Patch Available) TestRMContainerAllocator fails on trunk --- Key: MAPREDUCE-5514 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5514 Project: Hadoop Map/Reduce Issue Type: Bug Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Blocker Fix For: 2.3.0 Attachments: MAPREDUCE-5514.1.patch, MAPREDUCE-5514.2.patch, MAPREDUCE-5514.3.patch, org.apache.hadoop.mapreduce.v2.app.TestRMContainerAllocator-output.txt -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-5514) TestRMContainerAllocator fails on trunk
[ https://issues.apache.org/jira/browse/MAPREDUCE-5514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13779419#comment-13779419 ] Arun C Murthy commented on MAPREDUCE-5514: -- I just committed this. Thanks [~zjshen]! TestRMContainerAllocator fails on trunk --- Key: MAPREDUCE-5514 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5514 Project: Hadoop Map/Reduce Issue Type: Bug Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Blocker Fix For: 2.3.0 Attachments: MAPREDUCE-5514.1.patch, MAPREDUCE-5514.2.patch, MAPREDUCE-5514.3.patch, org.apache.hadoop.mapreduce.v2.app.TestRMContainerAllocator-output.txt -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-5529) Binary incompatibilities in mapred.lib.TotalOrderPartitioner between branch-1 and branch-2
[ https://issues.apache.org/jira/browse/MAPREDUCE-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated MAPREDUCE-5529: - Target Version/s: 2.1.2-beta Binary incompatibilities in mapred.lib.TotalOrderPartitioner between branch-1 and branch-2 -- Key: MAPREDUCE-5529 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5529 Project: Hadoop Map/Reduce Issue Type: Sub-task Components: mrv1, mrv2 Affects Versions: 2.1.1-beta Reporter: Robert Kanter Assignee: Robert Kanter Priority: Blocker Attachments: MAPREDUCE-5529.patch {{mapred.lib.TotalOrderPartitioner}} in branch-1 has these two methods: {code:java} public static String getPartitionFile(JobConf job) public static void setPartitionFile(JobConf job, Path p) {code} In branch-2, {{mapred.lib.TotalOrderPartitioner}} is now a subclass of {{mapred.lib.TotalOrderPartitioner}}, from which it inherits the similar methods: {code:java} public static String getPartitionFile(Configuration conf) public static void setPartitionFile(Configuration conf, Path p) {code} This means that any code that does either of the following: {code:java} TotalOrderPartitioner.setPartitionFile(new JobConf(), new Path(/)); String str = TotalOrderPartitioner.getPartitionFile(new JobConf()); {code} will not be binary compatible (that is, if compiled against branch-1, it will throw a {{NoSuchMethodError}} if run against branch-2). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-5503) TestMRJobClient.testJobClient is failing
[ https://issues.apache.org/jira/browse/MAPREDUCE-5503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated MAPREDUCE-5503: - Target Version/s: 3.0.0, 2.1.2-beta (was: 3.0.0, 2.1.1-beta) TestMRJobClient.testJobClient is failing Key: MAPREDUCE-5503 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5503 Project: Hadoop Map/Reduce Issue Type: Bug Components: mrv2 Affects Versions: 3.0.0, 2.1.0-beta Reporter: Jason Lowe Assignee: Jian He Priority: Blocker Attachments: MAPREDUCE-5503.patch, MAPREDUCE-5503.patch TestMRJobClient.testJobClient is failing on trunk and causing precommit builds to complain: {noformat} testJobClient(org.apache.hadoop.mapreduce.TestMRJobClient) Time elapsed: 26.361 sec FAILURE! junit.framework.AssertionFailedError: expected:<1> but was:<0> at junit.framework.Assert.fail(Assert.java:50) at junit.framework.Assert.failNotEquals(Assert.java:287) at junit.framework.Assert.assertEquals(Assert.java:67) at junit.framework.Assert.assertEquals(Assert.java:199) at junit.framework.Assert.assertEquals(Assert.java:205) at org.apache.hadoop.mapreduce.TestMRJobClient.testJobList(TestMRJobClient.java:474) at org.apache.hadoop.mapreduce.TestMRJobClient.testJobClient(TestMRJobClient.java:112) {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-5448) MapFileOutputFormat#getReaders bug with invisible files/folders
[ https://issues.apache.org/jira/browse/MAPREDUCE-5448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated MAPREDUCE-5448: - Status: Patch Available (was: Open) MapFileOutputFormat#getReaders bug with invisible files/folders --- Key: MAPREDUCE-5448 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5448 Project: Hadoop Map/Reduce Issue Type: Bug Reporter: Maysam Yabandeh Assignee: Maysam Yabandeh Priority: Minor Attachments: MAPREDUCE-5448.patch Original Estimate: 1h Remaining Estimate: 1h MapReduce jobs also produce some invisible files such as _SUCCESS, even when the output format is MapFileOutputFormat. MapFileOutputFormat#getReaders however reads the entire content of the job output, assuming that they are MapFiles. {code} Path[] names = FileUtil.stat2Paths(fs.listStatus(dir)); {code} It should use a filter to skip the files that start with "." or "_". -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
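The filter suggested above reduces to a simple predicate: skip entries whose names start with "." or "_" (e.g. _SUCCESS, _logs) before treating the rest as MapFiles. In Hadoop this would be a PathFilter passed to fs.listStatus; the sketch below applies the same predicate to plain strings so it stays self-contained:

```java
import java.util.List;
import java.util.stream.Collectors;

// Hidden-output filter: MapReduce marker files begin with "_" or ".", so a
// name-prefix check is enough to keep only real output parts.
public class HiddenFileFilter {
    public static boolean isVisible(String name) {
        return !name.startsWith("_") && !name.startsWith(".");
    }
    public static List<String> visibleOutputs(List<String> names) {
        // Equivalent in spirit to listStatus(dir, hiddenFileFilter)
        return names.stream()
                    .filter(HiddenFileFilter::isVisible)
                    .collect(Collectors.toList());
    }
}
```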
[jira] [Updated] (MAPREDUCE-5493) In-memory map outputs can be leaked after shuffle completes
[ https://issues.apache.org/jira/browse/MAPREDUCE-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated MAPREDUCE-5493: - Resolution: Fixed Fix Version/s: 2.1.1-beta Status: Resolved (was: Patch Available) +1, looks good - thanks for tracking this down [~jlowe]! I just committed this. In-memory map outputs can be leaked after shuffle completes --- Key: MAPREDUCE-5493 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5493 Project: Hadoop Map/Reduce Issue Type: Bug Components: mrv2 Affects Versions: 2.1.0-beta, 0.23.9 Reporter: Jason Lowe Assignee: Jason Lowe Priority: Blocker Fix For: 2.1.1-beta Attachments: MAPREDUCE-5493.patch MergeManagerImpl#close adds the contents of inMemoryMergedMapOutputs and inMemoryMapOutputs to a list of map outputs that is subsequently processed, but it does not clear those sets. This prevents some of the map outputs from being garbage collected and significantly reduces the memory available for the subsequent reduce phase. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
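The leak described in MAPREDUCE-5493 has a textbook shape: close() copies the in-memory output sets into a work list but never clears the sets, so they keep strong references to every map output for the rest of the reduce. A simplified stand-in for MergeManagerImpl#close (strings in place of map outputs; not the actual Hadoop source):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Sketch of the fix: after draining the sets into the final-merge list,
// clear them so the map outputs become eligible for garbage collection.
public class MergeCloseSketch {
    public static List<String> close(Set<String> inMemoryMapOutputs,
                                     Set<String> inMemoryMergedMapOutputs) {
        List<String> finalMerge = new ArrayList<>();
        finalMerge.addAll(inMemoryMapOutputs);
        finalMerge.addAll(inMemoryMergedMapOutputs);
        // The fix: drop the lingering references held by the sets
        inMemoryMapOutputs.clear();
        inMemoryMergedMapOutputs.clear();
        return finalMerge;
    }
}
```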
[jira] [Updated] (MAPREDUCE-5480) TestJSHSecurity.testDelegationToken is breaking after YARN-1085
[ https://issues.apache.org/jira/browse/MAPREDUCE-5480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated MAPREDUCE-5480: - Resolution: Duplicate Status: Resolved (was: Patch Available) Fixed via YARN-1085 TestJSHSecurity.testDelegationToken is breaking after YARN-1085 --- Key: MAPREDUCE-5480 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5480 Project: Hadoop Map/Reduce Issue Type: Bug Affects Versions: 2.1.1-beta Reporter: Vinod Kumar Vavilapalli Assignee: Omkar Vinit Joshi Priority: Blocker Attachments: MAPREDUCE-5480.20130824.1.patch See https://builds.apache.org/job/PreCommit-YARN-Build/1755//testReport/.
{code}
org.apache.hadoop.yarn.webapp.WebAppException: Error starting http server
	at org.apache.hadoop.yarn.webapp.WebApps$Builder.start(WebApps.java:251)
	at org.apache.hadoop.mapreduce.v2.hs.HistoryClientService.initializeWebApp(HistoryClientService.java:152)
---
Caused by: javax.servlet.ServletException: javax.servlet.ServletException: Principal not defined in configuration
	at org.apache.hadoop.security.authentication.server.KerberosAuthenticationHandler.init(KerberosAuthenticationHandler.java:203)
	at org.apache.hadoop.security.authentication.server.AuthenticationFilter.init(AuthenticationFilter.java:146)
---
Caused by: javax.servlet.ServletException: Principal not defined in configuration
	at org.apache.hadoop.security.authentication.server.KerberosAuthenticationHandler.init(KerberosAuthenticationHandler.java:164)
	... 53 more
{code}
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-5439) mapred-default.xml has missing properties
[ https://issues.apache.org/jira/browse/MAPREDUCE-5439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated MAPREDUCE-5439: - Fix Version/s: (was: 2.1.0-beta) 2.3.0 mapred-default.xml has missing properties - Key: MAPREDUCE-5439 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5439 Project: Hadoop Map/Reduce Issue Type: Bug Components: mrv2 Affects Versions: 2.1.0-beta Reporter: Siddharth Wagle Fix For: 2.3.0 Properties that need to be added: mapreduce.map.memory.mb mapreduce.map.java.opts mapreduce.reduce.memory.mb mapreduce.reduce.java.opts Properties that need to be fixed: mapred.child.java.opts should not be in mapred-default. yarn.app.mapreduce.am.command-opts description needs fixing -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-4868) Allow multiple iteration for map
[ https://issues.apache.org/jira/browse/MAPREDUCE-4868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated MAPREDUCE-4868: - Fix Version/s: (was: 2.1.0-beta) 2.3.0 Allow multiple iteration for map Key: MAPREDUCE-4868 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4868 Project: Hadoop Map/Reduce Issue Type: Improvement Components: mrv2 Affects Versions: 3.0.0, 2.0.3-alpha Reporter: Jerry Chen Fix For: 3.0.0, 2.3.0 Original Estimate: 168h Remaining Estimate: 168h Currently, the Mapper class allows advanced users to override the public void run(Context context) method for more control over the execution of the mapper, while the Context interface limits the operations over the data, which is the foundation of that control. One use case: when considering a Hive optimization problem, I want to make two passes over the input data instead of using another job or task (which may slow down the whole process). Each pass does the same thing but with different parameters. This is a new paradigm of Map/Reduce usage and can be achieved easily by extending the Context interface a little with more control over the data, such as resetting the input. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-5036) Default shuffle handler port should not be 8080
[ https://issues.apache.org/jira/browse/MAPREDUCE-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated MAPREDUCE-5036: - Fix Version/s: (was: 2.1.0-beta) 2.3.0 Default shuffle handler port should not be 8080 --- Key: MAPREDUCE-5036 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5036 Project: Hadoop Map/Reduce Issue Type: Improvement Affects Versions: 2.0.3-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza Fix For: 2.3.0 Attachments: MAPREDUCE-5036-13562.patch, MAPREDUCE-5036-2.patch, MAPREDUCE-5036.patch The shuffle handler port (mapreduce.shuffle.port) defaults to 8080. This is a pretty common port for web services, and is likely to cause unnecessary port conflicts. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
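Moving the shuffle handler off the conflict-prone default is a one-property change. The port value below is taken from the attachment name (MAPREDUCE-5036-13562.patch) and used here illustratively; it is not necessarily the value the project finally committed.

```xml
<!-- mapred-site.xml fragment: move the shuffle handler off the common
     web-service port 8080. 13562 is illustrative, per the patch name. -->
<property>
  <name>mapreduce.shuffle.port</name>
  <value>13562</value>
</property>
```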
[jira] [Updated] (MAPREDUCE-5028) Maps fail when io.sort.mb is set to high value
[ https://issues.apache.org/jira/browse/MAPREDUCE-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated MAPREDUCE-5028: - Fix Version/s: (was: 2.1.0-beta) 2.3.0 Maps fail when io.sort.mb is set to high value -- Key: MAPREDUCE-5028 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5028 Project: Hadoop Map/Reduce Issue Type: Bug Affects Versions: 1.1.1, 2.0.3-alpha, 0.23.5 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Critical Fix For: 1.2.0, 2.3.0 Attachments: mr-5028-branch1.patch, mr-5028-branch1.patch, mr-5028-branch1.patch, MR-5028_testapp.patch, mr-5028-trunk.patch, mr-5028-trunk.patch, mr-5028-trunk.patch, repro-mr-5028.patch Verified the problem exists on branch-1 with the following configuration: Pseudo-dist mode: 2 maps/ 1 reduce, mapred.child.java.opts=-Xmx2048m, io.sort.mb=1280, dfs.block.size=2147483648 Run teragen to generate 4 GB data Maps fail when you run wordcount on this configuration with the following error:
{noformat}
java.io.IOException: Spill failed
	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1031)
	at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:692)
	at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
	at org.apache.hadoop.examples.WordCount$TokenizerMapper.map(WordCount.java:45)
	at org.apache.hadoop.examples.WordCount$TokenizerMapper.map(WordCount.java:34)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:766)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
	at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.io.EOFException
	at java.io.DataInputStream.readInt(DataInputStream.java:375)
	at org.apache.hadoop.io.IntWritable.readFields(IntWritable.java:38)
	at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
	at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
	at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:116)
	at org.apache.hadoop.mapreduce.ReduceContext.nextKey(ReduceContext.java:92)
	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:175)
	at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1505)
	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1438)
	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:855)
	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1346)
{noformat}
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
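One classic failure mode with very large in-memory sort buffers is 32-bit overflow in byte-offset arithmetic. The snippet below is not MapTask's actual code and does not claim to be MAPREDUCE-5028's exact root cause; it only illustrates, under that assumption, how multiplying a megabyte count such as io.sort.mb into a byte offset silently wraps once intermediate products pass 2^31 - 1.

```java
// Illustration of int overflow in buffer-offset arithmetic (hypothetical
// helper, not MapTask code). With sortMb = 1280, one product still fits
// in an int, but doubling it wraps negative before widening to long.
class BufferMathSketch {
    static long byteOffsetWrong(int sortMb, int factor) {
        return sortMb * 1024 * 1024 * factor; // all-int math overflows, then widens
    }
    static long byteOffsetRight(int sortMb, int factor) {
        return (long) sortMb * 1024 * 1024 * factor; // widen before multiplying
    }
}
```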
[jira] [Updated] (MAPREDUCE-4468) Encapsulate FairScheduler preemption logic into helper class
[ https://issues.apache.org/jira/browse/MAPREDUCE-4468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated MAPREDUCE-4468: - Fix Version/s: (was: 2.1.0-beta) 2.3.0 Encapsulate FairScheduler preemption logic into helper class Key: MAPREDUCE-4468 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4468 Project: Hadoop Map/Reduce Issue Type: Improvement Components: scheduler Reporter: Ryan Hennig Priority: Minor Labels: refactoring, scheduler Fix For: 2.3.0 Attachments: MAPREDUCE-4468.patch Original Estimate: 4h Remaining Estimate: 4h I've extracted the preemption logic from the Fair Scheduler into a helper class so that FairScheduler is closer to following the Single Responsibility Principle. This may eventually evolve into a generalized preemption module which could be leveraged by other schedulers. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-5170) incorrect exception message if min node size > min rack size
[ https://issues.apache.org/jira/browse/MAPREDUCE-5170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated MAPREDUCE-5170: - Fix Version/s: (was: 2.1.0-beta) 2.3.0 incorrect exception message if min node size > min rack size Key: MAPREDUCE-5170 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5170 Project: Hadoop Map/Reduce Issue Type: Bug Components: mrv2 Affects Versions: 2.0.3-alpha Reporter: Sangjin Lee Priority: Trivial Fix For: 2.3.0 Attachments: MAPREDUCE-5170.patch The exception message for CombineFileInputFormat if min node size > min rack size is worded backwards. Currently it reads "Minimum split size per node... cannot be smaller than the minimum split size per rack..." It should be "Minimum split size per node... cannot be LARGER than the minimum split size per rack..." -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
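The corrected check and message can be sketched as follows. Method and variable names are illustrative, not CombineFileInputFormat's exact code: the configuration is invalid when the per-node minimum split size exceeds the per-rack minimum, so the message must say "larger", not "smaller".

```java
// Hedged sketch of the MAPREDUCE-5170 fix: reject min-node > min-rack
// with a message that matches the direction of the comparison.
class SplitSizeCheck {
    static void validate(long minSizeNode, long minSizeRack) {
        if (minSizeNode > minSizeRack) {
            throw new IllegalArgumentException(
                "Minimum split size per node " + minSizeNode
                + " cannot be larger than the minimum split size per rack "
                + minSizeRack);
        }
    }
}
```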
[jira] [Updated] (MAPREDUCE-4253) Tests for mapreduce-client-core are lying under mapreduce-client-jobclient
[ https://issues.apache.org/jira/browse/MAPREDUCE-4253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated MAPREDUCE-4253: - Fix Version/s: (was: 2.1.0-beta) 2.3.0 Tests for mapreduce-client-core are lying under mapreduce-client-jobclient -- Key: MAPREDUCE-4253 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4253 Project: Hadoop Map/Reduce Issue Type: Task Components: client Affects Versions: 2.0.0-alpha Reporter: Harsh J Assignee: Tsuyoshi OZAWA Fix For: 2.3.0 Attachments: crossing_project_checker.rb, MR-4253.1.patch, MR-4253.2.patch, result.txt Many of the tests for client libs from mapreduce-client-core are lying under mapreduce-client-jobclient. We should investigate if this is the right thing to do and if not, move the tests back into client-core. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-4462) Enhance readability of TestFairScheduler.java
[ https://issues.apache.org/jira/browse/MAPREDUCE-4462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated MAPREDUCE-4462: - Fix Version/s: (was: 2.1.0-beta) 2.3.0 Enhance readability of TestFairScheduler.java - Key: MAPREDUCE-4462 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4462 Project: Hadoop Map/Reduce Issue Type: Bug Components: scheduler, test Reporter: Ryan Hennig Priority: Minor Labels: comments, test Fix For: 2.3.0 Attachments: MAPREDUCE-4462.patch Original Estimate: 2h Remaining Estimate: 2h While reading over the unit tests for the Fair Scheduler introduced by MAPREDUCE-3451, I added comments to make the logic of the test easier to grok quickly. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-5480) TestJSHSecurity.testDelegationToken is breaking after YARN-1085
[ https://issues.apache.org/jira/browse/MAPREDUCE-5480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated MAPREDUCE-5480: - Priority: Blocker (was: Major) TestJSHSecurity.testDelegationToken is breaking after YARN-1085 --- Key: MAPREDUCE-5480 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5480 Project: Hadoop Map/Reduce Issue Type: Bug Affects Versions: 2.1.1-beta Reporter: Vinod Kumar Vavilapalli Assignee: Vinod Kumar Vavilapalli Priority: Blocker See https://builds.apache.org/job/PreCommit-YARN-Build/1755//testReport/.
{code}
org.apache.hadoop.yarn.webapp.WebAppException: Error starting http server
	at org.apache.hadoop.yarn.webapp.WebApps$Builder.start(WebApps.java:251)
	at org.apache.hadoop.mapreduce.v2.hs.HistoryClientService.initializeWebApp(HistoryClientService.java:152)
---
Caused by: javax.servlet.ServletException: javax.servlet.ServletException: Principal not defined in configuration
	at org.apache.hadoop.security.authentication.server.KerberosAuthenticationHandler.init(KerberosAuthenticationHandler.java:203)
	at org.apache.hadoop.security.authentication.server.AuthenticationFilter.init(AuthenticationFilter.java:146)
---
Caused by: javax.servlet.ServletException: Principal not defined in configuration
	at org.apache.hadoop.security.authentication.server.KerberosAuthenticationHandler.init(KerberosAuthenticationHandler.java:164)
	... 53 more
{code}
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-5475) MRClientService does not verify ACLs properly
[ https://issues.apache.org/jira/browse/MAPREDUCE-5475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13746965#comment-13746965 ] Arun C Murthy commented on MAPREDUCE-5475: -- +1 pending jenkins, lgtm. Thanks [~jlowe]! MRClientService does not verify ACLs properly - Key: MAPREDUCE-5475 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5475 Project: Hadoop Map/Reduce Issue Type: Bug Components: mr-am, mrv2 Affects Versions: 2.0.4-alpha, 0.23.9 Reporter: Jason Lowe Assignee: Jason Lowe Priority: Blocker Attachments: MAPREDUCE-5475.patch When MRClientService receives requests, it calls verifyAndGetJob which does not actually validate that the current user has the proper access. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-5476) Job can fail when RM restarts after staging dir is cleaned but before MR successfully unregister with RM
[ https://issues.apache.org/jira/browse/MAPREDUCE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated MAPREDUCE-5476: - Target Version/s: 2.1.1-beta Job can fail when RM restarts after staging dir is cleaned but before MR successfully unregister with RM Key: MAPREDUCE-5476 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5476 Project: Hadoop Map/Reduce Issue Type: Bug Reporter: Jian He Assignee: Jian He Priority: Blocker Attachments: MAPREDUCE-5476.patch, YARN-917.patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-5476) Job can fail when RM restarts after staging dir is cleaned but before MR successfully unregister with RM
[ https://issues.apache.org/jira/browse/MAPREDUCE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated MAPREDUCE-5476: - Priority: Blocker (was: Major) Job can fail when RM restarts after staging dir is cleaned but before MR successfully unregister with RM Key: MAPREDUCE-5476 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5476 Project: Hadoop Map/Reduce Issue Type: Bug Reporter: Jian He Assignee: Jian He Priority: Blocker Attachments: MAPREDUCE-5476.patch, YARN-917.patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-5468) AM recovery does not work for map only jobs
[ https://issues.apache.org/jira/browse/MAPREDUCE-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated MAPREDUCE-5468: - Resolution: Fixed Fix Version/s: 2.1.1-beta Status: Resolved (was: Patch Available) +1 looks good. Good test too. I just committed this. Thanks [~vinodkv]! AM recovery does not work for map only jobs --- Key: MAPREDUCE-5468 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5468 Project: Hadoop Map/Reduce Issue Type: Bug Reporter: yeshavora Assignee: Vinod Kumar Vavilapalli Priority: Blocker Fix For: 2.1.1-beta Attachments: MAPREDUCE-5468-20130821.txt Map only job (randomwriter, randomtextwriter) restarts from scratch [0% map 0% reduce] after RM restart. It should resume from the last state when AM restarted. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-5466) Historyserver does not refresh the result of restarted jobs after RM restart
[ https://issues.apache.org/jira/browse/MAPREDUCE-5466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated MAPREDUCE-5466: - Priority: Blocker (was: Major) Historyserver does not refresh the result of restarted jobs after RM restart Key: MAPREDUCE-5466 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5466 Project: Hadoop Map/Reduce Issue Type: Bug Reporter: yeshavora Assignee: Jian He Priority: Blocker Attachments: MAPREDUCE-5466.1.patch, MAPREDUCE-5466.patch Restart RM when sort job is running and verify that the job passes successfully after RM restarts. Once the job finishes successfully, run job status command for sort job. It shows Job state =FAILED. Job history server does not update the result for the job which restarted after RM restart. hadoop job -status job_1375923346354_0003 13/08/08 01:24:13 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server Job: job_1375923346354_0003 Job File: hdfs://host1:port1/history/done/2013/08/08/00/job_1375923346354_0003_conf.xml Job Tracking URL : http://historyserver:port2/jobhistory/job/job_1375923346354_0003 Uber job : false Number of maps: 80 Number of reduces: 1 map() completion: 0.0 reduce() completion: 0.0 Job state: FAILED retired: false reason for failure: There are no failed tasks for the job. Job is failed due to some other reason and reason can be found in the logs. Counters not available. Job is retired. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-5468) AM recovery does not work for map only jobs
[ https://issues.apache.org/jira/browse/MAPREDUCE-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated MAPREDUCE-5468: - Priority: Blocker (was: Major) AM recovery does not work for map only jobs --- Key: MAPREDUCE-5468 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5468 Project: Hadoop Map/Reduce Issue Type: Bug Reporter: yeshavora Assignee: Vinod Kumar Vavilapalli Priority: Blocker Map only job (randomwriter, randomtextwriter) restarts from scratch [0% map 0% reduce] after RM restart. It should resume from the last state when AM restarted. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-5466) Historyserver does not refresh the result of restarted jobs after RM restart
[ https://issues.apache.org/jira/browse/MAPREDUCE-5466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated MAPREDUCE-5466: - Target Version/s: 2.1.1-beta (was: 2.1.0-beta) Historyserver does not refresh the result of restarted jobs after RM restart Key: MAPREDUCE-5466 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5466 Project: Hadoop Map/Reduce Issue Type: Bug Reporter: yeshavora Assignee: Jian He Priority: Blocker Attachments: MAPREDUCE-5466.1.patch, MAPREDUCE-5466.patch Restart RM when sort job is running and verify that the job passes successfully after RM restarts. Once the job finishes successfully, run job status command for sort job. It shows Job state =FAILED. Job history server does not update the result for the job which restarted after RM restart. hadoop job -status job_1375923346354_0003 13/08/08 01:24:13 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server Job: job_1375923346354_0003 Job File: hdfs://host1:port1/history/done/2013/08/08/00/job_1375923346354_0003_conf.xml Job Tracking URL : http://historyserver:port2/jobhistory/job/job_1375923346354_0003 Uber job : false Number of maps: 80 Number of reduces: 1 map() completion: 0.0 reduce() completion: 0.0 Job state: FAILED retired: false reason for failure: There are no failed tasks for the job. Job is failed due to some other reason and reason can be found in the logs. Counters not available. Job is retired. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-5311) Replace SLOTS_MILLIS counters with MEM_MILLIS
[ https://issues.apache.org/jira/browse/MAPREDUCE-5311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13739629#comment-13739629 ] Arun C Murthy commented on MAPREDUCE-5311: -- I'm glad we are coming around to understand we can break existing users of SLOT_MILLIS. I wasn't thrilled about using configs and I'm even more leery of adding new configs to track existing configs in RM. This can lead to all sorts of configuration-error hell with mismatch in configs b/w YarnConfiguration and MRConf. I'm against getting into that sort of situations. A much simpler approach is to re-introduce a LimitedPrivate RegisterApplicationMasterResponse.getMinimumResourceCapability as described here (http://bit.ly/18uA8zj). This would revert some part of YARN-787. Replace SLOTS_MILLIS counters with MEM_MILLIS - Key: MAPREDUCE-5311 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5311 Project: Hadoop Map/Reduce Issue Type: Bug Components: applicationmaster Affects Versions: 2.0.4-alpha Reporter: Alejandro Abdelnur Assignee: Sandy Ryza Priority: Blocker Fix For: 2.1.0-beta Attachments: MAPREDUCE-5311-1.patch, MAPREDUCE-5311.patch, MAPREDUCE-5311.patch Per discussion in MAPREDUCE-5310 and comments in the code we should remove all the related logic and just leave the counter constant for backwards compatibility and deprecate the counter constants. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-5311) Replace SLOTS_MILLIS counters with MEM_MILLIS
[ https://issues.apache.org/jira/browse/MAPREDUCE-5311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13740390#comment-13740390 ] Arun C Murthy commented on MAPREDUCE-5311: -- I'll say this again, I'm -1 on adding a new config to track another one in the scheduler. Expecting admins to remember another config for the same value (one for YARN another for MR) is not kosher - our configuration is bad enough. We can use the same one in the scheduler (i.e. don't fix YARN-1004) or we can add back RegisterApplicationMasterResponse.getMinimumResourceCapability as LimitedPrivate(MapReduce). I'm ok with either of the above. Replace SLOTS_MILLIS counters with MEM_MILLIS - Key: MAPREDUCE-5311 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5311 Project: Hadoop Map/Reduce Issue Type: Bug Components: applicationmaster Affects Versions: 2.0.4-alpha Reporter: Alejandro Abdelnur Assignee: Sandy Ryza Priority: Blocker Fix For: 2.1.0-beta Attachments: MAPREDUCE-5311-1.patch, MAPREDUCE-5311.patch, MAPREDUCE-5311.patch Per discussion in MAPREDUCE-5310 and comments in the code we should remove all the related logic and just leave the counter constant for backwards compatibility and deprecate the counter constants. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-5311) Replace SLOTS_MILLIS counters with MEM_MILLIS
[ https://issues.apache.org/jira/browse/MAPREDUCE-5311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13740577#comment-13740577 ] Arun C Murthy commented on MAPREDUCE-5311: -- This isn't a special case for 0.23. It will break tools built atop both hadoop-1.x hadoop-0.23.*. Replace SLOTS_MILLIS counters with MEM_MILLIS - Key: MAPREDUCE-5311 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5311 Project: Hadoop Map/Reduce Issue Type: Bug Components: applicationmaster Affects Versions: 2.0.4-alpha Reporter: Alejandro Abdelnur Assignee: Sandy Ryza Priority: Blocker Fix For: 2.1.0-beta Attachments: MAPREDUCE-5311-1.patch, MAPREDUCE-5311.patch, MAPREDUCE-5311.patch Per discussion in MAPREDUCE-5310 and comments in the code we should remove all the related logic and just leave the counter constant for backwards compatibility and deprecate the counter constants. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-5311) Replace SLOTS_MILLIS counters with MEM_MILLIS
[ https://issues.apache.org/jira/browse/MAPREDUCE-5311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13740598#comment-13740598 ] Arun C Murthy commented on MAPREDUCE-5311: -- I'm ok with Vinod's suggestion too. Replace SLOTS_MILLIS counters with MEM_MILLIS - Key: MAPREDUCE-5311 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5311 Project: Hadoop Map/Reduce Issue Type: Bug Components: applicationmaster Affects Versions: 2.0.4-alpha Reporter: Alejandro Abdelnur Assignee: Sandy Ryza Priority: Blocker Fix For: 2.1.0-beta Attachments: MAPREDUCE-5311-1.patch, MAPREDUCE-5311.patch, MAPREDUCE-5311.patch Per discussion in MAPREDUCE-5310 and comments in the code we should remove all the related logic and just leave the counter constant for backwards compatibility and deprecate the counter constants. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-5311) Replace SLOTS_MILLIS counters with MEM_MILLIS
[ https://issues.apache.org/jira/browse/MAPREDUCE-5311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13740613#comment-13740613 ] Arun C Murthy commented on MAPREDUCE-5311: -- [~tucu00] Can you please decide on one of the 3 options available which are reasonable? I'm ready to roll 2.1.0-rc2 ATM, this is the only issue I can think of. Thanks. Replace SLOTS_MILLIS counters with MEM_MILLIS - Key: MAPREDUCE-5311 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5311 Project: Hadoop Map/Reduce Issue Type: Bug Components: applicationmaster Affects Versions: 2.0.4-alpha Reporter: Alejandro Abdelnur Assignee: Sandy Ryza Priority: Blocker Fix For: 2.1.0-beta Attachments: MAPREDUCE-5311-1.patch, MAPREDUCE-5311.patch, MAPREDUCE-5311.patch Per discussion in MAPREDUCE-5310 and comments in the code we should remove all the related logic and just leave the counter constant for backwards compatibility and deprecate the counter constants. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-5311) Replace SLOTS_MILLIS counters with MEM_MILLIS
[ https://issues.apache.org/jira/browse/MAPREDUCE-5311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13740671#comment-13740671 ] Arun C Murthy commented on MAPREDUCE-5311: -- I'll repeat that I'm against multiple configs for the same property which leads to operator errors (admins setting one but forgetting to set the other) and am willing to live with other alternatives which seem better from a technical standpoint. bq. Why we don't use the Hadoop 1 JT properties for the same, in the mapred-site.xml documenting how they have to be set to the MIN of the scheduler for SLOT_MILLIS counter to kick in? To me this seem a much more correct way of doing it. ATM, I don't see where this argument is going at all - frankly, we seem to be making YARN-1004 much bigger deal than it is. Several folks have expressed concerns about multiple configs and yet we don't seem to converge; this is in spite of, IMO, very reasonable alternatives. I'm inclined to go ahead with rc2 without this YARN-1004. We can revisit both for 2.1.1 via deprecation of properties in YARN-1004 etc. if necessary. My plan is to cut an rc2 by noon PST tmrw (9/15). Let's revisit this after if we can agree today. Thanks. Replace SLOTS_MILLIS counters with MEM_MILLIS - Key: MAPREDUCE-5311 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5311 Project: Hadoop Map/Reduce Issue Type: Bug Components: applicationmaster Affects Versions: 2.0.4-alpha Reporter: Alejandro Abdelnur Assignee: Sandy Ryza Priority: Blocker Fix For: 2.1.0-beta Attachments: MAPREDUCE-5311-1.patch, MAPREDUCE-5311.patch, MAPREDUCE-5311.patch Per discussion in MAPREDUCE-5310 and comments in the code we should remove all the related logic and just leave the counter constant for backwards compatibility and deprecate the counter constants. -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Comment Edited] (MAPREDUCE-5311) Replace SLOTS_MILLIS counters with MEM_MILLIS
[ https://issues.apache.org/jira/browse/MAPREDUCE-5311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13740671#comment-13740671 ]

Arun C Murthy edited comment on MAPREDUCE-5311 at 8/15/13 5:04 AM:
-------------------------------------------------------------------

I'll repeat that I'm against multiple configs for the same property, which leads to operator errors (admins setting one but forgetting to set the other), and am willing to live with other alternatives that seem better from a technical standpoint.

bq. Why don't we use the Hadoop 1 JT properties for the same, documenting in mapred-site.xml how they have to be set to the MIN of the scheduler for the SLOT_MILLIS counter to kick in? To me this seems a much more correct way of doing it.

ATM, I don't see where this argument is going at all - frankly, we seem to be making YARN-1004 a much bigger deal than it is. Several folks have expressed concerns about multiple configs and yet we don't seem to converge; this is in spite of, IMO, very reasonable alternatives. I'm inclined to go ahead with rc2 without MAPREDUCE-5311 and YARN-1004. Of course, we can revisit both for 2.1.1 via the deprecation of yarn.scheduler.min being discussed in YARN-1004. I'll stress that we can continue talking and get both in once we all agree.

My plan is to cut an rc2 by noon PST tmrw (9/15). Let's revisit this discussion afterwards if we can't converge before then. Makes sense? Thanks.

was (Author: acmurthy):
I'll repeat that I'm against multiple configs for the same property, which leads to operator errors (admins setting one but forgetting to set the other), and am willing to live with other alternatives that seem better from a technical standpoint.

bq. Why don't we use the Hadoop 1 JT properties for the same, documenting in mapred-site.xml how they have to be set to the MIN of the scheduler for the SLOT_MILLIS counter to kick in? To me this seems a much more correct way of doing it.

ATM, I don't see where this argument is going at all - frankly, we seem to be making YARN-1004 a much bigger deal than it is. Several folks have expressed concerns about multiple configs and yet we don't seem to converge; this is in spite of, IMO, very reasonable alternatives. I'm inclined to go ahead with rc2 without this YARN-1004. We can revisit both for 2.1.1 via deprecation of properties in YARN-1004 etc. if necessary.

My plan is to cut an rc2 by noon PST tmrw (9/15). Let's revisit this afterwards if we can't agree today. Thanks.

Replace SLOTS_MILLIS counters with MEM_MILLIS
---------------------------------------------

                 Key: MAPREDUCE-5311
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5311
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: applicationmaster
    Affects Versions: 2.0.4-alpha
            Reporter: Alejandro Abdelnur
            Assignee: Sandy Ryza
            Priority: Blocker
             Fix For: 2.1.0-beta
         Attachments: MAPREDUCE-5311-1.patch, MAPREDUCE-5311.patch, MAPREDUCE-5311.patch

Per discussion in MAPREDUCE-5310 and the comments in the code, we should remove all the related logic, leave the counter constants in place for backwards compatibility, and deprecate them.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-5311) Replace SLOTS_MILLIS counters with MEM_MILLIS
[ https://issues.apache.org/jira/browse/MAPREDUCE-5311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738236#comment-13738236 ]

Arun C Murthy commented on MAPREDUCE-5311:
------------------------------------------

I've repeatedly pointed out that SLOT_MILLIS is very useful and is used by a lot of end-users. I know of a lot of users who have built rudimentary reporting/charge-back systems using SLOT_MILLIS. Breaking compatibility (semantic or api) is not an option. We've done a lot of work on MR to keep it compatible so far.

The more I think about it, the more strongly I believe YARN-787 went too far by removing 'minimum' from the API. We can/should keep RegisterApplicationMasterResponse.getMinimumResourceCapability as a LimitedPrivate api and use it only for the calculation of SLOT_MILLIS. We don't have to expose it to users or other applications. Makes sense?
[jira] [Commented] (MAPREDUCE-5311) Replace SLOTS_MILLIS counters with MEM_MILLIS
[ https://issues.apache.org/jira/browse/MAPREDUCE-5311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738528#comment-13738528 ]

Arun C Murthy commented on MAPREDUCE-5311:
------------------------------------------

Let's up-level. We agree we can't break compatibility for existing applications? If so, we can figure out ways to fix YARN-1004, MAPREDUCE-5310 etc. without breaking compatibility.
[jira] [Commented] (MAPREDUCE-5311) Replace SLOTS_MILLIS counters with MEM_MILLIS
[ https://issues.apache.org/jira/browse/MAPREDUCE-5311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738663#comment-13738663 ]

Arun C Murthy commented on MAPREDUCE-5311:
------------------------------------------

bq. As slots no longer exist in MR2, we cannot avoid breaking compatibility for some existing applications.

The original intention was to model the old notion of 'slots' as a minimum container. So, we have 2 cases:
# mapred.cluster.(map,reduce).memory.mb set to -1 for the job. In this case, we just divide the 'timing' by 1.
# mapred.cluster.(map,reduce).memory.mb set to a non-zero value. In this case, we normalize to yarn.scheduler.minimum-allocation-mb and divide the 'timing' by that number.

Makes sense?
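The two cases above can be sketched as follows. This is an illustrative sketch of the normalization being described, not the actual MR2 counter code; the class and method names are hypothetical.

```java
// Hypothetical sketch of the SLOTS_MILLIS calculation discussed above:
// model a "slot" as one minimum container allocation
// (yarn.scheduler.minimum-allocation-mb) and scale wall-clock time by
// the number of such slots the task's memory occupies.
public class SlotMillisSketch {

    static long slotMillis(long taskMillis, int taskMemMb, int minAllocMb) {
        if (taskMemMb <= 0) {
            // Case 1: mapred.cluster.(map,reduce).memory.mb is -1 for the
            // job - divide the timing by 1, i.e. count a single slot.
            return taskMillis;
        }
        // Case 2: normalize the task's memory to the scheduler minimum
        // allocation, rounding up, and scale the timing by that number.
        int slots = (taskMemMb + minAllocMb - 1) / minAllocMb;
        return taskMillis * slots;
    }

    public static void main(String[] args) {
        // A 10s task asking for 2048 MB on a cluster whose minimum
        // allocation is 1024 MB occupies 2 "slots".
        System.out.println(slotMillis(10_000, 2048, 1024)); // 20000
        // Unset job memory falls back to one slot.
        System.out.println(slotMillis(10_000, -1, 1024));   // 10000
    }
}
```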
[jira] [Commented] (MAPREDUCE-5311) Replace SLOTS_MILLIS counters with MEM_MILLIS
[ https://issues.apache.org/jira/browse/MAPREDUCE-5311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13735003#comment-13735003 ]

Arun C Murthy commented on MAPREDUCE-5311:
------------------------------------------

We seem to agree that we can't remove SLOTS_MILLIS_MAPS - that would break BC. For now, let's punt on this for 2.1.0-beta, since we can add new counters (CONTAINERS_MILLIS_MAPS) after a bit more discussion.
[jira] [Updated] (MAPREDUCE-5311) Replace SLOTS_MILLIS counters with MEM_MILLIS
[ https://issues.apache.org/jira/browse/MAPREDUCE-5311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated MAPREDUCE-5311:
-------------------------------------
    Fix Version/s:     (was: 2.1.0-beta)
[jira] [Updated] (MAPREDUCE-5311) Replace SLOTS_MILLIS counters with MEM_MILLIS
[ https://issues.apache.org/jira/browse/MAPREDUCE-5311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated MAPREDUCE-5311:
-------------------------------------
    Target Version/s:     (was: 2.1.0-beta)
[jira] [Commented] (MAPREDUCE-5450) Unnecessary Configuration instantiation in IFileInputStream slows down merge - Port to branch-1
[ https://issues.apache.org/jira/browse/MAPREDUCE-5450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733826#comment-13733826 ]

Arun C Murthy commented on MAPREDUCE-5450:
------------------------------------------

Sorry, I meant for 1.2.2 (if necessary).

Unnecessary Configuration instantiation in IFileInputStream slows down merge - Port to branch-1
-----------------------------------------------------------------------------------------------

                 Key: MAPREDUCE-5450
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5450
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: mrv1
    Affects Versions: 1.1.0
            Reporter: Stanislav Barton
            Assignee: Stanislav Barton
            Priority: Blocker
             Fix For: 1.3.0
         Attachments: MAPREDUCE-5450-1.1.0.txt, mapreduce-5450.txt

We are using hadoop-2.0.0+1357-1.cdh4.3.0.p0.21 with MRv1. After the upgrade from 4.1.2 to 4.3.0, I have noticed some performance deterioration in our MR job in the Reduce phase. The MR job usually has 10 000 map tasks (10 000 files on input, each about 100MB) and 6 000 reducers (one reducer per table region). I was trying to figure out at which phase the slowdown appears (at first I suspected that the slow gathering of the map output files was the culprit) and found out that the problem is not reading the map output (the shuffle) but the sort/merge phase that follows - the last and actual reduce phase is fast. I tried raising io.sort.factor because I thought lots of small files were being merged on disk, but upping it to 1000 didn't make any difference.

I then printed the stack trace and found out that the problem is the initialization of org.apache.hadoop.mapred.IFileInputStream, namely the creation of the Configuration object, which is not propagated along from the earlier context. See the stack trace:

Thread 13332: (state = IN_NATIVE)
 - java.io.UnixFileSystem.getBooleanAttributes0(java.io.File) @bci=0 (Compiled frame; information may be imprecise)
 - java.io.UnixFileSystem.getBooleanAttributes(java.io.File) @bci=2, line=228 (Compiled frame)
 - java.io.File.exists() @bci=20, line=733 (Compiled frame)
 - sun.misc.URLClassPath$FileLoader.getResource(java.lang.String, boolean) @bci=136, line=999 (Compiled frame)
 - sun.misc.URLClassPath$FileLoader.findResource(java.lang.String, boolean) @bci=3, line=966 (Compiled frame)
 - sun.misc.URLClassPath.findResource(java.lang.String, boolean) @bci=17, line=146 (Compiled frame)
 - java.net.URLClassLoader$2.run() @bci=12, line=385 (Compiled frame)
 - java.security.AccessController.doPrivileged(java.security.PrivilegedAction, java.security.AccessControlContext) @bci=0 (Compiled frame)
 - java.net.URLClassLoader.findResource(java.lang.String) @bci=13, line=382 (Compiled frame)
 - java.lang.ClassLoader.getResource(java.lang.String) @bci=30, line=1002 (Compiled frame)
 - java.lang.ClassLoader.getResourceAsStream(java.lang.String) @bci=2, line=1192 (Compiled frame)
 - javax.xml.parsers.SecuritySupport$4.run() @bci=26, line=96 (Compiled frame)
 - java.security.AccessController.doPrivileged(java.security.PrivilegedAction) @bci=0 (Compiled frame)
 - javax.xml.parsers.SecuritySupport.getResourceAsStream(java.lang.ClassLoader, java.lang.String) @bci=10, line=89 (Compiled frame)
 - javax.xml.parsers.FactoryFinder.findJarServiceProvider(java.lang.String) @bci=38, line=250 (Interpreted frame)
 - javax.xml.parsers.FactoryFinder.find(java.lang.String, java.lang.String) @bci=273, line=223 (Interpreted frame)
 - javax.xml.parsers.DocumentBuilderFactory.newInstance() @bci=4, line=123 (Compiled frame)
 - org.apache.hadoop.conf.Configuration.loadResource(java.util.Properties, org.apache.hadoop.conf.Configuration$Resource, boolean) @bci=16, line=1890 (Compiled frame)
 - org.apache.hadoop.conf.Configuration.loadResources(java.util.Properties, java.util.ArrayList, boolean) @bci=49, line=1867 (Compiled frame)
 - org.apache.hadoop.conf.Configuration.getProps() @bci=43, line=1785 (Compiled frame)
 - org.apache.hadoop.conf.Configuration.get(java.lang.String) @bci=35, line=712 (Compiled frame)
 - org.apache.hadoop.conf.Configuration.getTrimmed(java.lang.String) @bci=2, line=731 (Compiled frame)
 - org.apache.hadoop.conf.Configuration.getBoolean(java.lang.String, boolean) @bci=2, line=1047 (Interpreted frame)
 - org.apache.hadoop.mapred.IFileInputStream.init(java.io.InputStream, long, org.apache.hadoop.conf.Configuration) @bci=111, line=93 (Interpreted frame)
 - org.apache.hadoop.mapred.IFile$Reader.init(org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.FSDataInputStream, long, org.apache.hadoop.io.compress.CompressionCodec, org.apache.hadoop.mapred.Counters$Counter) @bci=60, line=303 (Interpreted frame)
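The stack trace shows each IFileInputStream constructing a fresh Configuration, which re-parses the *-site.xml resources on the classpath via DocumentBuilderFactory. A minimal sketch of that cost pattern, and of the fix direction of threading the caller's existing Configuration through instead, is below; the classes here are simplified stand-ins, not Hadoop's actual code.

```java
import java.util.HashMap;
import java.util.Map;

public class ConfReuseSketch {
    // Counts how many times the expensive resource-loading step runs.
    static int loads = 0;

    // Stand-in for org.apache.hadoop.conf.Configuration: construction is
    // costly because it loads and parses XML config resources.
    static class Configuration {
        final Map<String, String> props = new HashMap<>();
        Configuration() {
            loads++; // simulates loadResources() parsing *-site.xml
            props.put("ifile.readahead", "true");
        }
        boolean getBoolean(String key, boolean dflt) {
            String v = props.get(key);
            return v == null ? dflt : Boolean.parseBoolean(v);
        }
    }

    // Problematic pattern from the report: a new Configuration is built
    // for every stream opened during the merge.
    static void openStreamsPerInstanceConf(int n) {
        for (int i = 0; i < n; i++) {
            new Configuration().getBoolean("ifile.readahead", false);
        }
    }

    // Fix direction: build the Configuration once in the caller and pass
    // it down to every reader.
    static void openStreamsSharedConf(int n) {
        Configuration conf = new Configuration();
        for (int i = 0; i < n; i++) {
            conf.getBoolean("ifile.readahead", false);
        }
    }

    public static void main(String[] args) {
        loads = 0;
        openStreamsPerInstanceConf(1000);
        int perInstance = loads; // one expensive load per stream
        loads = 0;
        openStreamsSharedConf(1000);
        int shared = loads;      // a single load, reused by all streams
        System.out.println(perInstance + " loads vs " + shared + " load");
        if (perInstance != 1000 || shared != 1) throw new AssertionError();
    }
}
```

With thousands of merge streams per reducer, the per-instance variant multiplies the XML-parsing cost by the stream count, which matches the slowdown observed in the sort/merge phase.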
[jira] [Commented] (MAPREDUCE-5450) Unnecessary Configuration instantiation in IFileInputStream slows down merge - Port to branch-1
[ https://issues.apache.org/jira/browse/MAPREDUCE-5450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733823#comment-13733823 ]

Arun C Murthy commented on MAPREDUCE-5450:
------------------------------------------

Let's get [~stanislav.barton]'s original patch into branch-1.2 for 1.2.1. Any objections?
[jira] [Updated] (MAPREDUCE-5450) Unnecessary Configuration instantiation in IFileInputStream slows down merge - Port to branch-1
[ https://issues.apache.org/jira/browse/MAPREDUCE-5450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated MAPREDUCE-5450:
-------------------------------------
    Fix Version/s:     (was: 1.3.0)
                       1.2.1
[jira] [Commented] (MAPREDUCE-5450) Unnecessary Configuration instantiation in IFileInputStream slows down merge - Port to branch-1
[ https://issues.apache.org/jira/browse/MAPREDUCE-5450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733834#comment-13733834 ]

Arun C Murthy commented on MAPREDUCE-5450:
------------------------------------------

I merged this to branch-1.2 too. Thanks guys!
[jira] [Updated] (MAPREDUCE-5450) Unnecessary Configuration instantiation in IFileInputStream slows down merge - Port to branch-1
[ https://issues.apache.org/jira/browse/MAPREDUCE-5450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated MAPREDUCE-5450:
-------------------------------------
    Fix Version/s:     (was: 1.2.1)
                       1.2.2
[jira] [Commented] (MAPREDUCE-5311) Replace SLOTS_MILLIS counters with MEM_MILLIS
[ https://issues.apache.org/jira/browse/MAPREDUCE-5311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13731444#comment-13731444 ] Arun C Murthy commented on MAPREDUCE-5311: -- This will break every single MR app. I'm -1 on this. Replace SLOTS_MILLIS counters with MEM_MILLIS - Key: MAPREDUCE-5311 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5311 Project: Hadoop Map/Reduce Issue Type: Bug Components: applicationmaster Affects Versions: 2.0.4-alpha Reporter: Alejandro Abdelnur Assignee: Sandy Ryza Priority: Blocker Fix For: 2.1.0-beta Attachments: MAPREDUCE-5311-1.patch, MAPREDUCE-5311.patch, MAPREDUCE-5311.patch Per discussion in MAPREDUCE-5310 and comments in the code we should remove all the related logic and just leave the counter constant for backwards compatibility and deprecate the counter constants. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-5311) Replace SLOTS_MILLIS counters with MEM_MILLIS
[ https://issues.apache.org/jira/browse/MAPREDUCE-5311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13731445#comment-13731445 ] Arun C Murthy commented on MAPREDUCE-5311: -- To Sandy's point - several folks do use mapred.cluster.(map,reduce).memory.mb, so it will be a big BC break for them. Also, what does this mean for BC after we spent all the time fixing MAPREDUCE-5108?
[jira] [Commented] (MAPREDUCE-5399) Large number of map tasks cause slow sort at reduce phase, invariant to amount of data to sort
[ https://issues.apache.org/jira/browse/MAPREDUCE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13730102#comment-13730102 ] Arun C Murthy commented on MAPREDUCE-5399: -- [~sandyr] Can you pls commit this to trunk, branch-2, branch-2.1-beta and branch-2.1.0-beta? Thanks! Large number of map tasks cause slow sort at reduce phase, invariant to amount of data to sort -- Key: MAPREDUCE-5399 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5399 Project: Hadoop Map/Reduce Issue Type: Bug Components: mrv1, mrv2 Affects Versions: 1.1.0, 2.0.2-alpha Reporter: Stanislav Barton Assignee: Stanislav Barton Priority: Blocker Attachments: MAPREDUCE-5399.patch We are using hadoop-2.0.0+1357-1.cdh4.3.0.p0.21 with MRv1. After the upgrade from 4.1.2 to 4.3.0, I noticed some performance deterioration in our MR job in the reduce phase. The MR job usually has 10 000 map tasks (10 000 files on input, each about 100MB) and 6 000 reducers (one reducer per table region). I was trying to figure out at which phase the slowdown appears (at first I suspected that the slow gathering of the 1 map output files was the culprit) and found out that the problem is not reading the map output (the shuffle) but the sort/merge phase that follows - the actual reduce phase at the end is fast. I tried upping io.sort.factor because I thought lots of small files were being merged on disk, but raising it to 1000 didn't make any difference.
[jira] [Updated] (MAPREDUCE-5399) Large number of map tasks cause slow sort at reduce phase, invariant to amount of data to sort
[ https://issues.apache.org/jira/browse/MAPREDUCE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated MAPREDUCE-5399: - Target Version/s: 2.1.0-beta Large number of map tasks cause slow sort at reduce phase, invariant to amount of data to sort -- Key: MAPREDUCE-5399 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5399 Project: Hadoop Map/Reduce Issue Type: Bug Components: mrv1, mrv2 Affects Versions: 1.1.0, 2.0.2-alpha Reporter: Stanislav Barton Assignee: Stanislav Barton Priority: Blocker Attachments: MAPREDUCE-5399.patch
[jira] [Updated] (MAPREDUCE-5450) Unnecessary Configuration instantiation in IFileInputStream slows down merge - Port to branch-1
[ https://issues.apache.org/jira/browse/MAPREDUCE-5450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated MAPREDUCE-5450: - Component/s: (was: mrv2) Unnecessary Configuration instantiation in IFileInputStream slows down merge - Port to branch-1 --- Key: MAPREDUCE-5450 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5450 Project: Hadoop Map/Reduce Issue Type: Bug Components: mrv1 Affects Versions: 1.1.0, 2.0.2-alpha Reporter: Stanislav Barton Assignee: Stanislav Barton Priority: Blocker Fix For: 2.1.0-beta We are using hadoop-2.0.0+1357-1.cdh4.3.0.p0.21 with MRv1. After the upgrade from 4.1.2 to 4.3.0, I noticed some performance deterioration in our MR job in the reduce phase. The MR job usually has 10 000 map tasks (10 000 files on input, each about 100MB) and 6 000 reducers (one reducer per table region). I was trying to figure out at which phase the slowdown appears (at first I suspected that the slow gathering of the 1 map output files was the culprit) and found out that the problem is not reading the map output (the shuffle) but the sort/merge phase that follows - the actual reduce phase at the end is fast. I tried upping io.sort.factor because I thought lots of small files were being merged on disk, but raising it to 1000 didn't make any difference.
[jira] [Updated] (MAPREDUCE-5450) Unnecessary Configuration instantiation in IFileInputStream slows down merge - Port to branch-1
[ https://issues.apache.org/jira/browse/MAPREDUCE-5450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated MAPREDUCE-5450: - Affects Version/s: (was: 2.0.2-alpha) Unnecessary Configuration instantiation in IFileInputStream slows down merge - Port to branch-1 --- Key: MAPREDUCE-5450 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5450 Project: Hadoop Map/Reduce Issue Type: Bug Components: mrv1 Affects Versions: 1.1.0 Reporter: Stanislav Barton Assignee: Stanislav Barton Priority: Blocker Fix For: 2.1.0-beta