Re: [DISCUSS] About creation of Hadoop Thirdparty repository for shaded artifacts

2019-09-27 Thread Owen O'Malley
I'm very unhappy with this direction. In particular, I don't think git is a
good place for distribution of binary artifacts. Furthermore, the PMC
shouldn't be releasing anything without a release vote.

I'd propose that we make a third party module that contains the *source* of
the pom files to build the relocated jars. This should absolutely be
treated as a last resort, mostly for the Google projects that regularly
break binary compatibility (e.g. Protobuf & Guava).

In terms of naming, I'd propose something like:

org.apache.hadoop.thirdparty.protobuf2_5
org.apache.hadoop.thirdparty.guava28

In particular, I think we absolutely need to include the version of the
underlying project. On the other hand, since we should not be shading
*everything* we can drop the leading com.google.

The Hadoop project can make releases of the thirdparty module:

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-thirdparty-protobuf25</artifactId>
  <version>1.0</version>
</dependency>

Note that the version has to be the hadoop thirdparty release number, which
is part of why you need to have the underlying version in the artifact
name. These we can push to maven central as new releases from Hadoop.

Thoughts?

.. Owen

On Fri, Sep 27, 2019 at 8:38 AM Vinayakumar B 
wrote:

> Hi All,
>
>I wanted to discuss the separate repo for thirdparty dependencies
> which we need to shade and include in Hadoop components' jars.
>
>Apologies for the big text ahead, but this needs clear explanation!!
>
>Right now the most needed such dependency is protobuf. The protobuf
> dependency has not been upgraded past 2.5.0 for fear that downstream builds,
> which depend on the transitive protobuf dependency coming from Hadoop's jars,
> may fail with the upgrade. Apparently protobuf does not guarantee source
> compatibility, though it guarantees wire compatibility between versions.
> Because of this behavior, a version upgrade may cause breakage in known and
> unknown (private?) downstreams.
>
>So to tackle this, we came up with the following proposal in HADOOP-13363.
>
>Luckily, as far as I know, no APIs, either public to users or between
> Hadoop processes, directly use protobuf classes in their signatures. (If
> any exist, please let us know.)
>
>Proposal:
>
>
>1. Create artifact(s) which contain the shaded dependencies. All such
> shading/relocation will use the known prefix
> **org.apache.hadoop.thirdparty.**.
>2. To start with, a protobuf jar (ex: o.a.h.thirdparty:hadoop-shaded-protobuf):
> all **com.google.protobuf** classes will be relocated to
> **org.apache.hadoop.thirdparty.com.google.protobuf**.
>3. Hadoop modules which need protobuf as a dependency will add this
> shaded artifact as a dependency (ex:
> o.a.h.thirdparty:hadoop-shaded-protobuf).
>4. All previous usages of "com.google.protobuf" will be relocated to
> "org.apache.hadoop.thirdparty.com.google.protobuf" in the code and will be
> committed. Please note, this replacement is one-time, directly in the source
> code, NOT during compile and package (see the sketch after this list).
>5. Once all usages of "com.google.protobuf" are relocated, Hadoop no
> longer cares which version of the original "protobuf-java" is in the
> dependency tree.
>6. Keep "protobuf-java:2.5.0" in the dependency tree so as not to break
> downstreams, but Hadoop itself will use the latest protobuf present in
> "o.a.h.thirdparty:hadoop-shaded-protobuf".
>
>7. Coming back to the separate repo, following are the main reasons
> for keeping the shaded dependency artifacts in a separate repo instead of a
> submodule.
>
>   7a. These artifacts need not be built all the time; they need to be
> built only when the dependency version or the build process changes.
>   7b. If added as a "submodule in the Hadoop repo", maven-shade-plugin:shade
> will execute only in the package phase. That means "mvn compile" or "mvn
> test-compile" will fail, since the artifact will not yet contain the
> relocated classes, only the original ones. The workaround, building the
> thirdparty submodule first and excluding it from other executions, would be
> a complex process compared to keeping it in a separate repo.
>
>   7c. The separate repo will be a subproject of Hadoop, using the same
> HADOOP jira project, with different versioning prefixed with "thirdparty-"
> (ex: thirdparty-1.0.0).
>   7d. The separate repo will have the same release process as Hadoop.
>
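>To illustrate step 4, a minimal sketch of code after the one-time source
> relocation (assuming the o.a.h.thirdparty:hadoop-shaded-protobuf artifact
> is on the classpath; the class below is only an example):
>
>    // before: import com.google.protobuf.ByteString;
>    import org.apache.hadoop.thirdparty.com.google.protobuf.ByteString;
>
>    public class ShadedProtobufExample {
>      public static void main(String[] args) {
>        // same protobuf API, just served from the relocated package
>        ByteString bs = ByteString.copyFromUtf8("hello");
>        System.out.println(bs.size());   // 5
>      }
>    }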
>
> HADOOP-13363 (https://issues.apache.org/jira/browse/HADOOP-13363) is an
> umbrella jira tracking the changes for the protobuf upgrade.
>
> A PR (https://github.com/apache/hadoop-thirdparty/pull/1) has been
> raised for the separate repo creation in HADOOP-16595
> (https://issues.apache.org/jira/browse/HADOOP-16595).
>
> Please provide your input on the proposal and review the PR so we can
> proceed.
>
>
>-Thanks,
> Vinay
>
> On Fri, Sep 27, 2019 at 11:54 AM Vinod Kumar Vavilapalli <
> vino...@apache.org>
> wrote:
>
> > Moving the 

Re: [VOTE] Moving Submarine to a separate Apache project proposal

2019-09-06 Thread Owen O'Malley
Since you don't have any Apache Members, I'll join to provide Apache
oversight.

.. Owen

On Fri, Sep 6, 2019 at 1:38 PM Owen O'Malley  wrote:

> +1 for moving to a new project.
>
> On Sat, Aug 31, 2019 at 10:19 PM Wangda Tan  wrote:
>
>> Hi all,
>>
>> As we discussed in the previous thread [1],
>>
>> I just moved the spin-off proposal to CWIKI and completed all TODO parts.
>>
>>
>> https://cwiki.apache.org/confluence/display/HADOOP/Submarine+Project+Spin-Off+to+TLP+Proposal
>>
>> If you are interested in learning more about this, please review the
>> proposal and let me know if you have any questions/suggestions. This will
>> be sent to the board once voting passes. (And please note that the
>> previous voting thread [2] to move Submarine to a separate Github repo is
>> a necessary effort to move Submarine to a separate Apache project, but
>> not sufficient, so I sent two separate voting threads.)
>>
>> Please let me know if I missed anyone in the proposal, and reply if you'd
>> like to be included in the project.
>>
>> This voting runs for 7 days and will be concluded at Sep 7th, 11 PM PDT.
>>
>> Thanks,
>> Wangda Tan
>>
>> [1]
>>
>> https://lists.apache.org/thread.html/4a2210d567cbc05af92c12aa6283fd09b857ce209d537986ed800029@%3Cyarn-dev.hadoop.apache.org%3E
>> [2]
>>
>> https://lists.apache.org/thread.html/6e94469ca105d5a15dc63903a541bd21c7ef70b8bcff475a16b5ed73@%3Cyarn-dev.hadoop.apache.org%3E
>>
>


Re: [VOTE] Moving Submarine to a separate Apache project proposal

2019-09-06 Thread Owen O'Malley
+1 for moving to a new project.

On Sat, Aug 31, 2019 at 10:19 PM Wangda Tan  wrote:

> Hi all,
>
> As we discussed in the previous thread [1],
>
> I just moved the spin-off proposal to CWIKI and completed all TODO parts.
>
>
> https://cwiki.apache.org/confluence/display/HADOOP/Submarine+Project+Spin-Off+to+TLP+Proposal
>
> If you are interested in learning more about this, please review the
> proposal and let me know if you have any questions/suggestions. This will
> be sent to the board once voting passes. (And please note that the previous
> voting thread [2] to move Submarine to a separate Github repo is a necessary
> effort to move Submarine to a separate Apache project, but not sufficient,
> so I sent two separate voting threads.)
>
> Please let me know if I missed anyone in the proposal, and reply if you'd
> like to be included in the project.
>
> This voting runs for 7 days and will be concluded at Sep 7th, 11 PM PDT.
>
> Thanks,
> Wangda Tan
>
> [1]
>
> https://lists.apache.org/thread.html/4a2210d567cbc05af92c12aa6283fd09b857ce209d537986ed800029@%3Cyarn-dev.hadoop.apache.org%3E
> [2]
>
> https://lists.apache.org/thread.html/6e94469ca105d5a15dc63903a541bd21c7ef70b8bcff475a16b5ed73@%3Cyarn-dev.hadoop.apache.org%3E
>


Re: [VOTE] Merging branch HDFS-7240 to trunk

2018-03-14 Thread Owen O'Malley
This discussion seems to have died down, coming closer to consensus but
without a resolution.

I'd like to propose the following compromise:

* HDSL become a subproject of Hadoop.
* HDSL will release separately from Hadoop. Hadoop releases will not
contain HDSL and vice versa.
* HDSL will get its own jira instance so that the release tags stay
separate.
* On trunk (as opposed to release branches) HDSL will be a separate module
in Hadoop's source tree. This will enable the HDSL team to work on their
trunk and the Hadoop trunk without making releases for every change.
* Hadoop's trunk will only build HDSL if a non-default profile is enabled.
* When Hadoop creates a release branch, the RM will delete the HDSL module
from the branch.
* HDSL will have their own Yetus checks and won't cause failures in the
Hadoop patch check.

I think this accomplishes most of the goals of encouraging HDSL development
while minimizing the potential for disruption of HDFS development.

Thoughts? Andrew, Jitendra, & Sanjay?

Thanks,
   Owen


Re: [VOTE] Merging branch HDFS-7240 to trunk

2018-03-09 Thread Owen O'Malley
Hi Joep,

On Tue, Mar 6, 2018 at 6:50 PM, J. Rottinghuis 
wrote:

Obviously when people do want to use Ozone, then having it in the same repo
> is easier. The flipside is that, separate top-level project in the same
> repo or not, it adds to the Hadoop releases.
>

Apache projects are about the group of people who are working together.
There is a large overlap between the team working on HDFS and Ozone, which
is a lot of the motivation to keep project overhead to a minimum and not
start a new project.

Using the same releases or separate releases is a distinct choice. Many
Apache projects, such as Commons and Maven, have multiple artifacts that
release independently. In Hive, we have two sub-projects that release
independently: Hive Storage API, and Hive.

One thing we did during that split to minimize the challenges to the
developers was that Storage API and Hive have the same master branch.
However, since they have different releases, they have their own release
branches and release numbers.

If there is a change in Ozone and a new release needed, it would have to
> wait for a Hadoop release. Ditto if there is a Hadoop release and there is
> an issue with Ozone. The case that one could turn off Ozone through a Maven
> profile works only to some extent.
> If we have done a 3.x release with Ozone in it, would it make sense to do
> a 3.y release with y>x without Ozone in it? That would be weird.
>

Actually, if Ozone is marked as unstable/evolving (we should actually have
an even stronger warning for a feature preview), we could remove it in a
3.x. If a user picks up a feature before it is stable, we try to provide a
stable platform, but mistakes happen. Introducing an incompatible change to
the Ozone API between 3.1 and 3.2 wouldn't be good, but it wouldn't be the
end of the world.

.. Owen


Re: [VOTE] Merging branch HDFS-7240 to trunk

2018-03-02 Thread Owen O'Malley
On Thu, Mar 1, 2018 at 11:03 PM, Andrew Wang 
wrote:

Owen mentioned making a Hadoop subproject; we'd have to
> hash out what exactly this means (I assume a separate repo still managed by
> the Hadoop project), but I think we could make this work if it's more
> attractive than incubation or a new TLP.


Ok, there are multiple levels of sub-projects that all make sense:

   - Same source tree, same releases - examples like HDFS & YARN
   - Same master branch, separate releases and release branches - Hive's
   Storage API vs Hive. It is in the source tree for the master branch, but
   has distinct releases and release branches.
   - Separate source, separate release - Apache Commons.

There are advantages and disadvantages to each. I'd propose that we use the
same source, same release pattern for Ozone. Note that we tried and later
reverted doing Common, HDFS, and YARN as separate source, separate release
because it was too much trouble. I like Daryn's idea of putting it as a top
level directory in Hadoop and making sure that nothing in Common, HDFS, or
YARN depend on it. That way if a Release Manager doesn't think it is ready
for release, it can be trivially removed before the release.

One thing about using the same releases, Sanjay and Jitendra are signing up
to make much more regular bugfix and minor releases in the near future. For
example, they'll need to make 3.2 relatively soon to get it released and
then 3.3 somewhere in the next 3 to 6 months. That would be good for the
project. Hadoop needs more regular releases and fewer big bang releases.

.. Owen


Re: [VOTE] Merging branch HDFS-7240 to trunk

2018-03-01 Thread Owen O'Malley
I think it would be good to get this in sooner rather than later, but I
have some thoughts.

   1. It is hard to tell what has changed. git rebase -i tells me the
   branch has 722 commits. The rebase failed with a conflict. It would really
   help if you rebased to current trunk.
   2. I think Ozone would be a good Hadoop subproject, but it should be
   outside of HDFS.
   3. CBlock, which is also coming in this merge, would benefit from more
   separation from HDFS.
   4. What are the new transitive dependencies that Ozone, HDSL, and CBlock are
   adding to the clients? The servers matter too, but the client dependencies
   have a huge impact on our users.
   5. Have you checked the new dependencies for compatibility with ASL?


On Thu, Mar 1, 2018 at 2:45 PM, Clay B.  wrote:

> Oops, retrying now subscribed to more than solely yarn-dev.
>
> -Clay
>
>
> On Wed, 28 Feb 2018, Clay B. wrote:
>
> +1 (non-binding)
>>
>> I have walked through the code and find it very compelling as a user; I
>> really look forward to seeing the Ozone code mature and to it maturing HDFS
>> features together. The points which excite me as an eight-year HDFS user
>> are:
>>
>> * Excitement for making the datanode a storage technology container - this
>>  patch clearly brings fresh thought to HDFS, keeping it from growing stale
>>
>> * Ability to build upon a shared storage infrastructure for diverse
>>  loads: I do not want to have "stranded" storage capacity or have to
>>  manage competing storage systems on the same disks (and further I want
>>  the metrics datanodes can provide me today, so I do not have to
>>  instrument two systems or evolve their instrumentation separately).
>>
>> * Looking forward to supporting object-sized files!
>>
>> * Moves HDFS in the right direction to test out new block management
>>  techniques for scaling HDFS. I am really excited to see the raft
>>  integration; I hope it opens a new era in Hadoop matching modern systems
>>  design with new consistency and replication options in our ever
>>  distributed ecosystem.
>>
>> -Clay
>>
>> On Mon, 26 Feb 2018, Jitendra Pandey wrote:
>>
>>Dear folks,
>>>   We would like to start a vote to merge HDFS-7240 branch into
>>> trunk. The context can be reviewed in the DISCUSSION thread, and in the
>>> jiras (See references below).
>>>
>>>HDFS-7240 introduces Hadoop Distributed Storage Layer (HDSL), which
>>> is a distributed, replicated block layer.
>>>The old HDFS namespace and NN can be connected to this new block
>>> layer as we have described in HDFS-10419.
>>>We also introduce a key-value namespace called Ozone built on HDSL.
>>>
>>>The code is in a separate module and is turned off by default. In a
>>> secure setup, HDSL and Ozone daemons cannot be started.
>>>
>>>The detailed documentation is available at
>>> https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+
>>> Distributed+Storage+Layer+and+Applications
>>>
>>>
>>>I will start with my vote.
>>>+1 (binding)
>>>
>>>
>>>Discussion Thread:
>>>  https://s.apache.org/7240-merge
>>>  https://s.apache.org/4sfU
>>>
>>>Jiras:
>>>   https://issues.apache.org/jira/browse/HDFS-7240
>>>   https://issues.apache.org/jira/browse/HDFS-10419
>>>   https://issues.apache.org/jira/browse/HDFS-13074
>>>   https://issues.apache.org/jira/browse/HDFS-13180
>>>
>>>
>>>Thanks
>>>jitendra
>>>
>>>
>>>
>>>
>>>
>>>DISCUSSION THREAD SUMMARY :
>>>
>>>On 2/13/18, 6:28 PM, "sanjay Radia" 
>>> wrote:
>>>
>>>Sorry, the formatting got messed up by my email client. Here
>>> it is again.
>>>
>>>
>>>Dear Hadoop Community Members,
>>>
>>>   We had multiple community discussions, a few meetings
>>> in smaller groups and also jira discussions with respect to this thread. We
>>> express our gratitude for participation and valuable comments.
>>>
>>>The key questions raised were the following:
>>>1) How do the new block storage layer and OzoneFS benefit
>>> HDFS? We were asked to chalk out a roadmap towards the goal of a
>>> scalable namenode working with the new storage layer.
>>>2) We were asked to provide a security design.
>>>3) There were questions around stability, given Ozone
>>> brings in a large body of code.
>>>4) Why can't they be separate projects forever or merged
>>> in when production ready?
>>>
>>>We have responded to all the above questions with
>>> detailed explanations and answers on the jira as well as in the
>>> discussions. We believe that should sufficiently address the community's
>>> concerns.
>>>
>>>Please see the summary below:
>>>
>>>1) The new code base benefits HDFS scaling and a roadmap
>>> has been provided.
>>>
>>>

Re: Shuffler logic implementation

2017-03-14 Thread Owen O'Malley
That is under your application's control. Define a class that implements
Partitioner
https://hadoop.apache.org/docs/r2.6.3/api/org/apache/hadoop/mapreduce/Partitioner.html
and set the name of the class in your job's configuration using
job.setPartitionerClass(...).
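
A minimal sketch, assuming Text keys and IntWritable values (the class name
MyPartitioner is just an example); the modulo logic below mirrors the
default HashPartitioner behavior asked about:

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Partitioner;

  public class MyPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
      // hash(key) % (number of reducers), kept non-negative
      return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }

  // in the driver: job.setPartitionerClass(MyPartitioner.class);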

.. Owen

On Tue, Mar 14, 2017 at 2:51 PM, Pushparaj Motamari 
wrote:

> Hi,
>
> I want to understand the implementation in the code which assigns
> particular keys to particular reducers. I mean, the code which provides the
> logic of assigning a key to a reducer, to which mappers will send their
> key/value pairs after mapping. Will it assign based on
> hash(key) % (number of reducers)?
>
> Regards
>
> Pushparaj
>


Re: Moving to JDK7, JDK8 and new major releases

2014-06-25 Thread Owen O'Malley
On Tue, Jun 24, 2014 at 4:44 PM, Alejandro Abdelnur t...@cloudera.com
wrote:

 After reading this thread and thinking a bit about it, I think such a move
 up to JDK7 in Hadoop should be OK.


I agree with Alejandro. Changing the minimum JDK is not an incompatible
change and is fine in the 2 branch. (Although I think it would *not* be
appropriate for a patch release.) Of course we need to do it with
forethought and testing, but moving off of JDK 6, which is EOL'ed, is a good
thing. Moving to Java 8 as a minimum seems much too aggressive and I would
push back on that.

I also think that we need to let the dust settle on the Hadoop 2 line for
a while before we talk about Hadoop 3. It seems that it has only been in
the last 6 months that Hadoop 2 adoption has reached the main stream users.
Our user community needs time to digest the changes in Hadoop 2.x before we
fracture the community by starting to discuss Hadoop 3 releases.

.. Owen


[jira] [Created] (MAPREDUCE-5490) MapReduce doesn't set the environment variable for children processes

2013-08-30 Thread Owen O'Malley (JIRA)
Owen O'Malley created MAPREDUCE-5490:


 Summary: MapReduce doesn't set the environment variable for 
children processes
 Key: MAPREDUCE-5490
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5490
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Affects Versions: 1.2.1
Reporter: Owen O'Malley
Assignee: Owen O'Malley


Currently, MapReduce uses the command line argument to pass the classpath to 
the child. This breaks if the process forks a child that needs the same 
classpath. Such a case happens in Hive when it uses map-side joins. I propose 
that we make MapReduce in branch-1 use the CLASSPATH environment variable like 
YARN does.
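
A hedged sketch of the difference (the class and "MyTask" below are
illustrative only, not the actual MapReduce launch code): an environment
variable set on a child survives that child's own forks, while a -classpath
argument must be re-passed by every fork.

{code}
// sketch: propagate the classpath via the environment rather than argv
public class EnvClasspathLaunch {
  public static void main(String[] args) throws Exception {
    ProcessBuilder pb = new ProcessBuilder("java", "MyTask");
    pb.environment().put("CLASSPATH", System.getProperty("java.class.path"));
    pb.inheritIO();
    // children forked by MyTask inherit CLASSPATH too, unlike a -classpath argument
    pb.start().waitFor();
  }
}
{code}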

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (MAPREDUCE-5202) Revert MAPREDUCE-4397 to avoid using incorrect config files

2013-05-01 Thread Owen O'Malley (JIRA)
Owen O'Malley created MAPREDUCE-5202:


 Summary: Revert MAPREDUCE-4397 to avoid using incorrect config 
files
 Key: MAPREDUCE-5202
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5202
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Reporter: Owen O'Malley
Assignee: Owen O'Malley


MAPREDUCE-4397 added the capability to switch the location of the 
taskcontroller.cfg file, which weakens security.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (MAPREDUCE-5202) Revert MAPREDUCE-4397 to avoid using incorrect config files

2013-05-01 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley resolved MAPREDUCE-5202.
--

Resolution: Fixed

I reverted the previous patch on branch-1, branch-1.1, and branch-1.2.

 Revert MAPREDUCE-4397 to avoid using incorrect config files
 ---

 Key: MAPREDUCE-5202
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5202
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Reporter: Owen O'Malley
Assignee: Owen O'Malley

 MAPREDUCE-4397 added the capability to switch the location of the 
 taskcontroller.cfg file, which weakens security.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


Re: Release numbering for branch-2 releases

2013-02-04 Thread Owen O'Malley
I think that using -(alpha,beta) tags on the release versions is a really
bad idea. All releases should follow the strictly numeric
(Major.Minor.Patch) pattern that we've used for all of the releases except
the 2.0.x ones.

-- Owen


On Mon, Feb 4, 2013 at 11:53 AM, Stack st...@duboce.net wrote:

 On Mon, Feb 4, 2013 at 10:46 AM, Arun C Murthy a...@hortonworks.com
 wrote:

  Would it better to have 2.0.3-alpha, 2.0.4-beta and then make 2.1 as a
  stable release? This way we just have one series (2.0.x) which is not
  suitable for general consumption.
 
 

 That contains the versioning damage to the 2.0.x set.  This is an
 improvement over the original proposal, where we let the versioning mayhem
 run out to 2.3.

 Thanks Arun,
 St.Ack



[jira] [Created] (MAPREDUCE-4601) Windows CMD processor doesn't use double quotes

2012-08-28 Thread Owen O'Malley (JIRA)
Owen O'Malley created MAPREDUCE-4601:


 Summary: Windows CMD processor doesn't use double quotes
 Key: MAPREDUCE-4601
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4601
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Affects Versions: 1-win
Reporter: Owen O'Malley
Assignee: Owen O'Malley


Currently, the task launch script under Windows matches Linux and double quotes 
all of the values. Unfortunately, the Windows CMD processor doesn't need the 
double quotes and doesn't strip them. The main symptom of this is that the 
CLASSPATH loses the first and last entries.

{code}
set CLASSPATH="c:\foo;c:\bar;c:\baz"
{code}

results in having '"c:\foo', 'c:\bar' and 'c:\baz"' on the classpath. Of those 
three, only 'c:\bar' is valid. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (MAPREDUCE-4505) Create a combiner bypass path for keys with a single value

2012-08-01 Thread Owen O'Malley (JIRA)
Owen O'Malley created MAPREDUCE-4505:


 Summary: Create a combiner bypass path for keys with a single value
 Key: MAPREDUCE-4505
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4505
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: performance, task
Reporter: Owen O'Malley


It would help optimize a lot of cases where there aren't a lot of replicated 
keys if the framework would bypass the deserialize/combiner/serialize step for 
keys that only have a single value.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (MAPREDUCE-4232) Make the distributed cache tests easier to diagnose

2012-05-08 Thread Owen O'Malley (JIRA)
Owen O'Malley created MAPREDUCE-4232:


 Summary: Make the distributed cache tests easier to diagnose
 Key: MAPREDUCE-4232
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4232
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: distributed-cache, test
Reporter: Owen O'Malley
Assignee: Owen O'Malley


We currently require that the test environment:

* Have umask of 0022.
* Have a world readable basedir (including parents)

It would be good to check for those before bothering to run tests.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




Re: Reduce output is strange

2012-04-03 Thread Owen O'Malley
On Tue, Apr 3, 2012 at 8:01 AM, Pedro Costa psdc1...@gmail.com wrote:
 If I want to compare 2 sequence files to see if they are the same, how do I
 compare?

From the command line, you can textify the files with:

hadoop fs -text myfile.seq

Of course, if you are using API you can iterate through the two
Sequence files and compare them row by row.
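
A minimal sketch of that row-by-row comparison, using the old-style reader
constructor (the class name and error handling are illustrative only):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Writable;
  import org.apache.hadoop.util.ReflectionUtils;

  public class SeqFileCompare {
    public static boolean sameContents(Configuration conf, Path a, Path b)
        throws Exception {
      FileSystem fs = FileSystem.get(conf);
      SequenceFile.Reader ra = new SequenceFile.Reader(fs, a, conf);
      SequenceFile.Reader rb = new SequenceFile.Reader(fs, b, conf);
      try {
        Writable ka = (Writable) ReflectionUtils.newInstance(ra.getKeyClass(), conf);
        Writable va = (Writable) ReflectionUtils.newInstance(ra.getValueClass(), conf);
        Writable kb = (Writable) ReflectionUtils.newInstance(rb.getKeyClass(), conf);
        Writable vb = (Writable) ReflectionUtils.newInstance(rb.getValueClass(), conf);
        while (true) {
          boolean moreA = ra.next(ka, va);
          boolean moreB = rb.next(kb, vb);
          if (moreA != moreB) return false;   // different number of rows
          if (!moreA) return true;            // both exhausted, all rows matched
          if (!ka.equals(kb) || !va.equals(vb)) return false;
        }
      } finally {
        ra.close();
        rb.close();
      }
    }
  }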

-- Owen


Re: Reduce output is strange

2012-04-03 Thread Owen O'Malley
On Tue, Apr 3, 2012 at 8:25 AM, Pedro Costa psdc1...@gmail.com wrote:
 What I want to ask is:

 - how do I read the values from sequence files that are block, or record
 compressed, or uncompressed?

You use the SequenceFile.Reader class.

 - how do I know if the sequence file is block compressed, record
 compressed, or uncompressed?

You use the SequenceFile.Reader class.


 - how do I know if it's a sequence file or a Textfile?

SequenceFiles always start with 'SEQ' followed by the version byte in the first 4 bytes.
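
As a sketch, you can sniff those magic bytes yourself (the class name is
just an example):

  import java.io.DataInputStream;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class SeqMagic {
    public static boolean isSequenceFile(Configuration conf, Path p)
        throws Exception {
      DataInputStream in = FileSystem.get(conf).open(p);
      try {
        byte[] magic = new byte[3];
        in.readFully(magic);   // a SequenceFile starts with 'S' 'E' 'Q'
        return magic[0] == 'S' && magic[1] == 'E' && magic[2] == 'Q';
      } finally {
        in.close();
      }
    }
  }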

-- Owen


Re: [RESULT] - [VOTE] Rename hadoop branches post hadoop-1.x

2012-03-29 Thread Owen O'Malley
On Wed, Mar 28, 2012 at 5:11 PM, Doug Cutting cutt...@apache.org wrote:

 On 03/28/2012 12:39 PM, Owen O'Malley wrote:
  [ ... ] So the RM of the 2 branch needs to make the call of what
  should be 2.1 vs 3.0.

 I thought these were community decisions, not RM decisions, no?


 What to release is the RM's decision and then voted on by the community.
We tried voting on which features to include and it led to no releases for
two years. I think our users are better served by having good usable
releases.

-- Owen


Re: [RESULT] - [VOTE] Rename hadoop branches post hadoop-1.x

2012-03-28 Thread Owen O'Malley
I disagree. Trunk should become branch-3 once someone wants to start
stabilizing it. Arun is going to need the minor versions for when he adds
features.

X.Y.Z

Z = bug fixes
Y = minor release (compatible, adds features)
X = major release (incompatible)

So from branch-2 will come branch-2.0 with tags for 2.0.0, 2.0.1. New
features will go into branch-2, which will become branch-2.1, branch-2.2,
and so on.

-- Owen


Re: [RESULT] - [VOTE] Rename hadoop branches post hadoop-1.x

2012-03-28 Thread Owen O'Malley
On Wed, Mar 28, 2012 at 12:32 PM, Todd Lipcon t...@cloudera.com wrote:

But new features also go to trunk. And if none of our new features are
 incompatible, why do we anticipate that trunk is 3.0?


Let's imagine that we already had a 2.0.0 release. Now we want to add
features like HA. The only place to put that is in 2.1.0. On the other
hand, you don't want to pull *ALL* of the changes from trunk. That is way
too much scope. So the RM of the 2 branch needs to make the call of what
should be 2.1 vs 3.0.

-- Owen


Re: svn commit: r1304067 - in /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project: ./ bin/ conf/ hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/resources/ hadoop-mapreduce-exam

2012-03-22 Thread Owen O'Malley
To me, I'd much, much rather have the human-readable description of what is
being fixed, and I couldn't care less about which subversion commit it
corresponds to. I'd be all for using the CHANGES.txt description as the
commit message for both trunk and the branches.

-- Owen


[jira] [Created] (MAPREDUCE-3773) Add queue metrics with buckets for job run times

2012-01-31 Thread Owen O'Malley (Created) (JIRA)
Add queue metrics with buckets for job run times


 Key: MAPREDUCE-3773
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3773
 Project: Hadoop Map/Reduce
  Issue Type: New Feature
  Components: jobtracker
Reporter: Owen O'Malley
Assignee: Owen O'Malley


It would be nice to have queue metrics that reflect the number of jobs in each 
queue that have been running for different ranges of time.

Reasonable time ranges are probably 0-1 hr, 1-5 hr, 5-24 hr, 24+ hrs; but they 
should be configurable.
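
A hedged sketch of the bucketing itself (thresholds hardcoded only for 
illustration; as noted above they should be configurable):

{code}
public class RunTimeBuckets {
  // bucket boundaries in hours: 0-1, 1-5, 5-24, 24+
  private static final long[] BOUNDS_HOURS = {1, 5, 24};

  static int bucketFor(long runMillis) {
    long hours = runMillis / (60L * 60 * 1000);
    for (int i = 0; i < BOUNDS_HOURS.length; i++) {
      if (hours < BOUNDS_HOURS[i]) return i;
    }
    return BOUNDS_HOURS.length;   // the 24+ hrs bucket
  }

  public static void main(String[] args) {
    System.out.println(bucketFor(90L * 60 * 1000));   // 1.5 hrs -> bucket 1
  }
}
{code}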

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (MAPREDUCE-3495) Remove my personal email address from the pipes build file.

2011-12-01 Thread Owen O'Malley (Created) (JIRA)
Remove my personal email address from the pipes build file.
---

 Key: MAPREDUCE-3495
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3495
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: build
Reporter: Owen O'Malley
Assignee: Owen O'Malley


When I first wrote the pipes autoconf/automake stuff, I incorrectly put my 
email address in the AC_INIT line, which means if something goes wrong, you get:

{quote}
 configure: WARNING: ## Report this to my-email ##
{quote}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (MAPREDUCE-2977) ResourceManager needs to renew and cancel tokens associated with a job

2011-09-09 Thread Owen O'Malley (JIRA)
ResourceManager needs to renew and cancel tokens associated with a job
--

 Key: MAPREDUCE-2977
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2977
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 0.23.0
Reporter: Owen O'Malley
Priority: Blocker


The JobTracker currently manages tokens for the applications and the resource 
manager needs the same functionality.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (MAPREDUCE-2946) TaskTrackers fail at startup

2011-09-07 Thread Owen O'Malley (JIRA)
TaskTrackers fail at startup


 Key: MAPREDUCE-2946
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2946
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: tasktracker
Affects Versions: 0.20.205.0
Reporter: Owen O'Malley
 Fix For: 0.20.205.0


Upgrading from 0.20.204.0 to 0.20.205.0-SNAPSHOT, the TaskTrackers refused to 
start because the cleanup failed. I was able to start the task trackers by 
deleting the mapred localdirs across the cluster.

I was running with the linux task controller and security turned on.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Resolved] (MAPREDUCE-2946) TaskTrackers fail at startup

2011-09-07 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley resolved MAPREDUCE-2946.
--

Resolution: Invalid

I forgot to chmod the task-controller to setuid. Sorry for the noise.

 TaskTrackers fail at startup
 

 Key: MAPREDUCE-2946
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2946
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: tasktracker
Affects Versions: 0.20.205.0
Reporter: Owen O'Malley
 Fix For: 0.20.205.0


 Upgrading from 0.20.204.0 to 0.20.205.0-SNAPSHOT, the TaskTrackers refused to 
 start because the cleanup failed. I was able to start the task trackers by 
 deleting the mapred localdirs across the cluster.
 I was running with the linux task controller and security turned on.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




Re: [jira] [Created] (MAPREDUCE-2911) Hamster: Hadoop And Mpi on the same cluSTER

2011-09-01 Thread Owen O'Malley
On Wed, Aug 31, 2011 at 7:22 PM, Josh Patterson j...@cloudera.com wrote:

 Do we have a list of all MR2 frameworks being worked on currently
 beyond MPI and Spark?


Giraph is also going to port over:

https://issues.apache.org/jira/browse/GIRAPH-13

-- Owen


[jira] [Resolved] (MAPREDUCE-1943) Implement limits on per-job JobConf, Counters, StatusReport, Split-Sizes

2011-08-25 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley resolved MAPREDUCE-1943.
--

Resolution: Fixed

 Implement limits on per-job JobConf, Counters, StatusReport, Split-Sizes
 

 Key: MAPREDUCE-1943
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1943
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Reporter: Mahadev konar
Assignee: Mahadev konar
 Fix For: 0.20.203.0

 Attachments: MAPREDUCE-1943-0.20-yahoo.patch, 
 MAPREDUCE-1943-0.20-yahoo.patch, MAPREDUCE-1943-yahoo-hadoop-0.20S-fix.patch, 
 MAPREDUCE-1943-yahoo-hadoop-0.20S-fix.patch, 
 MAPREDUCE-1943-yahoo-hadoop-0.20S-fix.patch, 
 MAPREDUCE-1943-yahoo-hadoop-0.20S-fix.patch, 
 MAPREDUCE-1943-yahoo-hadoop-0.20S-fix.patch, 
 MAPREDUCE-1943-yahoo-hadoop-0.20S-fix.patch, 
 MAPREDUCE-1943-yahoo-hadoop-0.20S.patch


 We have come across issues in production clusters wherein users abuse 
 counters, statusreport messages and split sizes. One such case was when one 
 of the users had 100 million counters. This leads to jobtracker going out of 
 memory and being unresponsive. In this jira I am proposing to put sane limits 
 on the status report length, the number of counters and the size of block 
 locations returned by the input split. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Resolved] (MAPREDUCE-2846) a small % of all tasks fail with DefaultTaskController

2011-08-24 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley resolved MAPREDUCE-2846.
--

   Resolution: Fixed
Fix Version/s: 0.23.0
   0.20.204.0
 Hadoop Flags: [Reviewed]

I just committed this.

 a small % of all tasks fail with DefaultTaskController
 --

 Key: MAPREDUCE-2846
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2846
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: task, task-controller, tasktracker
Affects Versions: 0.20.204.0
Reporter: Allen Wittenauer
Assignee: Owen O'Malley
Priority: Blocker
 Fix For: 0.20.204.0, 0.23.0

 Attachments: sync-trunk.patch, sync.patch


 After upgrading our test 0.20.203 grid to 0.20.204-rc2, we ran terasort to 
 verify operation.  While the job completed successfully, approx 10% of the 
 tasks failed with task runner execution errors and the inability to create 
 symlinks for attempt logs.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Resolved] (MAPREDUCE-2688) rpm should only require the same major version as common and hdfs

2011-07-21 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley resolved MAPREDUCE-2688.
--

   Resolution: Duplicate
Fix Version/s: 0.23.0

This was fixed by HDFS-2156.

 rpm should only require the same major version as common and hdfs
 -

 Key: MAPREDUCE-2688
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2688
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Reporter: Owen O'Malley
 Fix For: 0.23.0


 The rpm should only require the same version of common and hdfs be installed.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (MAPREDUCE-2688) rpm should only require the same major version as common and hdfs

2011-07-15 Thread Owen O'Malley (JIRA)
rpm should only require the same major version as common and hdfs
-

 Key: MAPREDUCE-2688
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2688
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Reporter: Owen O'Malley


The rpm should only require the same version of common and hdfs be installed.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




Re: mappers

2011-06-26 Thread Owen O'Malley
Look in the job history file. It has a line for each event of the job
including task start and finish.

-- Owen

On Jun 26, 2011, at 2:17 AM, Keren Ouaknine ker...@gmail.com wrote:

 Hello,

 I am looking for the actual number of mappers on each machine for the job. I
 know how to configure the max number (mapred.tasktracker.map.tasks.maximum
 in mapred-site.xml file), but not the actual number of mappers that were
 running for a completed job.

 Any idea where can I find this data?
 Thanks,
 Keren

 --
 Keren Ouaknine
 Cell: +972 54 2565404
 Web: www.kereno.com


[jira] [Resolved] (MAPREDUCE-587) Stream test TestStreamingExitStatus fails with Out of Memory

2011-06-10 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley resolved MAPREDUCE-587.
-

   Resolution: Fixed
Fix Version/s: 0.23.0

This is already committed to trunk.

 Stream test TestStreamingExitStatus fails with Out of Memory
 

 Key: MAPREDUCE-587
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-587
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: contrib/streaming
 Environment: OS/X, 64-bit x86 imac, 4GB RAM.
Reporter: Steve Loughran
Assignee: Amar Kamat
Priority: Minor
 Fix For: 0.23.0

 Attachments: MAPREDUCE-587-v1.0.patch, mr-587-yahoo-y20-v1.0.patch, 
 mr-587-yahoo-y20-v1.1.patch


 contrib/streaming tests are failing a test with an Out of Memory error on an 
 OS/X Mac -same problem does not surface on Linux.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (MAPREDUCE-2506) Create a compatible interface for frameworks that need to clone MapReduce context objects.

2011-05-17 Thread Owen O'Malley (JIRA)
Create a compatible interface for frameworks that need to clone MapReduce 
context objects.
--

 Key: MAPREDUCE-2506
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2506
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Reporter: Owen O'Malley


In 0.21 we moved the org.apache.hadoop.mapreduce context objects to interfaces.

That made the APIs much better, but broke backwards compatibility for 
frameworks that need to clone them. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] Created: (MAPREDUCE-2359) Distributed cache doesn't use non-default FileSystems correctly

2011-03-07 Thread Owen O'Malley (JIRA)
Distributed cache doesn't use non-default FileSystems correctly
---

 Key: MAPREDUCE-2359
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2359
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Reporter: Owen O'Malley
Assignee: Krishna Ramachandran
 Fix For: 0.20.100


We are passing fs.default.name as viewfs:/// in core-site.xml on the oozie server.
We have the default name node in the configuration also as viewfs:///.

We are using an hdfs:// path in our path for the application.
It's giving the following error:

IllegalArgumentException: Wrong FS:
hdfs://nn/user/strat_ci/oozie-oozi/002-110217014830452-oozie-oozi-W/hadoop1--map-reduce/map-reduce-launcher.jar,
expected: viewfs:/

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] Created: (MAPREDUCE-2360) Pig fails when using non-default FileSystem

2011-03-07 Thread Owen O'Malley (JIRA)
Pig fails when using non-default FileSystem
---

 Key: MAPREDUCE-2360
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2360
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: client
Reporter: Owen O'Malley
 Fix For: 0.20.100


The job client strips the file system from the user's job jar, which causes 
breakage when it isn't the default file system.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] Created: (MAPREDUCE-2361) Distributed Cache is not adding files to class paths correctly

2011-03-07 Thread Owen O'Malley (JIRA)
Distributed Cache is not adding files to class paths correctly
--

 Key: MAPREDUCE-2361
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2361
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: task
Reporter: Owen O'Malley
Assignee: Chris Douglas


I am trying to add files into the class path using: 
DistributedCache.addFileToClassPath

If the file path is a relative path like /user/dir1/dir2/a.jar, everything is 
OK: if I try to get these files from the class path using 
DistributedCache.getFileClassPaths, it returns the path correctly.

However, if I use a path like hdfs://nn:7877/user/dir1/dir2/a.jar 
and try to get the class path files using DistributedCache.getFileClassPaths, 
it returns 3 entries:

hdfs
//nn
7877/user/dir1/dir2/a.jar
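
A plausible minimal reproduction of that splitting, assuming the class-path 
entries are stored as one ':'-separated string (which is what the three 
fragments above suggest):

{code}
public class ColonSplit {
  public static void main(String[] args) {
    String entry = "hdfs://nn:7877/user/dir1/dir2/a.jar";
    // splitting a stored class-path string on ':' breaks the URI apart
    for (String part : entry.split(":")) {
      System.out.println(part);   // prints "hdfs", "//nn", "7877/user/dir1/dir2/a.jar"
    }
  }
}
{code}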


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] Created: (MAPREDUCE-2362) Unit test failures: TestBadRecords and TestTaskTrackerMemoryManager

2011-03-07 Thread Owen O'Malley (JIRA)
Unit test failures: TestBadRecords and TestTaskTrackerMemoryManager
---

 Key: MAPREDUCE-2362
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2362
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: test
Reporter: Owen O'Malley
Assignee: Greg Roelofs
 Fix For: 0.20.100


Fix unit-test failures: TestBadRecords (NPE due to rearranged MapTask code) and 
TestTaskTrackerMemoryManager (need hostname in output-string pattern).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] Created: (MAPREDUCE-2363) Bad error messages for queues without acls

2011-03-07 Thread Owen O'Malley (JIRA)
Bad error messages for queues without acls
--

 Key: MAPREDUCE-2363
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2363
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: contrib/capacity-sched
Reporter: Owen O'Malley
Assignee: Dick King
 Fix For: 0.20.100


When a queue is built without any access rights, the error message is very bad.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] Created: (MAPREDUCE-2364) Shouldn't hold lock on rjob while localizing resources.

2011-03-07 Thread Owen O'Malley (JIRA)
Shouldn't hold lock on rjob while localizing resources.
---

 Key: MAPREDUCE-2364
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2364
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: tasktracker
Affects Versions: 0.20.100
Reporter: Owen O'Malley
Assignee: Devaraj Das
 Fix For: 0.20.100


There is a deadlock while localizing resources on the TaskTracker.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] Created: (MAPREDUCE-2365) Add counters for FileInputFormat (BYTES_READ) and FileOutputFormat (BYTES_WRITTEN)

2011-03-07 Thread Owen O'Malley (JIRA)
Add counters for FileInputFormat (BYTES_READ) and FileOutputFormat 
(BYTES_WRITTEN)
--

 Key: MAPREDUCE-2365
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2365
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Reporter: Owen O'Malley


MAP_INPUT_BYTES and MAP_OUTPUT_BYTES will be computed using the difference 
between FileSystem
counters before and after each next(K,V) and collect/write op.

In case compression is being used, these counters will represent the compressed 
data sizes. The uncompressed size will
not be available.

This is not a direct back-port of 5710. (Counters will be computed in MapTask 
instead of in individual RecordReaders).

0.20.100 -
   New API - MAP_INPUT_BYTES will be computed using this method
   Old API - MAP_INPUT_BYTES will remain unchanged.

0.23 -
   New API - MAP_INPUT_BYTES will be computed using this method
   Old API - MAP_INPUT_BYTES likely to use this method
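
A hedged sketch of the counter technique described above, taking FileSystem 
statistics deltas around a record read (the read itself is elided; the 
statistics API calls are real, the surrounding class is illustrative):

{code}
import org.apache.hadoop.fs.FileSystem;

public class InputBytesDelta {
  static long totalBytesRead() {
    long total = 0;
    for (FileSystem.Statistics stats : FileSystem.getAllStatistics()) {
      total += stats.getBytesRead();
    }
    return total;
  }

  public static void main(String[] args) {
    long before = totalBytesRead();
    // ... perform one next(K,V) record read here ...
    long after = totalBytesRead();
    long mapInputBytes = after - before;   // compressed size if compression is on
    System.out.println(mapInputBytes);
  }
}
{code}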


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] Created: (MAPREDUCE-2366) TaskTracker can't retrieve stdout and stderr from web UI

2011-03-07 Thread Owen O'Malley (JIRA)
TaskTracker can't retrieve stdout and stderr from web UI


 Key: MAPREDUCE-2366
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2366
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: tasktracker
Reporter: Owen O'Malley
Assignee: Dick King
 Fix For: 0.20.100


Problem where the task browser UI can't retrieve the stdxxx printouts of 
streaming jobs that abend in the unix code, in the common case where the 
containing job doesn't reuse JVM's.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] Created: (MAPREDUCE-2355) Add an out of band heartbeat damper

2011-03-04 Thread Owen O'Malley (JIRA)
Add an out of band heartbeat damper
---

 Key: MAPREDUCE-2355
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2355
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: jobtracker
Reporter: Owen O'Malley
Assignee: Arun C Murthy


We should have a configurable knob to throttle how many out of band heartbeats 
are sent.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Created: (MAPREDUCE-2357) When extending inputsplit (non-FileSplit), all exceptions are ignored

2011-03-04 Thread Owen O'Malley (JIRA)
When extending inputsplit (non-FileSplit), all exceptions are ignored
-

 Key: MAPREDUCE-2357
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2357
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: task
Reporter: Owen O'Malley
Assignee: Luke Lu
 Fix For: 0.20.100


if you're using a custom RecordReader/InputFormat setup and using an
InputSplit that does NOT extend FileSplit, then any exceptions you throw in 
your RecordReader.nextKeyValue() function
are silently ignored.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Created: (MAPREDUCE-2358) MapReduce assumes HDFS as the default filesystem

2011-03-04 Thread Owen O'Malley (JIRA)
MapReduce assumes HDFS as the default filesystem


 Key: MAPREDUCE-2358
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2358
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Reporter: Owen O'Malley
Assignee: Krishna Ramachandran
 Fix For: 0.20.100


Mapred assumes hdfs as the default fs even when defined otherwise.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Created: (MAPREDUCE-2262) Capacity Scheduler unit tests fail with class not found

2011-01-13 Thread Owen O'Malley (JIRA)
Capacity Scheduler unit tests fail with class not found
---

 Key: MAPREDUCE-2262
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2262
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: contrib/capacity-sched
Reporter: Owen O'Malley
Assignee: Owen O'Malley
 Fix For: 0.20.3


Currently the ivy.xml file for the capacity scheduler doesn't include 
commons-cli, leading to class-not-found exceptions.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Where ask questions about MapReduce source code?

2010-12-22 Thread Owen O'Malley
On Wed, Dec 22, 2010 at 3:29 PM, Pedro Costa psdc1...@gmail.com wrote:

 Hi,

 I would like to understand some parts of the Map Reduce  source code,
 and I don't know where to ask. Should I ask here?


Yes.


[jira] Created: (MAPREDUCE-2188) The new API MultithreadedMapper doesn't call the initialize method of the RecordReader

2010-11-15 Thread Owen O'Malley (JIRA)
The new API MultithreadedMapper doesn't call the initialize method of the 
RecordReader
--

 Key: MAPREDUCE-2188
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2188
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Reporter: Owen O'Malley
Assignee: Owen O'Malley
 Fix For: 0.22.0


The wrapping RecordReader in the Multithreaded Mapper is never initialized. 
With HADOOP-6685, this becomes a problem because the ReflectionUtils.copy 
requires a non-null configuration.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Trunk build failing with ivy errors

2010-11-10 Thread Owen O'Malley
On Wed, Nov 10, 2010 at 3:48 PM, Todd Lipcon t...@cloudera.com wrote:

 Tom has discovered that bumping the log4j version to 1.2.16 instead of
 1.2.15 fixes the issue...

 should we just do that?


I think that sounds reasonable.

-- Owen


[jira] Resolved: (MAPREDUCE-2164) MapredTestDriver.java compilation fails on trunk

2010-10-28 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley resolved MAPREDUCE-2164.
--

Resolution: Cannot Reproduce

It compiles for me at this point. If it still fails for you, please reopen.

 MapredTestDriver.java compilation fails on trunk
 

 Key: MAPREDUCE-2164
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2164
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: test
Affects Versions: 0.22.0
Reporter: Giridharan Kesavan
Priority: Critical

 compile-mapred-test:
 [mkdir] Created dir: 
 /grid/0/hudson/hudson-slave/workspace/Hadoop-Mapreduce-trunk-Commit/trunk/build/test/mapred/classes
 [mkdir] Created dir: 
 /grid/0/hudson/hudson-slave/workspace/Hadoop-Mapreduce-trunk-Commit/trunk/build/test/mapred/testjar
 [mkdir] Created dir: 
 /grid/0/hudson/hudson-slave/workspace/Hadoop-Mapreduce-trunk-Commit/trunk/build/test/mapred/testshell
  
 [javac] Compiling 319 source files to 
 /grid/0/hudson/hudson-slave/workspace/Hadoop-Mapreduce-trunk-Commit/trunk/build/test/mapred/classes
 [javac] 
 /grid/0/hudson/hudson-slave/workspace/Hadoop-Mapreduce-trunk-Commit/trunk/src/test/mapred/org/apache/hadoop/test/MapredTestDriver.java:21:
  cannot find symbol
 [javac] symbol  : class TestSequenceFile
 [javac] location: package org.apache.hadoop.io
 [javac] import org.apache.hadoop.io.TestSequenceFile; 
 [javac]^
 [javac] 
 /grid/0/hudson/hudson-slave/workspace/Hadoop-Mapreduce-trunk-Commit/trunk/src/test/mapred/org/apache/hadoop/test/MapredTestDriver.java:59:
  cannot find symbol
 [javac] symbol  : class TestSequenceFile
 [javac] location: class org.apache.hadoop.test.MapredTestDriver
 [javac]   pgd.addClass("testsequencefile", TestSequenceFile.class, 
 [javac]^
 [javac] Note: Some input files use or override a deprecated API.
 [javac] Note: Recompile with -Xlint:deprecation for details.
 [javac] 2 errors

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: hadoop.job.ugi backwards compatibility

2010-09-14 Thread Owen O'Malley

On Sep 13, 2010, at 4:23 PM, Todd Lipcon wrote:

I agree that keeping API compatibility for UGI was probably impossible, and
respect that. But it would certainly be very easy to do a patch like the
following:

JobClient(Configuration conf) {
  if (conf.get("hadoop.job.ugi") != null &&
      !UserGroupInformation.isSecurityEnabled()) {
    LOG.warn("Stop being evil. Don't use hadoop.job.ugi! RAAWR");
    UserGroupInformation.createRemoteUser(...).doAs() { create proxy }
  } else {
    create normal RPC proxy;
  }
}


My problem is threefold:
  1. It isn't one or two spots. It is a *lot* of spots. Doing it  
inconsistently would be far worse than useless.
  2. Having two different authentication paths dramatically increases  
the chance for bugs.
  3. The previously mentioned badness where the api semantics  
dramatically change with the value of a config variable that isn't  
there to enable backwards compatibility.


Furthermore, the upside is really small, consisting only of the users
that have:

  1. developed internal servers that handle multiple users.
  2. on hadoop 0.20
  3. never plan on turning on security
  4. are interested in moving to 0.21 or 0.22
  5. aren't willing to do the straightforward fixes to their code.

-- Owen


Re: hadoop.job.ugi backwards compatibility

2010-09-13 Thread Owen O'Malley
Moving the discussion over to the more appropriate mapreduce-dev.

On Mon, Sep 13, 2010 at 9:08 AM, Todd Lipcon t...@cloudera.com wrote:

 1) Groups resolution happens on the server side, where it used to happen on
 the client. Thus, all Hadoop users must exist on the NN/JT machines in order
 for group mapping to succeed (or the user must write a custom group mapper).

There is a plugin that performs the group lookup. See HADOOP-4656.
There is no requirement for having the user accounts on the NN/JT
although that is the easiest approach. It is not recommended that the
users be allowed to login.

I think it is important that turning security on and off doesn't
drastically change the semantics or protocols. That will become much
much harder to support downstream.

 2) The hadoop.job.ugi parameter is ignored - instead the user has to use the
 new UGI.createRemoteUser(foo).doAs() API, even in simple security.

User code that counts on hadoop.job.ugi working will be horribly
broken once you turn on security. Turning on and off security should
not involve testing all of your applications. It is unfortunate that
we ever used the configuration value as the user, but continuing to
support it will make our user's code much much more brittle.

-- Owen


Re: hadoop.job.ugi backwards compatibility

2010-09-13 Thread Owen O'Malley
On Mon, Sep 13, 2010 at 10:05 AM, Todd Lipcon t...@cloudera.com wrote:

 This is not MR-specific, since the strangely named hadoop.job.ugi determines
 HDFS permissions as well.

Yeah, after I hit send, I realized that I should have used common-dev.
This is really a dev issue.

 "or the user must write a custom group mapper" above refers to this plugin
 capability. But I think most users do not want to spend the time to write
 (or even set up) such a plugin beyond the default shell-based mapping
 service.

Sure, which is why it is easiest to just have the (hopefully disabled)
user accounts on the jt/nn. Any installs > 100 nodes should be using
HADOOP-6864 to avoid the fork in the JT/NN.

 As someone who spends an awful lot of time doing downstream support of lots
 of different clusters, I actually disagree.

Normal applications never need to do doAs. They run as the default
user. This only comes up in servers that deal with multiple users. In
*that* context, it sucks having servers that only work in non-secure
mode. If some server X only works without security that sucks. Doing
doAs isn't harder, it is just different. Having two different
semantics models *will* cause lots of grief.

-- Owen


Re: hadoop.job.ugi backwards compatibility

2010-09-13 Thread Owen O'Malley
On Mon, Sep 13, 2010 at 11:10 AM, Todd Lipcon t...@cloudera.com wrote:
 Yep, but there are plenty of 10 node clusters out there that do important
 work at small startups or single-use-case installations, too. We need to
 provide scalability and security features that work for the 100+ node
 clusters but also not leave the beginners in the dust.

10 node clusters are an important use case, but creating the user
accounts on those clusters is very easy because of the few users.
Furthermore, if the accounts aren't there, it just means the users have
no groups, which, for a single-use system with security turned off,
isn't the end of the world.

 But I think there are plenty of people out there who have built small
 webapps, shell scripts, cron jobs, etc that use hadoop.job.ugi on some
 shared account to impersonate other users.

I'd be surprised. At Yahoo, the primary problem came from people
screen scraping the jobtracker http pages. With security turned off,
that isn't an issue. Again, it isn't hard; it's just that the evolving
UserGroupInformation interface changed. With security, we tried really
hard to maintain backwards compatibility and succeeded for the vast
(99%+) majority of the users.

 Perhaps I am estimating
 incorrectly - that's why I wanted this discussion on a user-facing list
 rather than a dev-facing list.

Obviously the pointer is there for them to follow into the rabbit hole
of the dev lists. *grin*

 Another example use case that I do a lot on non-secure clusters is: hadoop
 fs -Dhadoop.job.ugi=hadoop,hadoop <something I want to do as a superuser>.
 The permissions model we have in 0.20 obviously isn't secure, but it's nice
 to avoid accidental mistakes, and making it easy to sudo like that is
 handy.

It might make sense to add a new switch (-user?) to hadoop fs that
does a doAs before running the shell command. You could even make it
fancy and try to be a proxy user if security is turned on.
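
A rough sketch of what that switch could do, assuming a hypothetical
-user flag has already been parsed off the argument list:

import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.fs.FsShell;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.util.ToolRunner;

// Run the remaining shell arguments as the requested user.
final String[] shellArgs = {"-ls", "/user/hadoop"};
UserGroupInformation ugi = UserGroupInformation.createRemoteUser("hadoop");
int exitCode = ugi.doAs(new PrivilegedExceptionAction<Integer>() {
  public Integer run() throws Exception {
    return ToolRunner.run(new FsShell(), shellArgs);
  }
});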

 Regardless of our particular opinions, isn't our policy that we cannot break
 API compatibility between versions without a one-version deprecation period?

There wasn't a way to keep UGI stable. It was a broken design before
the security work. It is marked evolving so we try to minimize
breakage, but it isn't prohibited.

-- Owen


[jira] Resolved: (MAPREDUCE-2046) A input split cannot be less than a dfs block

2010-08-31 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley resolved MAPREDUCE-2046.
--

Resolution: Cannot Reproduce

This isn't true. InputSplits can be arbitrarily sized by the InputFormat. With 
mapred.TextInputFormat, if you set the number of maps very high, you will 
generate a large number of maps. In the new mapreduce.lib.input.TextInputFormat, 
there are knobs that set the minimum and maximum split size.
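
For example, a sketch of those knobs in the new API (the job name and
sizes below are arbitrary):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

Configuration conf = new Configuration();
Job job = new Job(conf, "small-splits");
// Cap each split at 32 MB so one dfs block can yield several maps.
FileInputFormat.setMinInputSplitSize(job, 1L);
FileInputFormat.setMaxInputSplitSize(job, 32L * 1024 * 1024);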

 A input split cannot be less than a dfs block 
 --

 Key: MAPREDUCE-2046
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2046
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Reporter: Namit Jain

 I ran into this while testing some hive features.
 Whether we use hiveinputformat or combinehiveinputformat, a split cannot be 
 less than a dfs block size.
 This is a problem if we want to increase the block size for older data to 
 reduce memory consumption for the
 name node.
 It would be useful if the input split was independent of the dfs block size.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (MAPREDUCE-2007) Is it possible that use ArrayList or other type instead Iterable when use reduce(Object, Iterable, Context)?

2010-08-12 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley resolved MAPREDUCE-2007.
--

Resolution: Won't Fix

The framework can't assume that all of the values fit into memory, so it is not 
possible to make the API require a List object.

If you are just counting values, you should consider replacing each value with 
an integer and implementing a combiner that adds the counts together. It will be 
much more efficient. Look at the word count example to see how to do this.

If you just need the first N values, just iterate through the values you need 
and return from the reduce method. There is no need to exhaust the iterator.
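
A minimal sketch of the counting pattern described above; the class
name is illustrative, and the same class is set as both combiner and
reducer so the map side pre-aggregates the counts:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class CountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;                       // add up the 1s emitted by the map
    for (IntWritable value : values) {
      sum += value.get();
    }
    context.write(key, new IntWritable(sum));
  }
}

// job.setCombinerClass(CountReducer.class);
// job.setReducerClass(CountReducer.class);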

 Is it possible that use ArrayList or other type  instead Iterable  when use 
 reduce(Object, Iterable, Context)?
 --

 Key: MAPREDUCE-2007
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2007
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Affects Versions: 0.20.2
Reporter: Hui Wen Han
 Fix For: 0.20.2


 1) Sometimes we only need the element count of the input values of a 
 Reducer task, but we have to iterate over all the input values to 
 calculate it.
 2) Sometimes we only need a few elements (for example top n, last n, or 
 random) from the input values of a Reducer task; if we could use an 
 ArrayList or another type instead of Iterable in 
 reduce(Object, Iterable, Context), it would be more convenient.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (MAPREDUCE-1786) Add method to support pre-partitioned data

2010-05-12 Thread Owen O'Malley (JIRA)
Add method to support pre-partitioned data
--

 Key: MAPREDUCE-1786
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1786
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: task
Reporter: Owen O'Malley
Assignee: Owen O'Malley


There are some applications where the map wants to partition the data itself. 
This happens in Pipes, if the user has a C++ partitioner. It would make sense 
to support it in streaming too. There is also use case where the Java 
partitioner needs the context object to update counters, etc.

This jira is only about adding the method to the mapreduce Java API. The Pipes 
interface can be updated in a follow up Jira.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Un-deprecate the old MapReduce API?

2010-04-22 Thread Owen O'Malley
On the various pieces, I think:

0.20: -0 for removing the deprecation, +1 for improving the
deprecation message with links to the corresponding class.

0.21: the new core API should be stable except for Job and Cluster;
new library code should be evolving.
-1 for removing the deprecation, we need to

0.22: all of the new API should be stable and the old API deprecated.

 Currently there is almost no way to write a moderately complex MR job that 
 doesn't spew deprecation warnings.

That is false in 0.21.

-- Owen


[jira] Created: (MAPREDUCE-1669) The imported JSON credentials should support binary secrets.

2010-04-02 Thread Owen O'Malley (JIRA)
The imported JSON credentials should support binary secrets.


 Key: MAPREDUCE-1669
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1669
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Reporter: Owen O'Malley


Currently, we support adding a file with secrets to a job. It can either be in 
binary or JSON, but the JSON format assumes that all of the secrets are UTF-8, 
which is often false. We should pick a format that allows binary data to be 
included.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[ANNOUNCEMENT] Pig Hadoop Contributors Workshop at Yahoo!

2010-03-25 Thread Owen O'Malley

Hello Hadoop Contributors,

The Yahoo Hadoop Development team would like to invite you to a 
Contributors Workshop for Hadoop Core (HDFS & Map-Reduce) and another 
for Pig on the day following the Hadoop Summit. The purpose of the 
workshops is to collectively discuss challenges, concerns, and future 
ideas around Hadoop and Pig technologies.


When: June 30th 2010 @ 10:00 am – 3:00 pm
Where: Yahoo! Building C, Classroom 5 @ 701 First Avenue, Sunnyvale 
CA 94089


If you have any suggestions for agenda items, please propose them on 
the relevant developers list. Owen O'Malley has volunteered to help 
organize the Hadoop Core meeting and Alan Gates is doing the same for 
Pig.


Please RSVP by sending an email to hadoopcontributorr...@yahoo-inc.com  
before May 30th if you plan to attend.


See you all at the Hadoop Summit – June 29th, http://www.hadoopsummit.org/

Looking forward to meeting more of you!

Eric Baldeschwieler & Owen O'Malley

PS We would be happy to provide space for any of the other Hadoop 
sub-projects as well! If you are interested in organizing such a 
workshop, please email us at hadoopcontributorr...@yahoo-inc.com with 
WORKSHOP ORGANIZER (project) in the subject line. I'll send a 
separate email to their dev lists with this invitation also.

Re: [ANNOUNCEMENT] Pig Hadoop Contributors Workshop at Yahoo!

2010-03-25 Thread Owen O'Malley


On Mar 25, 2010, at 10:20 AM, Owen O'Malley wrote:

Please RSVP by sending an email to hadoopcontributorr...@yahoo-inc.com 
before May 30th if you plan to attend.


The proper email is: hadoopcontribu...@yahoo-inc.com. Sorry for the 
confusion.


-- Owen

[jira] Created: (MAPREDUCE-1566) Need to add a mechanism to import tokens and secrets into a submitted job.

2010-03-05 Thread Owen O'Malley (JIRA)
Need to add a mechanism to import tokens and secrets into a submitted job.
--

 Key: MAPREDUCE-1566
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1566
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: security
Reporter: Owen O'Malley
Assignee: Owen O'Malley
 Fix For: 0.22.0


We need to include tokens and secrets into a submitted job. I propose adding a 
configuration attribute that when pointed at a token storage file will include 
the tokens and secrets from that token storage file.
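
A sketch of what that could look like from the submitting code; the
property name here is hypothetical, standing in for whatever attribute
the patch settles on:

import org.apache.hadoop.mapred.JobConf;

JobConf conf = new JobConf();
// Hypothetical attribute: point the job at an existing token storage
// file so its tokens and secrets are imported at submission time.
conf.set("mapreduce.job.credentials.binary", "/path/to/token-storage-file");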

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (MAPREDUCE-1567) Sharing Credentials between JobConfs leads to unintentional sharing of credentials

2010-03-05 Thread Owen O'Malley (JIRA)
Sharing Credentials between JobConfs leads to unintentional sharing of 
credentials
--

 Key: MAPREDUCE-1567
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1567
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: security
Reporter: Owen O'Malley
Assignee: Owen O'Malley
 Fix For: 0.22.0


Currently, if code does new JobConf(jobConf), it will share the Credentials. 
That leads to unintentional sharing.
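
A short sketch of the aliasing this describes, assuming the security
branch's Credentials API; the alias and secret are made up:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;

JobConf original = new JobConf();
JobConf copy = new JobConf(original);  // copies the config, shares Credentials
copy.getCredentials().addSecretKey(new Text("my.alias"), "s3cret".getBytes());
// The secret is now visible through original.getCredentials() as well.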

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (MAPREDUCE-1528) TokenStorage should not be static

2010-02-23 Thread Owen O'Malley (JIRA)
TokenStorage should not be static
-

 Key: MAPREDUCE-1528
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1528
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Reporter: Owen O'Malley


Currently, TokenStorage is a singleton. This doesn't work for some use cases, 
such as Oozie. I think that each Job should have a TokenStorage that is 
associated it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (MAPREDUCE-1515) need to pass down java5 and forrest home variables

2010-02-20 Thread Owen O'Malley (JIRA)
need to pass down java5 and forrest home variables
--

 Key: MAPREDUCE-1515
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1515
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: build
Reporter: Owen O'Malley
Assignee: Owen O'Malley
 Fix For: 0.22.0
 Attachments: m-1515.patch

Currently, the build script doesn't pass down the variables for java5 and 
forrest, so the build breaks unless they are on the command line.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (MAPREDUCE-1503) Push HADOOP-6551 into MapReduce

2010-02-18 Thread Owen O'Malley (JIRA)
Push HADOOP-6551 into MapReduce
---

 Key: MAPREDUCE-1503
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1503
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Reporter: Owen O'Malley


We need to throw readable exceptions instead of returning false.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (MAPREDUCE-1470) Move Delegation token into Common so that we can use it for MapReduce also

2010-02-08 Thread Owen O'Malley (JIRA)
Move Delegation token into Common so that we can use it for MapReduce also
--

 Key: MAPREDUCE-1470
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1470
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Reporter: Owen O'Malley
Assignee: Owen O'Malley


We need to update one reference for map/reduce when we move the hdfs delegation 
tokens.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (MAPREDUCE-1462) Enable context-specific and stateful serializers in MapReduce

2010-02-04 Thread Owen O'Malley (JIRA)
Enable context-specific and stateful serializers in MapReduce
-

 Key: MAPREDUCE-1462
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1462
 Project: Hadoop Map/Reduce
  Issue Type: New Feature
  Components: task
Reporter: Owen O'Malley
Assignee: Owen O'Malley


Although the current serializer framework is powerful, within the context of a 
job it is limited to picking a single serializer for a given class. 
Additionally, Avro generic serialization can make use of additional 
configuration/state such as the schema. (Most other serialization frameworks 
including Writable, Jute/Record IO, Thrift, Avro Specific, and Protocol Buffers 
only need the object's class name to deserialize the object.)

With the goal of keeping the easy things easy and maintaining backwards 
compatibility, we should be able to allow applications to use context-specific 
(eg. map output key) serializers in addition to the current type-based ones 
that handle the majority of the cases. Furthermore, we should be able to 
support serializer-specific configuration/metadata in a type-safe manner without 
cluttering up the base API with a lot of new methods that will confuse new 
users.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (MAPREDUCE-1440) MapReduce should use the short form of the user names

2010-02-02 Thread Owen O'Malley (JIRA)
MapReduce should use the short form of the user names
-

 Key: MAPREDUCE-1440
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1440
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: security
Reporter: Owen O'Malley
 Fix For: 0.22.0


To minimize disruption on MapReduce, we should use the local names (ie. 
omalley) rather than the long names (ie. omal...@apache.org) as the basis 
for the username in MapReduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (MAPREDUCE-1385) Make changes to MapReduce for the new UserGroupInformation APIs (HADOOP-6299)

2010-01-27 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley resolved MAPREDUCE-1385.
--

  Resolution: Fixed
Hadoop Flags: [Incompatible change, Reviewed]

I just committed this. Thanks, Devaraj!

 Make changes to MapReduce for the new UserGroupInformation APIs (HADOOP-6299)
 -

 Key: MAPREDUCE-1385
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1385
 Project: Hadoop Map/Reduce
  Issue Type: New Feature
Reporter: Devaraj Das
Assignee: Devaraj Das
 Fix For: 0.22.0

 Attachments: mr-6299.3.patch, mr-6299.7.patch, mr-6299.8.patch, 
 mr-6299.patch


 This is about moving the MapReduce code to use the new UserGroupInformation 
 API as described in HADOOP-6299.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Moving HDFS raid package from HDFS repository to MAPREDUCE repository

2010-01-22 Thread Owen O'Malley
The issue is that we need to avoid loops in the project dependencies.
Therefore, the order has to go:

Common -> HDFS -> MapReduce

The problem is that RAID needs MapReduce and therefore can't be put into
HDFS.

-- Owen


[jira] Reopened: (MAPREDUCE-1126) shuffle should use serialization to get comparator

2010-01-15 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley reopened MAPREDUCE-1126:
--


-1 to this massive API change without much more dialog. The scope of the patch 
was much larger than the description.

 shuffle should use serialization to get comparator
 --

 Key: MAPREDUCE-1126
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1126
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: task
Reporter: Doug Cutting
Assignee: Aaron Kimball
 Fix For: 0.22.0

 Attachments: MAPREDUCE-1126.2.patch, MAPREDUCE-1126.3.patch, 
 MAPREDUCE-1126.4.patch, MAPREDUCE-1126.5.patch, MAPREDUCE-1126.6.patch, 
 MAPREDUCE-1126.patch


 Currently the key comparator is defined as a Java class.  Instead we should 
 use the Serialization API to create key comparators.  This would permit, 
 e.g., Avro-based comparators to be used, permitting efficient sorting of 
 complex data types without having to write a RawComparator in Java.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (MAPREDUCE-1274) The completed job web ui urls include full path names to the local file system on the JobTracker.

2009-12-08 Thread Owen O'Malley (JIRA)
The completed job web ui urls include full path names to the local file system 
on the JobTracker.
-

 Key: MAPREDUCE-1274
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1274
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: security
Affects Versions: 0.21.0
Reporter: Owen O'Malley
Priority: Blocker
 Fix For: 0.21.0


Currently, the web ui for MapReduce in 0.21.0-dev includes a path to a local 
file in the url:

http://jt.foo.com:50030/jobdetailshistory.jsp?jobid=job_200912012129_0001logFile=file%3A%2Fopt%2Flocal%2Fowen%2Fhadoop%2Frun%2Flogs%2Fhistory%2Fdone%2Fjob_200912012129_0001_oom

This implies a security bug where the user uses logFile=/etc/passwd or some 
other annoying trick. 

I suspect the answer is applying MAPREDUCE-1185 back to 0.21.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Reopened: (MAPREDUCE-1244) eclipse-plugin fails with missing dependencies

2009-12-04 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley reopened MAPREDUCE-1244:
--


We need to apply this to 0.21 also.

 eclipse-plugin fails with missing dependencies
 --

 Key: MAPREDUCE-1244
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1244
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: build
Affects Versions: 0.22.0
Reporter: Giridharan Kesavan
Assignee: Giridharan Kesavan
 Fix For: 0.21.0, 0.22.0

 Attachments: mapred-1244.patch




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (MAPREDUCE-1241) JobTracker should not crash when mapred-queues.xml does not exist

2009-11-25 Thread Owen O'Malley (JIRA)
JobTracker should not crash when mapred-queues.xml does not exist
-

 Key: MAPREDUCE-1241
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1241
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Reporter: Owen O'Malley
Priority: Blocker
 Fix For: 0.21.0, 0.22.0


Currently, if you bring up the JobTracker on an old configuration directory, it 
gets a NullPointerException looking for the mapred-queues.xml file. It should 
just assume a default queue and continue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Ideas for dynamic change reducer task number ?

2009-11-23 Thread Owen O'Malley


On Nov 22, 2009, at 4:48 PM, Jeff Zhang wrote:

My concern is that it is just like hard coding to use
conf.setNumReduceTasks on the configuration. It is not flexible, so my
idea is to add an interface to change the reducer number dynamically
according to the size of the input data set.


You misunderstand. I meant doing something like:

// Sketch only: extending TextInputFormat here for illustration; any
// InputFormat whose getSplits can see the total input size would do.
public class MyInputFormat extends TextInputFormat {

  public InputSplit[] getSplits(JobConf conf, int numSplits) throws IOException {
    InputSplit[] result = super.getSplits(conf, numSplits);
    long size = 0;                        // compute total size of input
    for (InputSplit split : result) {
      size += split.getLength();
    }
    // roughly one reducer per 10 GB of input, with a floor of 6
    conf.setNumReduceTasks((int) Math.max(6, size / (10L << 30)));
    return result;
  }
}

I haven't checked the code to make sure it will work, but I believe it 
will.


-- Owen


Re: Ideas for dynamic change reducer task number ?

2009-11-22 Thread Owen O'Malley
I'd suggest trying to do conf.setNumReduceTasks on the configuration passed
to the InputFormat in getSplits. It will probably just work.

-- Owen


[jira] Resolved: (MAPREDUCE-1091) TaskTrackers only work with same build as the JobTracker

2009-10-13 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley resolved MAPREDUCE-1091.
--

Resolution: Won't Fix

 TaskTrackers only work with same build as the JobTracker
 

 Key: MAPREDUCE-1091
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1091
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: tasktracker
Affects Versions: 0.21.0
Reporter: Arun C Murthy
 Fix For: 0.21.0


 Currently tasktrackers check to ensure that they are the same build as the 
 JobTracker and bail-out if not. This is too restrictive - in the past we've 
 had similar complaints: HADOOP-5203.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Welcome Konstantin Boudnik as a qa committer!

2009-10-05 Thread Owen O'Malley
The Hadoop PMC has voted to make Cos a QA committer on Common, HDFS,  
and MapReduce. I'd like to welcome Cos as the newest committer.


-- Owen


[jira] Resolved: (MAPREDUCE-1014) After the 0.21 branch, MapReduce trunk doesn't compile

2009-09-22 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley resolved MAPREDUCE-1014.
--

Resolution: Fixed

I updated the common and hdfs jars with the current ones.

 After the 0.21 branch, MapReduce trunk doesn't compile
 --

 Key: MAPREDUCE-1014
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1014
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Affects Versions: 0.22.0
Reporter: Devaraj Das
Assignee: Ravi Gummadi
Priority: Blocker
 Fix For: 0.22.0


 When ant is run, the build fails with compilation problems. The first of that 
 is:
 compile-mapred-classes:
   [taskdef] log4j:ERROR Could not instantiate class 
 [org.apache.hadoop.metrics.jvm.EventCounter].
   [taskdef] java.lang.ClassNotFoundException: 
 org.apache.hadoop.metrics.jvm.EventCounter
   [taskdef] at 
 org.apache.tools.ant.AntClassLoader.findClassInComponents(AntClassLoader.java:1383)
   [taskdef] at 
 org.apache.tools.ant.AntClassLoader.findClass(AntClassLoader.java:1324)
   [taskdef] at 
 org.apache.tools.ant.AntClassLoader.loadClass(AntClassLoader.java:1072)
   [taskdef] at java.lang.ClassLoader.loadClass(ClassLoader.java:254)
   [taskdef] at 
 java.lang.ClassLoader.loadClassInternal(ClassLoader.java:402)
   [taskdef] at java.lang.Class.forName0(Native Method)
   [taskdef] at java.lang.Class.forName(Class.java:169)
   [taskdef] at org.apache.log4j.helpers.Loader.loadClass(Loader.java:179)
   [taskdef] at 
 org.apache.log4j.helpers.OptionConverter.instantiateByClassName(OptionConverter.java:320)
   [taskdef] at 
 org.apache.log4j.helpers.OptionConverter.instantiateByKey(OptionConverter.java:121)
   [taskdef] at 
 org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:664)
   [taskdef] at 
 org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:647)
   [taskdef] at 
 org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:544)
   [taskdef] at 
 org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:440)
   [taskdef] at 
 org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:476)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (MAPREDUCE-1026) Shuffle should be secure

2009-09-22 Thread Owen O'Malley (JIRA)
Shuffle should be secure


 Key: MAPREDUCE-1026
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1026
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: security
Reporter: Owen O'Malley
Assignee: Devaraj Das


Since the user's data is available via http from the TaskTrackers, we should 
require a job-specific secret to access it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: [VOTE] Commit MAPREDUCE-728 to Hadoop 0.21

2009-09-21 Thread Owen O'Malley


On Sep 18, 2009, at 10:24 PM, Hong Tang wrote:

Given the circumstances, I would like to request a vote to commit  
MAPREDUCE-728 to Hadoop 0.21.


Mumak has already found a couple bugs in map/reduce and promises to  
find more. I think this is a good low-risk addition to 0.21.


+1

-- Owen


[jira] Created: (MAPREDUCE-1016) Make the format of the Job History be JSON instead of Avro binary

2009-09-21 Thread Owen O'Malley (JIRA)
Make the format of the Job History be JSON instead of Avro binary
-

 Key: MAPREDUCE-1016
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1016
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Reporter: Owen O'Malley
 Fix For: 0.21.0, 0.22.0


I forgot that one of the features that would be nice is to offload the job 
history display from the JobTracker. That will be a lot easier if the job 
history is stored in JSON. Therefore, I think we should change the storage now 
to prevent incompatibilities later.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: branching mapred

2009-09-15 Thread Owen O'Malley


On Sep 15, 2009, at 9:31 AM, Steve Loughran wrote:



I've created a little branch where I've synced up my lifecycle-aware  
services with the moved bits


http://svn.apache.org/viewvc/hadoop/mapreduce/branches/MAPREDUCE-233/


+1


[jira] Created: (MAPREDUCE-954) The new interface's Context objects should be interfaces

2009-09-04 Thread Owen O'Malley (JIRA)
The new interface's Context objects should be interfaces


 Key: MAPREDUCE-954
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-954
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: client
Reporter: Owen O'Malley
 Fix For: 0.21.0


When I was doing HADOOP-1230, I was persuaded to make the Context objects 
classes. I think that was a serious mistake. It caused a lot of information 
leakage into the public classes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (MAPREDUCE-421) mapred pipes might return exit code 0 even when failing

2009-08-27 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley resolved MAPREDUCE-421.
-

  Resolution: Fixed
Hadoop Flags: [Reviewed]

I realized that this was difficult to test.

I just committed this. Thanks, Christian!



 mapred pipes might return exit code 0 even when failing
 ---

 Key: MAPREDUCE-421
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-421
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: pipes
Reporter: Christian Kunz
Assignee: Christian Kunz
 Fix For: 0.20.1

 Attachments: MAPREDUCE-421.patch


 Up to Hadoop 0.18.3, org.apache.hadoop.mapred.JobShell ensured that 'hadoop 
 jar' returns a non-zero exit code when the job fails.
 This is no longer true after moving this to org.apache.hadoop.util.RunJar.
 Pipes jobs submitted through the cli never returned a proper exit code.
 The main methods in org.apache.hadoop.util.RunJar and 
 org.apache.hadoop.mapred.pipes.Submitter should be modified to return an exit 
 code similar to how org.apache.hadoop.mapred.JobShell did it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (MAPREDUCE-917) Remove getInputCounter and getOutputCounter from Contexts

2009-08-26 Thread Owen O'Malley (JIRA)
Remove getInputCounter and getOutputCounter from Contexts
-

 Key: MAPREDUCE-917
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-917
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: task
Affects Versions: 0.21.0
Reporter: Owen O'Malley
Assignee: Amareshwari Sriramadasu
Priority: Blocker
 Fix For: 0.21.0


The getInputCounter and getOutputCounter methods need to be removed from the 
new mapreduce APIs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (MAPREDUCE-693) Conf files not moved to done subdirectory after JT restart

2009-08-26 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley resolved MAPREDUCE-693.
-

   Resolution: Cannot Reproduce
Fix Version/s: (was: 0.20.1)

 Conf files not moved to done subdirectory after JT restart
 

 Key: MAPREDUCE-693
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-693
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: jobtracker
Affects Versions: 0.20.1
Reporter: Ramya R
Priority: Minor
 Attachments: MAPREDUCE-693-v1.1-branch-0.20.patch, 
 MAPREDUCE-693-v1.2-branch-0.20.patch


 After MAPREDUCE-516, when a job is submitted and the JT is restarted (before 
 job files have been written) and the job is killed after recovery, the conf 
 files fail to be moved to the done subdirectory.
 The exact scenario to reproduce this issue is:
 * Submit a job
 * Restart JT before anything is written to the job files
 * Kill the job
 * The old conf files remain in the history folder and fail to be moved to 
 done subdirectory

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (MAPREDUCE-777) A method for finding and tracking jobs from the new API

2009-07-20 Thread Owen O'Malley (JIRA)
A method for finding and tracking jobs from the new API
---

 Key: MAPREDUCE-777
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-777
 Project: Hadoop Map/Reduce
  Issue Type: New Feature
Reporter: Owen O'Malley


We need to create a replacement interface for the JobClient API in the new 
interface. In particular, the user needs to be able to query and track jobs 
that were launched by other processes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Reopened: (MAPREDUCE-716) org.apache.hadoop.mapred.lib.db.DBInputformat not working with oracle

2009-07-07 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley reopened MAPREDUCE-716:
-

  Assignee: evanand

Sorry, I thought the other jira was still open.

 org.apache.hadoop.mapred.lib.db.DBInputformat not working with oracle
 -

 Key: MAPREDUCE-716
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-716
 Project: Hadoop Map/Reduce
  Issue Type: Bug
 Environment: Java 1.6, Hadoop 0.19.0, Linux, Oracle
Reporter: evanand
Assignee: evanand
 Attachments: HADOOP-5482.20-branch.patch, HADOOP-5482.patch, 
 HADOOP-5482.trunk.patch

   Original Estimate: 24h
  Remaining Estimate: 24h

 org.apache.hadoop.mapred.lib.db.DBInputformat not working with oracle.
 The out-of-the-box implementation in Hadoop works properly with 
 mysql/hsqldb, but NOT with oracle.
 The reason is that DBInputformat is implemented with mysql/hsqldb-specific 
 query constructs like LIMIT, OFFSET.
 FIX:
 build database provider-specific logic based on the database 
 provider name (which we can get from the connection).
 I HAVE ALREADY IMPLEMENTED IT FOR ORACLE...READY TO CHECK_IN CODE

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (MAPREDUCE-726) Move the mapred script to map/reduce

2009-07-07 Thread Owen O'Malley (JIRA)
Move the mapred script to map/reduce


 Key: MAPREDUCE-726
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-726
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Reporter: Owen O'Malley
Assignee: Dick King


The mapred script should be moved to mapreduce from Common. This is the 
parallel of HADOOP-6123.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (MAPREDUCE-712) TextWritter example is CPU bound!!

2009-07-06 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley resolved MAPREDUCE-712.
-

Resolution: Invalid

16 maps on 8 cpus running gzip is expected to completely saturate cpu. This is 
not a bug!!!

Also check to see if you were using the native codec. If you are using the Java 
codec, it will be very slow and cpu bound.

 TextWritter example is CPU bound!!
 --

 Key: MAPREDUCE-712
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-712
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: task
Affects Versions: 0.20.1, 0.21.0
 Environment: ~200 nodes cluster
 Each node has the following configuration:
 Processors: 2 x Xeon L5420 2.50GHz (8 cores) - Harpertown C0, 64-bit, 
 quad-core (8 CPUs)
 4 Disks
 16 GB RAM
 Linux 2.6
 Hadoop version: trunk
Reporter: Khaled Elmeleegy

 Running the RandomTextWritter example job (from the examples jar) pegs the 
 machines' CPUs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Reopened: (MAPREDUCE-712) TextWritter example is CPU bound!!

2009-07-06 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley reopened MAPREDUCE-712:
-


I notice now that you didn't have compression. I wonder how much time you were 
spending in gc with such small heaps. That might explain the cpu load.

 TextWritter example is CPU bound!!
 --

 Key: MAPREDUCE-712
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-712
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: task
Affects Versions: 0.20.1, 0.21.0
 Environment: ~200 nodes cluster
 Each node has the following configuration:
 Processors: 2 x Xeon L5420 2.50GHz (8 cores) - Harpertown C0, 64-bit, 
 quad-core (8 CPUs)
 4 Disks
 16 GB RAM
 Linux 2.6
 Hadoop version: trunk
Reporter: Khaled Elmeleegy

 Running the RandomTextWritter example job (from the examples jar) pegs the 
 machines' CPUs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: WELCOME to mapreduce-dev@hadoop.apache.org

2009-07-05 Thread Owen O'Malley


On Jul 5, 2009, at 1:23 PM, shruti jain wrote:


hi everyone,

I am trying to do svn checkout with ssh:
svn checkout svn+ssh://svn.apache.org/repos/asf/hadoop/common/trunk/
hadoop-common-trunk

But it asks for a password. What should I do?


Use http instead of svn+ssh.
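
For example, the same checkout over http:

svn checkout http://svn.apache.org/repos/asf/hadoop/common/trunk/ hadoop-common-trunk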

-- Owen