Re: Un-deprecate the old MapReduce API?
I don't have any issue with un-deprecating the old APIs. I agree that if changes are needed, it's better to mark the new APIs to reflect that. I just hope those changes can be kept as backward compatible as possible. In particular with Job: Pig uses that in some of its APIs that it has declared stable (LoadFunc, StoreFunc).

Alan.

On Apr 22, 2010, at 11:30 PM, Arun C Murthy wrote:

Alan,

On Apr 22, 2010, at 12:12 PM, Alan Gates wrote:

Speaking for one power user (Pig) that did move to the new APIs, moving that interface to Evolving is a little unsettling. Is there a feel for how much the new API is going to change?

Alan.

The intent isn't to mark the 'new' APIs as 'Evolving' in order to change them willy-nilly... please don't read it so! This is just a pragmatic proposal to reflect that the 'old' APIs will, for lack of stabilization of the new APIs, continue to be supported. Given that, the new APIs could mostly be Stable, except for Job and Cluster - is that reasonable? This will ensure we send the right message to all concerned regarding the stability of o.a.h.mapreduce.{Mapper|Reducer|...}.

Thoughts?

Arun
Re: Un-deprecate the old MapReduce API?
Speaking for one power user (Pig) that did move to the new APIs, moving that interface to Evolving is a little unsettling. Is there a feel for how much the new API is going to change?

Alan.

On Apr 21, 2010, at 2:24 PM, Tom White wrote:

The old MapReduce API in org.apache.hadoop.mapred was deprecated in the 0.20 release series when the new (Context Objects) MapReduce API was added in org.apache.hadoop.mapreduce. Unfortunately, the new API was not complete in 0.20, and most users stayed with the old API. This has led to the confusing situation where the old API is generally recommended, even though it is deprecated.

To remedy this situation, I suggest that we remove the deprecations from the old API in 0.20 and trunk, and mark the new API as Evolving (see MAPREDUCE-1623 for the latter). This would mean a few things:

* The next 0.20 release would have a non-deprecated old API.
* The forthcoming 0.21 release would have a Stable (non-deprecated) old API, and an Evolving new API.
* For some pre-1.0 release (perhaps 0.22), the old API could be deprecated again, and the new API marked as Stable.
* In the 1.0 release it would be possible to remove the old API.

Thoughts?

Tom
Re: Map-Balance-Reduce draft
Jian,

Sorry if any of my questions or comments would have been answered by the diagrams, but Apache lists don't allow attachments, so I can't see your diagrams.

If I understand correctly, your suggestion for balancing is to apply reduce on subsets of the hashed data, and then run reduce again on this reduced data set. Is that correct? If so, how does this differ from the combiner?

Second, some aggregation operations truly aren't algebraic (that is, they cannot be distributed across multiple iterations of reduce). An example of this is session analysis, where the algorithm truly needs to see all of a user's operations together to analyze the session. How do you propose to handle that case?

Alan.

On Feb 7, 2010, at 11:25 PM, jian yi wrote:

Two targets:
1. Solve the skew problem.
2. Treat a task as a timeslice to improve the scheduler, switching from one job to another by timeslice.

In the MR (Map-Reduce) model, reduces are not balanced, because the partitions are of unequal size. How to balance? We can control the size of a partition: rehash the bigger partitions and combine the smaller ones up to the specified size. If a key has many values, it is necessary to execute MapReduce twice. The following is the model diagram:

The scheduler can treat a task as a timeslice, similar to an OS scheduler. If a split is bigger than a specified size, it will be split again. If a split is smaller than a specified size, it will be combined with others; we can call this combining procedure "regroup". The combining is logical - it is not necessary to merge these smaller splits into one disk file - so it will not affect performance. The target is that every task spends the same amount of time running.
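For what it's worth, the "reduce twice" idea Alan is comparing to a combiner can be sketched outside Hadoop. The toy Python below is my own illustration, not Hadoop API code, and every name in it is made up: it rehashes one hot key's value list into fixed-size sub-partitions, reduces each one, then reduces the partial results. The sketch also shows the limitation Alan raises - the answer is only correct when the reduce function is algebraic.

```python
# Toy illustration of the "reduce twice" / Map-Balance-Reduce idea.
# All names here are invented for this sketch; this is not Hadoop code.
from itertools import islice

def chunks(values, size):
    """Rehash one hot key's value list into fixed-size sub-partitions."""
    it = iter(values)
    while chunk := list(islice(it, size)):
        yield chunk

def two_pass_reduce(values, reduce_fn, size):
    """Pass 1: reduce each sub-partition; pass 2: reduce the partials."""
    partials = [reduce_fn(c) for c in chunks(values, size)]
    return reduce_fn(partials)

# An algebraic function such as sum distributes across the two passes,
# so rebalancing a skewed key this way is safe (this is also the
# property a combiner exploits on the map side):
hot_key_values = list(range(1000))
assert two_pass_reduce(hot_key_values, sum, 64) == sum(hot_key_values)

# A holistic function such as median does not distribute: the median of
# sub-partition medians need not be the true median. This is the
# session-analysis objection in miniature.
def median(vs):
    s = sorted(vs)
    return s[len(s) // 2]

data = [1, 1, 1, 2, 3, 3, 3, 3, 1, 1]
assert median(data) == 2                      # true median
assert two_pass_reduce(data, median, 5) == 3  # two-pass gets it wrong
```

In Hadoop terms, the first pass plays roughly the role a combiner plays, which is why the question "how does this differ from the combiner?" matters: the rebalancing trick only helps for reduce functions that can be applied in multiple rounds.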