Re: Un-deprecate the old MapReduce API?

2010-04-23 Thread Alan Gates
I don't have any issue with un-deprecating the old APIs. I agree that if changes are needed, it's better to mark the new APIs to reflect that. I just hope those changes can be kept as backward compatible as possible. In particular with Job: Pig uses it in some of the APIs it has declared stable (LoadFunc, StoreFunc).
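
For concreteness, here is a minimal sketch of where Job shows up in that stable surface: a LoadFunc implementation receives the new-API Job in setLocation(), so an incompatible change to Job leaks directly into loaders. The signatures below are recalled from the Pig 0.7-era LoadFunc and are illustrative only, not authoritative.

import java.io.IOException;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.pig.LoadFunc;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.Tuple;

// Sketch of a trivial loader, meant only to show the dependency on
// o.a.h.mapreduce.Job in Pig's stable loader contract.
public class SimpleTextLoader extends LoadFunc {
  private RecordReader reader;

  @Override
  public void setLocation(String location, Job job) throws IOException {
    // The new-API Job object is part of the contract Pig exposes to loader authors.
    FileInputFormat.setInputPaths(job, location);
  }

  @Override
  public InputFormat getInputFormat() throws IOException {
    return new TextInputFormat();
  }

  @Override
  public void prepareToRead(RecordReader reader, PigSplit split) throws IOException {
    this.reader = reader;
  }

  @Override
  public Tuple getNext() throws IOException {
    // Conversion of the next input record into a Tuple omitted for brevity.
    return null;
  }
}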


Alan.

On Apr 22, 2010, at 11:30 PM, Arun C Murthy wrote:


Alan,

On Apr 22, 2010, at 12:12 PM, Alan Gates wrote:

Speaking for one power user (Pig) that did move to the new APIs,  
moving that interface to evolving is a little unsettling.  Is there  
a feel for how much the new API is going to change?




The intent isn't to mark the 'new' APIs as 'Evolving' to change them willy-nilly... please don't read it so!


This is just a pragmatic proposal to reflect that the 'old' APIs will continue to be supported until the new APIs have stabilized.


Given that, the new APIs could mostly be marked Stable, except for Job and Cluster - is that reasonable? This would ensure we send the right message to all concerned regarding the stability of o.a.h.mapreduce.{Mapper|Reducer|...}. Thoughts?
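
To make the mechanics concrete, a hedged sketch of what that marking could look like using the interface-classification annotations in o.a.h.classification; which classes actually receive which annotation is exactly what MAPREDUCE-1623 would decide, not this sketch:

import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;

// Job (and Cluster) would stay Evolving for now...
@InterfaceAudience.Public
@InterfaceStability.Evolving
public class Job { /* ... */ }

// ...while Mapper, Reducer, etc. could be marked Stable.
// (Each class lives in its own source file, as in Hadoop itself.)
@InterfaceAudience.Public
@InterfaceStability.Stable
public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> { /* ... */ }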


Arun


Alan.





Re: Un-deprecate the old MapReduce API?

2010-04-22 Thread Alan Gates
Speaking for one power user (Pig) that did move to the new APIs,  
moving that interface to evolving is a little unsettling.  Is there a  
feel for how much the new API is going to change?


Alan.

On Apr 21, 2010, at 2:24 PM, Tom White wrote:


The old MapReduce API in org.apache.hadoop.mapred was deprecated in
the 0.20 release series when the new (Context Objects) MapReduce API
was added in org.apache.hadoop.mapreduce. Unfortunately, the new API
was not complete in 0.20 and most users stayed with the old API. This
has led to the confusing situation where the old API is generally
recommended, even though it is deprecated.
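
For readers trying to keep the two APIs straight, here is the same identity mapper written against both - a sketch added for illustration, not part of the original mail:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// 'Old' API -- org.apache.hadoop.mapred, deprecated in 0.20:
class OldApiIdentityMapper extends MapReduceBase
    implements org.apache.hadoop.mapred.Mapper<LongWritable, Text, LongWritable, Text> {
  public void map(LongWritable key, Text value,
                  OutputCollector<LongWritable, Text> output, Reporter reporter)
      throws IOException {
    output.collect(key, value);
  }
}

// 'New' (Context Objects) API -- org.apache.hadoop.mapreduce:
class NewApiIdentityMapper
    extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, LongWritable, Text> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    context.write(key, value);
  }
}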

To remedy this situation I suggest that we remove deprecations from
the old API in 0.20 and trunk, and mark the new API as Evolving (see
MAPREDUCE-1623 for the latter). This would mean a few things:

* The next 0.20 release would have a non-deprecated old API.
* The forthcoming 0.21 release would have a Stable (non-deprecated)
old API, and an Evolving new API.
* For some pre-1.0 release (perhaps 0.22), the old API could be
deprecated again, and the new API marked as Stable.
* In the 1.0 release it would be possible to remove the old API.

Thoughts?

Tom




Re: Map-Balance-Reduce draft

2010-02-08 Thread Alan Gates

Jian,

Sorry if any of my questions or comments would have been answered by the diagrams, but Apache lists don't allow attachments, so I can't see your diagrams.


If I understand correctly, your suggestion for balancing is to apply  
reduce on subsets of the hashed data, and then run reduce again on  
this reduced data set.  Is that correct?  If so, how does this differ  
from the combiner?  Second, some aggregation operations truly aren't  
algebraic (that is, they cannot be distributed across multiple  
iterations of reduce).   An example of this is session analysis, where  
the algorithm truly needs to see all operations together to analyze  
the user session.  How do you propose to handle that case?
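
By way of a concrete example of that distinction (added here for illustration): an algebraic aggregate such as SUM can be applied to partial groups and then to the partial results - which is what a combiner already does - while sessionization cannot.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// SUM is algebraic: sum(sum(a,b), sum(c)) == sum(a,b,c), so this reducer is
// also safe to register as the combiner (job.setCombinerClass(SumReducer.class)).
class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
  @Override
  protected void reduce(Text key, Iterable<LongWritable> values, Context context)
      throws IOException, InterruptedException {
    long sum = 0;
    for (LongWritable v : values) {
      sum += v.get();
    }
    context.write(key, new LongWritable(sum));
  }
}

// A sessionization reducer, in contrast, must see every event for a user in a
// single reduce call to order them by timestamp; reducing partial subsets and
// then reducing the results would split sessions across the subsets.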


Alan.

On Feb 7, 2010, at 11:25 PM, jian yi wrote:


Two targets:
1. Solve the skew problem.
2. Treat a task as a timeslice to improve the scheduler, so it can switch from one job to another by timeslice.


In the MR (Map-Reduce) model the reduce side is not balanced, because partition sizes are unbalanced. How do we balance them? We can control the size of each partition: rehash the bigger partitions and combine the smaller ones up to the specified size. If a key has many values, it is necessary to execute MapReduce twice. The following is the model diagram:


The scheduler can treat a task as a timeslice, much like an OS scheduler. If a split is bigger than a specified size, it will be split again. If a split is smaller than a specified size, it will be combined with others; we can call this combining procedure "regroup". The combining is logical: it is not necessary to merge the smaller splits into a single disk file, so it does not affect performance. The target is that every task spends the same amount of time running.
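
A rough sketch (added here, not part of the original mail) of one way to read the "rehash the bigger partition and run MapReduce twice" idea: in a first job, a known-heavy key is salted so its values spread across several reducers, and a second job strips the salt and reduces the partial results. The heavy-key test and the second job are only outlined in comments.

import java.io.IOException;
import java.util.Random;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Pass 1 mapper: salt heavy keys so no single reducer receives all their values.
class SaltingMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
  private static final int FANOUT = 16;  // assumed number of sub-partitions per heavy key
  private final Random random = new Random();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String key = line.toString().split("\t")[0];
    if (isHeavyKey(key)) {
      // Rehash: the heavy key becomes FANOUT logical keys.
      key = key + "#" + random.nextInt(FANOUT);
    }
    context.write(new Text(key), new LongWritable(1));
  }

  private boolean isHeavyKey(String key) {
    // In practice this would come from sampling or counters from an earlier job.
    return false;
  }
}

// Pass 2 (a separate job): the mapper strips the "#n" suffix and the data is
// reduced again. As Alan notes above, this only works for algebraic aggregates.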