Images is not visible of gmail?//Map-Balance-Reduce draft

2010-02-08 Thread jian yi
Two targets:
1. Solving the skew problem
2. Regarding a task as a timeslice to improve on scheduler, switching a job
to another job by timeslice.

In the MR (Map-Reduce) model, reduces are not balanced because the sizes of
the partitions are unbalanced. How to balance them? We can control the size
of a partition: rehash the bigger partitions and combine the smaller ones up
to the specified size. If a key has many values, it is necessary to execute
mapreduce twice. The following is the model diagram:

mbr1.jpg (attachment)

The scheduler can regard a task as a timeslice, similar to an OS scheduler.
If a split is bigger than a specified size, it will be split again. If a
split is smaller than a specified size, it will be combined with others; we
can name the combining procedure regroup. The combining is logical: it is not
necessary to combine these smaller splits into a disk file, so it will not
affect performance. The target is that every task spends the same time
running.
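
The split-and-regroup idea above can be sketched as follows. This is only an
illustration under assumptions, not the proposed implementation: the names
Piece and plan_tasks are hypothetical, sizes are plain byte counts, and the
grouping uses a simple first-fit-decreasing heuristic that the draft does not
specify. A task is just a list of byte ranges, so nothing is copied on disk.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Piece:
    """A byte range inside one input split (hypothetical helper type)."""
    split_id: str
    offset: int
    length: int


def plan_tasks(splits: List[Tuple[str, int]], target: int) -> List[List[Piece]]:
    """Cut oversized splits, then logically regroup pieces into tasks.

    `splits` is a list of (split_id, size_in_bytes). Each returned task is a
    list of byte ranges whose total length is at most `target`.
    """
    if target <= 0:
        raise ValueError("target must be positive")

    # Step 1: split anything bigger than the target into target-sized pieces.
    pieces: List[Piece] = []
    for split_id, size in splits:
        offset = 0
        while offset < size:
            length = min(target, size - offset)
            pieces.append(Piece(split_id, offset, length))
            offset += length

    # Step 2: regroup. Place each piece (largest first) into the first task
    # that still has room; open a new task when none has.
    tasks: List[List[Piece]] = []
    loads: List[int] = []
    for piece in sorted(pieces, key=lambda p: p.length, reverse=True):
        for i, load in enumerate(loads):
            if load + piece.length <= target:
                tasks[i].append(piece)
                loads[i] += piece.length
                break
        else:
            tasks.append([piece])
            loads.append(piece.length)
    return tasks
```

The heuristic only bounds each task from above; some tasks can still end up
well below the target, so equal run time is approximated rather than
guaranteed.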

mbr2.jpg (attachment)


[jira] Reopened: (MAPREDUCE-927) Cleanup of task-logs should happen in TaskTracker instead of the Child

2010-02-08 Thread Amareshwari Sriramadasu (JIRA)

 [ https://issues.apache.org/jira/browse/MAPREDUCE-927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu reopened MAPREDUCE-927:
---


 Cleanup of task-logs should happen in TaskTracker instead of the Child
 --

 Key: MAPREDUCE-927
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-927
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: tasktracker
Affects Versions: 0.21.0
Reporter: Vinod K V
Priority: Blocker
 Fix For: 0.21.0


 Task logs' cleanup is being done in Child now. This is undesirable for at
 least two reasons: 1) failures while cleaning up will affect the user's
 tasks, and 2) the task's wall time will be affected by operations that the
 TT should actually own.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Images is not visible of gmail?//Map-Balance-Reduce draft

2010-02-08 Thread Fusheng Han
Hi, jian

No attachments can be seen in gmail.



[jira] Created: (MAPREDUCE-1466) FileInputFormat should save #input-files in JobConf

2010-02-08 Thread Arun C Murthy (JIRA)
FileInputFormat should save #input-files in JobConf
---

 Key: MAPREDUCE-1466
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1466
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: client
Reporter: Arun C Murthy
Assignee: Arun C Murthy
Priority: Minor
 Fix For: 0.22.0


We already track the amount of data consumed by MR applications
(MAP_INPUT_BYTES); along with that, it would be useful to track #input-files
from the client side for analysis. Along the lines of MAPREDUCE-1403, it would
be easy to stick this in the JobConf during job submission.




[jira] Created: (MAPREDUCE-1467) Add a --verbose flag to Sqoop

2010-02-08 Thread Aaron Kimball (JIRA)
Add a --verbose flag to Sqoop
-

 Key: MAPREDUCE-1467
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1467
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: contrib/sqoop
Reporter: Aaron Kimball
Assignee: Aaron Kimball
Priority: Minor
 Attachments: MAPREDUCE-1467.patch

Need a {{--verbose}} flag that sets the log4j level to DEBUG.




Re: Map-Balance-Reduce draft

2010-02-08 Thread Alan Gates

Jian,

Sorry if any of my questions or comments would have been answered by  
the diagrams, but apache lists don't allow attachments, so I can't see  
your diagrams.


If I understand correctly, your suggestion for balancing is to apply  
reduce on subsets of the hashed data, and then run reduce again on  
this reduced data set.  Is that correct?  If so, how does this differ  
from the combiner?  Second, some aggregation operations truly aren't  
algebraic (that is, they cannot be distributed across multiple  
iterations of reduce).   An example of this is session analysis, where  
the algorithm truly needs to see all operations together to analyze  
the user session.  How do you propose to handle that case?
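
A small illustration of the distinction drawn above, as a sketch only. Session
analysis is not modelled here; the median is swapped in as a simpler example
of an operation that cannot be distributed across repeated reduces, while sum
is one that can. The helper two_pass and the sample values are hypothetical.

```python
from statistics import median
from typing import Callable, List, Sequence


def two_pass(reduce_fn: Callable[[Sequence[float]], float],
             subsets: List[List[float]]) -> float:
    """Reduce each subset on its own, then reduce the partial results."""
    return reduce_fn([reduce_fn(subset) for subset in subsets])


values = [1, 2, 3, 4, 100, 200, 300]
subsets = [values[:4], values[4:]]  # an arbitrary split of one key's values

# Sum distributes: reducing the partial sums gives the same total.
sum_direct = sum(values)
sum_two_pass = two_pass(sum, subsets)

# Median does not: the median of the subset medians is a different number
# from the median of all values seen together.
median_direct = median(values)
median_two_pass = two_pass(median, subsets)
```

For sum the two results agree; for median they do not, which is the case
where a second reduce pass over partial results gives a wrong answer.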


Alan.







[jira] Created: (MAPREDUCE-1468) Too many log messages generated if a node has a different Hadoop version installed

2010-02-08 Thread Qi Liu (JIRA)
Too many log messages generated if a node has a different Hadoop version 
installed
--

 Key: MAPREDUCE-1468
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1468
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Affects Versions: 0.20.1
Reporter: Qi Liu


When upgrading or downgrading Hadoop on a large cluster, it is possible that
some nodes are not upgraded/downgraded. In this case, the jobtracker and
namenode generate gigabytes of logs each day, reporting a protocol version
mismatch. Such a log message is generated once every 3 ms.

Ideally, the misbehaving node should quit retrying after several attempts if
there is a protocol version mismatch. At the very least, the log messages
should be minimized (maybe once a day or once per start up).
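
One way to minimize such messages is to throttle them per source. The sketch
below is a generic illustration, not Hadoop code: ThrottledLogger, its
parameters and the injected clock are all hypothetical names, and the
"once a day" interval is just the example figure from the description above.

```python
import time
from typing import Callable, Dict


class ThrottledLogger:
    """Emit at most one message per key within `min_interval_s` seconds."""

    def __init__(self, emit: Callable[[str], None], min_interval_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._emit = emit
        self._min_interval_s = min_interval_s
        self._clock = clock
        self._last_emit: Dict[str, float] = {}
        self._suppressed: Dict[str, int] = {}

    def log(self, key: str, message: str) -> bool:
        """Log `message` for `key`; return False if it was suppressed."""
        now = self._clock()
        last = self._last_emit.get(key)
        if last is not None and now - last < self._min_interval_s:
            # Too soon after the previous message for this key: count it only.
            self._suppressed[key] = self._suppressed.get(key, 0) + 1
            return False
        skipped = self._suppressed.pop(key, 0)
        suffix = f" ({skipped} similar messages suppressed)" if skipped else ""
        self._emit(message + suffix)
        self._last_emit[key] = now
        return True
```

Keying by node (for example, its address) keeps one noisy node from hiding
mismatch reports from the others, and the suppressed count preserves a record
of how often the event happened.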




[jira] Created: (MAPREDUCE-1470) Move Delegation token into Common so that we can use it for MapReduce also

2010-02-08 Thread Owen O'Malley (JIRA)
Move Delegation token into Common so that we can use it for MapReduce also
--

 Key: MAPREDUCE-1470
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1470
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Reporter: Owen O'Malley
Assignee: Owen O'Malley


We need to update one reference for map/reduce when we move the hdfs delegation 
tokens.




Re: Map-Balance-Reduce draft

2010-02-08 Thread jian yi
Hi Alan,

In my opinion, MBR solves the skew problem at minimum cost. It differs from
the combiner. My English is poor, but I will do my best to express myself
clearly.

In MBR, we can set the size of a task, for both map tasks and reduce tasks,
for example 120~150MB of input per task. The motive is to control a task's
run time by keeping the size of every task almost equal, which is the
precondition for treating a task as a timeslice that can be switched.

By hashing the map output into more splits, we can regroup smaller splits
into a new split of the specified size and rehash bigger splits into more,
smaller splits, until the size of every split is within the specified range.
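
The rehashing half of this step might look like the sketch below (regrouping
of small splits is left out). Everything here is an assumption for
illustration: the function names, the use of record counts instead of bytes,
the fanout of 4, and the depth limit. The depth limit is there because a
split made of one single key can never be broken up by hashing the key alone.

```python
import hashlib
from typing import List, Tuple

Record = Tuple[str, int]


def bucket_of(key: str, level: int, fanout: int) -> int:
    """Hash the key with the recursion level mixed in, so each level
    redistributes keys differently from the level above."""
    digest = hashlib.md5(f"{level}:{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % fanout


def rehash_until_small(records: List[Record], max_records: int,
                       fanout: int = 4, level: int = 0,
                       max_level: int = 8) -> List[List[Record]]:
    """Rehash an oversized split into smaller ones, recursively.

    Stops when a split fits within `max_records` or when `max_level` is
    reached (for example, when one key dominates the split).
    """
    if len(records) <= max_records or level >= max_level:
        return [records]
    buckets: List[List[Record]] = [[] for _ in range(fanout)]
    for key, value in records:
        buckets[bucket_of(key, level, fanout)].append((key, value))
    result: List[List[Record]] = []
    for bucket in buckets:
        if bucket:
            result.extend(rehash_until_small(bucket, max_records, fanout,
                                             level + 1, max_level))
    return result
```

All records of a given key still end up in the same split, so an ordinary
reduce remains correct; only a key with too many values is left oversized.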

The balance interface is like the map and reduce interfaces. It should be
implemented when a single key has too many values, because that key will be
hashed to more than one split. In that case we cannot get the final results
in one MBR session, so it is necessary to start another MBR session. For a
single key, we can hash on key+value; the balance interface will be told
about this action.
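
A rough sketch of the two-session case for one heavy key, using a sum as the
aggregation. It only works for operations whose partial results can be
combined again. The names spread_partition, session_one and session_two are
hypothetical, and crc32 is just a convenient deterministic hash, not part of
the proposal.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple

Record = Tuple[str, int]


def spread_partition(key: str, value: int, num_partitions: int) -> int:
    """Partition on key+value so one heavy key spreads over many splits."""
    data = f"{key}\x00{value}".encode("utf-8")
    return zlib.crc32(data) % num_partitions


def session_one(records: List[Record], num_partitions: int) -> List[Record]:
    """First session: each partition produces a partial sum per key."""
    partials: Dict[Tuple[int, str], int] = defaultdict(int)
    for key, value in records:
        partition = spread_partition(key, value, num_partitions)
        partials[(partition, key)] += value
    return [(key, subtotal) for (_, key), subtotal in partials.items()]


def session_two(partials: List[Record]) -> Dict[str, int]:
    """Second session: combine the partial sums into one result per key."""
    totals: Dict[str, int] = defaultdict(int)
    for key, subtotal in partials:
        totals[key] += subtotal
    return dict(totals)
```

For example, 1000 values of one key spread over 4 partitions leave at most 4
partial records for that key, which the second session then merges.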

Regards
Jian Yi
