Images is not visible of gmail?//Map-Balance-Reduce draft
Two targets:
1. Solve the skew problem.
2. Treat a task as a timeslice to improve the scheduler, so that it can switch from one job to another at timeslice boundaries.

In the MR (Map-Reduce) model, reduce tasks are not balanced, because partition sizes are unbalanced. How can we balance them? We can control the partition size: rehash the bigger partitions and combine the smaller ones up to the specified size. If a single key has many values, it is necessary to execute MapReduce twice. The following is the model diagram: mbr1.jpg (attachment)

The scheduler can treat a task as a timeslice, much as an OS scheduler does. If a split is bigger than a specified size, it is split again. If a split is smaller than a specified size, it is combined with others; we can call this combining procedure "regroup". The combining is logical: it is not necessary to merge the smaller splits into a single file on disk, so it does not affect performance. The goal is for every task to take the same time to run. mbr2.jpg (attachment)
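The split-and-regroup idea in the message above can be illustrated with a small Python sketch. It is not taken from the proposal: the function names `cut` and `regroup`, the single `target` size (the message speaks of a specified size without giving an algorithm), and the first-fit-decreasing packing are all assumptions made for illustration. Groups hold only (split index, offset, length) references, which mirrors the point that the combining is logical and no data is copied.

```python
def cut(split_sizes, target):
    """Cut every split into pieces of at most `target` units.

    Returns (split_index, offset, length) tuples; no data is moved.
    """
    pieces = []
    for index, size in enumerate(split_sizes):
        offset = 0
        while offset < size:
            length = min(target, size - offset)
            pieces.append((index, offset, length))
            offset += length
    return pieces


def regroup(pieces, target):
    """Pack pieces into logical groups whose total is at most `target`.

    Uses first-fit decreasing, chosen here only because it is simple.
    """
    groups = []  # each entry is [total_length, [pieces]]
    for piece in sorted(pieces, key=lambda p: p[2], reverse=True):
        for group in groups:
            if group[0] + piece[2] <= target:
                group[0] += piece[2]
                group[1].append(piece)
                break
        else:
            groups.append([piece[2], [piece]])
    return [members for _, members in groups]


# Hypothetical split sizes in MB, with one heavily skewed split.
sizes = [400, 30, 50, 150, 20]
tasks = regroup(cut(sizes, 150), 150)
totals = [sum(length for _, _, length in task) for task in tasks]
```

With these made-up sizes, the 400 MB split is cut into three pieces and the small splits are packed together, so no task receives more than 150 MB. Equal input size only approximates equal run time, which is the stated goal.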
[jira] Reopened: (MAPREDUCE-927) Cleanup of task-logs should happen in TaskTracker instead of the Child
[ https://issues.apache.org/jira/browse/MAPREDUCE-927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu reopened MAPREDUCE-927:

Cleanup of task-logs should happen in TaskTracker instead of the Child

Key: MAPREDUCE-927
URL: https://issues.apache.org/jira/browse/MAPREDUCE-927
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: tasktracker
Affects Versions: 0.21.0
Reporter: Vinod K V
Priority: Blocker
Fix For: 0.21.0

Task logs' cleanup is currently done in the Child. This is undesirable for at least two reasons: 1) failures while cleaning up will affect the user's tasks, and 2) the task's wall time is affected by operations that the TT (TaskTracker) should actually own.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
Re: Images is not visible of gmail?//Map-Balance-Reduce draft
Hi Jian,

No attachments can be seen in Gmail.

On Mon, Feb 8, 2010 at 5:23 PM, jian yi eyj...@gmail.com wrote:
> Two targets: 1. Solving the skew problem 2. Regarding a task as a timeslice to improve on scheduler, switching a job to another job by timeslice. [...]
[jira] Created: (MAPREDUCE-1466) FileInputFormat should save #input-files in JobConf
FileInputFormat should save #input-files in JobConf

Key: MAPREDUCE-1466
URL: https://issues.apache.org/jira/browse/MAPREDUCE-1466
Project: Hadoop Map/Reduce
Issue Type: Improvement
Components: client
Reporter: Arun C Murthy
Assignee: Arun C Murthy
Priority: Minor
Fix For: 0.22.0

We already track the amount of data consumed by MR applications (MAP_INPUT_BYTES). Along with that, it would be useful to track #input-files from the client side for analysis. Along the lines of MAPREDUCE-1403, it would be easy to stick it in the JobConf during job submission.
[jira] Created: (MAPREDUCE-1467) Add a --verbose flag to Sqoop
Add a --verbose flag to Sqoop

Key: MAPREDUCE-1467
URL: https://issues.apache.org/jira/browse/MAPREDUCE-1467
Project: Hadoop Map/Reduce
Issue Type: Improvement
Components: contrib/sqoop
Reporter: Aaron Kimball
Assignee: Aaron Kimball
Priority: Minor
Attachments: MAPREDUCE-1467.patch

Need a {{--verbose}} flag that sets the log4j level to DEBUG.
Re: Map-Balance-Reduce draft
Jian,

Sorry if any of my questions or comments would have been answered by the diagrams, but Apache lists don't allow attachments, so I can't see your diagrams.

If I understand correctly, your suggestion for balancing is to apply reduce on subsets of the hashed data, and then run reduce again on this reduced data set. Is that correct? If so, how does this differ from the combiner?

Second, some aggregation operations truly aren't algebraic (that is, they cannot be distributed across multiple iterations of reduce). An example of this is session analysis, where the algorithm truly needs to see all operations together to analyze the user session. How do you propose to handle that case?

Alan.

On Feb 7, 2010, at 11:25 PM, jian yi wrote:
> Two targets: 1. Solving the skew problem 2. Regarding a task as a timeslice to improve on scheduler, switching a job to another job by timeslice. [...]
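The distinction Alan draws between algebraic and non-algebraic aggregations can be made concrete with a small Python sketch. The data, the 30-unit session gap, and the function `count_sessions` are invented for illustration; the thread does not define session analysis this precisely. The sketch reduces two subsets separately and then combines the partial results, which is the two-pass scheme the question asks about.

```python
def count_sessions(timestamps, gap):
    """Count sessions: a new session starts when the pause exceeds `gap`."""
    ordered = sorted(timestamps)
    if not ordered:
        return 0
    sessions = 1
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous > gap:
            sessions += 1
    return sessions


# Hypothetical event times for one user, and one way a hash might split them.
events = [0, 10, 20, 100, 110]
subset_a = [0, 20, 110]
subset_b = [10, 100]

# Sum is algebraic: reducing the partial sums gives the true total.
sum_direct = sum(events)
sum_two_pass = sum([sum(subset_a), sum(subset_b)])

# Session counting is not: each subset sees gaps that do not really exist.
sessions_direct = count_sessions(events, gap=30)
sessions_two_pass = (count_sessions(subset_a, gap=30)
                     + count_sessions(subset_b, gap=30))
```

Here the two-pass sum matches the direct sum, while the two-pass session count comes out larger than the direct count. A scheme that spreads one key over several reducers therefore needs either an algebraic reduce function or a final pass that sees all of the key's records together.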
[jira] Created: (MAPREDUCE-1468) Too many log messages generated if a node has a different Hadoop version installed
Too many log messages generated if a node has a different Hadoop version installed

Key: MAPREDUCE-1468
URL: https://issues.apache.org/jira/browse/MAPREDUCE-1468
Project: Hadoop Map/Reduce
Issue Type: Improvement
Affects Versions: 0.20.1
Reporter: Qi Liu

When upgrading or downgrading Hadoop on a large cluster, it is possible that some nodes are not upgraded/downgraded. In this case, the jobtracker and namenode generate gigabytes of logs each day, reporting a protocol version mismatch. Such a log message is generated once every 3 ms. Ideally, the misbehaving node should quit retrying after several attempts if there is a protocol version mismatch. At the very least, the log messages should be minimized (maybe once a day or once per startup).
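The two remedies suggested in the issue above, a bounded number of retries and a throttled log message, can be sketched in Python as follows. This is only an illustration of the idea and is not Hadoop code: the class `ThrottledLog`, the function `retry_limited`, and their parameters are hypothetical names, and the clock and sink are injectable purely so the behaviour can be checked without waiting.

```python
import time


class ThrottledLog:
    """Emit a message for a given key at most once per interval."""

    def __init__(self, interval_seconds, clock=time.monotonic, sink=print):
        self.interval = interval_seconds
        self.clock = clock
        self.sink = sink
        self.last_emitted = {}   # key -> time of last emitted message
        self.suppressed = {}     # key -> messages dropped since then

    def warn(self, key, message):
        """Return True if the message was emitted, False if suppressed."""
        now = self.clock()
        last = self.last_emitted.get(key)
        if last is not None and now - last < self.interval:
            self.suppressed[key] = self.suppressed.get(key, 0) + 1
            return False
        skipped = self.suppressed.pop(key, 0)
        suffix = f" (suppressed similar messages: {skipped})" if skipped else ""
        self.sink(message + suffix)
        self.last_emitted[key] = now
        return True


def retry_limited(attempt, max_attempts):
    """Call `attempt` until it returns True, giving up after max_attempts."""
    for _ in range(max_attempts):
        if attempt():
            return True
    return False
```

For the "once a day" variant mentioned in the issue, `interval_seconds` would be 86400; "once per startup" corresponds to an interval longer than the process lifetime.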
[jira] Created: (MAPREDUCE-1470) Move Delegation token into Common so that we can use it for MapReduce also
Move Delegation token into Common so that we can use it for MapReduce also

Key: MAPREDUCE-1470
URL: https://issues.apache.org/jira/browse/MAPREDUCE-1470
Project: Hadoop Map/Reduce
Issue Type: Improvement
Reporter: Owen O'Malley
Assignee: Owen O'Malley

We need to update one reference for map/reduce when we move the hdfs delegation tokens.
Re: Map-Balance-Reduce draft
Hi Alan,

In my opinion, MBR solves the skew problem at minimum cost. It differs from the combiner. My English is poor, but I will do my best to express myself clearly.

In MBR, we can set the size of a task, for both map tasks and reduce tasks, for example 120~150 MB of input per task. The motive is to control each task's run time by keeping the size of every task almost equal, which is the precondition for treating a task as a timeslice that can be switched.

By hashing the map output into more splits, we can regroup smaller splits into a new split of the specified size, and hash bigger splits into more, smaller splits, until the size of every split is within the specified range.

The balance interface is like the map and reduce interfaces. It should be implemented when a single key has too many values, because that key will be hashed to more than one split. In that case we cannot get the final results in one MBR session, and it is necessary to start a second MBR session. For a single key, we can hash on key+value; the balance interface is notified of this action.

Regards,
Jian Yi

2010/2/9 Alan Gates ga...@yahoo-inc.com
> Jian, Sorry if any of my questions or comments would have been answered by the diagrams, but apache lists don't allow attachments, so I can't see your diagrams. [...]
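The key+value hashing described in Jian's reply can be sketched in Python as a two-session count. Everything here is an assumption made for illustration: the thread does not define the balance interface, so the functions `partition`, `first_session`, and `second_session`, the use of CRC32 as the hash, and the explicit `hot_keys` set are hypothetical. The example uses counting, which is algebraic; it does not address the non-algebraic case raised earlier in the thread.

```python
import zlib
from collections import defaultdict


def partition(key, value, num_splits, salted):
    """Pick a split; a salted key is hashed together with its value."""
    basis = f"{key}|{value}" if salted else key
    return zlib.crc32(basis.encode("utf-8")) % num_splits


def first_session(records, num_splits, hot_keys):
    """Spread hot keys over several splits and compute partial counts."""
    splits = defaultdict(lambda: defaultdict(int))
    for key, value in records:
        index = partition(key, value, num_splits, salted=key in hot_keys)
        splits[index][key] += 1
    return splits


def second_session(splits):
    """Merge the partial counts of each key into the final result."""
    totals = defaultdict(int)
    for partial_counts in splits.values():
        for key, count in partial_counts.items():
            totals[key] += count
    return dict(totals)


# Hypothetical input: one heavily skewed key and one ordinary key.
records = [("hot", i) for i in range(1000)] + [("cold", 0), ("cold", 1)]
splits = first_session(records, num_splits=4, hot_keys={"hot"})
totals = second_session(splits)
splits_holding_hot = sum(1 for counts in splits.values() if "hot" in counts)
```

After the first session the records for "hot" sit in more than one split, so no single task has to process all of them; the second session then restores one result per key. How hot keys would be detected in practice is left open in the thread, which is why the sketch simply takes them as a parameter.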