[jira] [Commented] (MAPREDUCE-6423) MapOutput Sampler
[ https://issues.apache.org/jira/browse/MAPREDUCE-6423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14742215#comment-14742215 ]

Ram Manohar Bheemana commented on MAPREDUCE-6423:
-------------------------------------------------

Sorry for the delay in responding; I will try to generate the patch as suggested.

> MapOutput Sampler
> -----------------
>
>                 Key: MAPREDUCE-6423
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6423
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>            Reporter: Ram Manohar Bheemana
>            Assignee: Ram Manohar Bheemana
>            Priority: Minor
>         Attachments: MapOutputSampler.java
>
>
> We need a sampler based on the MapOutput keys. The current InputSampler
> implementation has a major drawback: it requires the input and output of a
> mapper to be the same, which is generally not the case.
> Approach:
> 1. Create a Sampler which samples the data based on the input.
> 2. Run a small MapReduce job in uber task mode, using the original job's mapper
> and an identity reducer, to generate the required MapOutputSample keys.
> 3. Optionally, allow specifying which input files to sample. For example, given
> input files A and B, we should be able to specify that only file A is used for
> sampling.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
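The approach described in the issue can be illustrated with a minimal plain-Java sketch. It deliberately does not use the Hadoop API, and it is not the attached MapOutputSampler.java: the class name `MapOutputSamplerSketch` and the methods `sampleInputs`, `mapOutputKeys`, and `splitPoints` are hypothetical names chosen for illustration. The idea is to sample input records, run the job's mapper over the sample, sort the resulting map output keys, and pick evenly spaced keys as partition boundaries. Restricting sampling to a single file (step 3) is assumed to mean passing only that file's records as `inputs`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.function.Function;

/** Plain-Java illustration of sampling map OUTPUT keys; not Hadoop API. */
final class MapOutputSamplerSketch {

    /** Step 1: pick up to sampleSize input records uniformly (reservoir sampling). */
    static <I> List<I> sampleInputs(List<I> inputs, int sampleSize, long seed) {
        Random rnd = new Random(seed);
        List<I> reservoir = new ArrayList<>();
        for (int i = 0; i < inputs.size(); i++) {
            if (i < sampleSize) {
                reservoir.add(inputs.get(i));
            } else {
                // Replace a random slot with probability sampleSize / (i + 1).
                int j = rnd.nextInt(i + 1);
                if (j < sampleSize) {
                    reservoir.set(j, inputs.get(i));
                }
            }
        }
        return reservoir;
    }

    /**
     * Step 2: apply the job's mapper to the sampled records and collect the
     * emitted keys in sorted order. An identity reducer would pass these keys
     * through unchanged, so sorting here stands in for the shuffle.
     */
    static <I, K extends Comparable<K>> List<K> mapOutputKeys(
            List<I> sample, Function<I, List<K>> mapper) {
        List<K> keys = new ArrayList<>();
        for (I record : sample) {
            keys.addAll(mapper.apply(record));
        }
        Collections.sort(keys);
        return keys;
    }

    /** Pick numPartitions - 1 evenly spaced keys as partition boundaries. */
    static <K> List<K> splitPoints(List<K> sortedKeys, int numPartitions) {
        List<K> splits = new ArrayList<>();
        if (sortedKeys.isEmpty()) {
            return splits;
        }
        for (int i = 1; i < numPartitions; i++) {
            int idx = (int) ((long) i * sortedKeys.size() / numPartitions);
            splits.add(sortedKeys.get(Math.min(idx, sortedKeys.size() - 1)));
        }
        return splits;
    }

    public static void main(String[] args) {
        // Hypothetical input records; the "mapper" emits each record's length,
        // so the output key type (Integer) differs from the input type (String).
        List<String> inputs = Arrays.asList(
                "a", "bbb", "cc", "dddd", "eeeee", "ffffff", "ggggggg", "hhhhhhhh");
        List<String> sample = sampleInputs(inputs, 8, 42L); // sample size == input size
        List<Integer> keys = MapOutputSamplerSketch.<String, Integer>mapOutputKeys(
                sample, s -> Collections.singletonList(s.length()));
        System.out.println(splitPoints(keys, 4)); // prints [3, 5, 7]
    }
}
```

Because the boundaries are computed from the mapper's output keys rather than from the raw input, they remain meaningful when the mapper changes the key type or distribution, which is the case the issue says InputSampler does not handle.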
[jira] [Commented] (MAPREDUCE-6423) MapOutput Sampler
[ https://issues.apache.org/jira/browse/MAPREDUCE-6423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14707436#comment-14707436 ]

Chris Douglas commented on MAPREDUCE-6423:
------------------------------------------

Thanks for taking a look at this. That the sampler only works on input data was always a weakness for jobs requiring their output be totally ordered.

Could you generate a patch? The contribution wiki is [here|http://wiki.apache.org/hadoop/HowToContribute]. It might be easier for others to use if the Mapper was integrated with the InputSampler, but a separate tool is still an improvement.