[jira] [Updated] (MAPREDUCE-6423) MapOutput Sampler

2015-08-21 Thread Chris Douglas (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Douglas updated MAPREDUCE-6423:
-
Status: Open  (was: Patch Available)

 MapOutput Sampler
 -

 Key: MAPREDUCE-6423
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6423
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Reporter: Ram Manohar Bheemana
Assignee: Ram Manohar Bheemana
Priority: Minor
 Attachments: MapOutputSampler.java


 Need a sampler based on the map output keys. The current InputSampler 
 implementation has a major drawback: it requires the input and output of a 
 mapper to be the same, which is generally not the case.
 Approach:
 1. Create a Sampler which samples the data based on the input.
 2. Run a small MapReduce job in uber task mode, using the original job's 
 mapper and an identity reducer, to generate the required map output sample keys.
 3. Optionally, allow the input files used for sampling to be specified. For 
 example, given input files A and B, we should be able to specify that only 
 file A is used for sampling.
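
 The three steps above can be sketched in plain Java without any Hadoop 
 dependencies. This is only an illustration of the idea, not the attached 
 MapOutputSampler.java: the class name, method names, the every-Nth-record 
 sampling rule, and the one-key-per-record mapper are all assumptions made 
 for the example. In a real job the sorting would come from the shuffle and 
 the identity reducer rather than from Collections.sort.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

// Hypothetical, Hadoop-free sketch of sampling on map OUTPUT keys.
final class MapOutputSamplerSketch {

    /** Step 1: sample the input by taking every step-th record. */
    public static <T> List<T> sampleInput(List<T> records, int step) {
        if (step < 1) {
            throw new IllegalArgumentException("step must be >= 1");
        }
        List<T> sample = new ArrayList<>();
        for (int i = 0; i < records.size(); i += step) {
            sample.add(records.get(i));
        }
        return sample;
    }

    /**
     * Step 2: run the job's own map function over the sampled records and
     * collect the keys it emits, sorted. Simplification: the mapper here
     * emits exactly one key per record.
     */
    public static <T, K extends Comparable<K>> List<K> sampleMapOutputKeys(
            List<T> sampledRecords, Function<T, K> mapper) {
        List<K> keys = new ArrayList<>();
        for (T record : sampledRecords) {
            keys.add(mapper.apply(record));
        }
        Collections.sort(keys);
        return keys;
    }

    /** Pick numPartitions - 1 evenly spaced split points from the sorted keys. */
    public static <K> List<K> splitPoints(List<K> sortedKeys, int numPartitions) {
        List<K> splits = new ArrayList<>();
        if (sortedKeys.isEmpty()) {
            return splits; // nothing sampled, so no split points
        }
        for (int i = 1; i < numPartitions; i++) {
            splits.add(sortedKeys.get(i * sortedKeys.size() / numPartitions));
        }
        return splits;
    }

    public static void main(String[] args) {
        // Step 3: two input "files"; only file A is handed to the sampler.
        List<String> fileA = List.of("9,pear", "3,apple", "7,kiwi", "1,fig", "5,lime", "8,plum");
        List<String> fileB = List.of("2,date", "4,guava");

        // Hypothetical mapper: the output key (fruit name) differs from the
        // input key (the leading number), which is the case InputSampler misses.
        Function<String, String> mapper = line -> line.split(",")[1];

        List<String> keys = sampleMapOutputKeys(sampleInput(fileA, 1), mapper);
        System.out.println("split points: " + splitPoints(keys, 3));
        System.out.println("records in unsampled file B: " + fileB.size());
    }
}
```

 The split points are taken from keys the mapper actually emits, so they 
 follow the distribution of the map output rather than that of the raw input.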



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAPREDUCE-6423) MapOutput Sampler

2015-07-20 Thread Ram Manohar Bheemana (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ram Manohar Bheemana updated MAPREDUCE-6423:

Status: Patch Available  (was: In Progress)

Please review the attached MapOutputSampler.java



[jira] [Updated] (MAPREDUCE-6423) MapOutput Sampler

2015-07-20 Thread Ram Manohar Bheemana (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ram Manohar Bheemana updated MAPREDUCE-6423:

Attachment: MapOutputSampler.java
