The InputSplit on the MapContext implements the InputSupplier interface,
which allows you to get the underlying FileSplit that the map task is
processing. So you have to do a bunch of casting, but you can get at it.
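For reference, the casting chain looks roughly like this. This is only a sketch: it assumes the task context is a `MapContext` and that the split unwraps directly to a `FileSplit`; with Crunch's combined/wrapped splits there may be an extra unwrapping step through the supplier interface mentioned above, so verify against your Crunch version.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.MapContext;
import org.apache.hadoop.mapreduce.TaskInputOutputContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class SplitPathSketch {
  // Sketch only. Inside a DoFn/MapFn, getContext() returns the
  // TaskInputOutputContext; in a map task it is actually a MapContext,
  // which exposes the InputSplit for the task.
  static Path pathForSplit(TaskInputOutputContext<?, ?, ?, ?> context) {
    InputSplit split = ((MapContext<?, ?, ?, ?>) context).getInputSplit();
    if (split instanceof FileSplit) {
      return ((FileSplit) split).getPath();
    }
    // Wrapped/combined split: unwrap it first, then repeat the check.
    return null;
  }
}
```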

On Monday, June 22, 2015, David Ortiz <[email protected]> wrote:

> Gave it a shot in the following MapFn, but it seems to always return null.
>
> new MapFn<String, Pair<String, String>>() {
>
>    private static final long serialVersionUID = 1L;
>    int min = minColumns;
>    int max = maxColumns;
>
>    @Override
>    public Pair<String, String> map(String input) {
>       //int columns = StringUtils.countMatches(input, "\t") + 1;
>       int columns = input.split("\t").length;
>       if (columns >= min && columns <= max) {
>          StringBuilder output = new StringBuilder(input);
>          output.append('\t');
>       String loc = this.getContext().getConfiguration()
>             .get(TaskInputOutputContext.MAP_INPUT_FILE);
>          output.append(loc);
>          return new Pair<>(output.toString(), null);
>       } else {
>          return new Pair<>(null, input);
>       }
>    }
>
> }
>
>
> Also tried setting crunch.disable.combine.file to true figuring that combine 
> files might mess with it.  No dice.  Does anything look suspect in that 
> snippet?
>
>
> Thanks,
>
>     Dave
>
>
> On Mon, Jun 22, 2015 at 2:41 PM Micah Whitacre <[email protected]> wrote:
>
>> The DoFn should give you access to the TaskInputOutputContext [1], which
>> should contain that information.  I believe the context should then hold
>> the file name under a config key like "MAP_INPUT_FILE".  I haven't really
>> tested this out, so definitely verify.
>>
>>
>> [1] -
>> https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/TaskInputOutputContext.html
>>
>> On Mon, Jun 22, 2015 at 1:28 PM, David Ortiz <[email protected]> wrote:
>>
>>> Hello,
>>>
>>>       Is there a way in my Crunch pipeline to retrieve the file name
>>> of the input file for my MapFn?  This function is definitely applied as
>>> a Mapper, so I think it should be possible; I'm just having some
>>> difficulty working out the exact method of doing so.
>>>
>>> Thanks,
>>>       Dave
>>>
>>
>>

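One side note on the quoted snippet: counting columns with `input.split("\t").length` will under-count lines that end in empty columns, because String.split drops trailing empty strings by default. The commented-out countMatches approach, or split with a negative limit, does not have that problem. A quick illustration:

```java
public class SplitCountDemo {
    public static void main(String[] args) {
        String line = "a\tb\t\t"; // four columns, the last two empty
        // Default split drops trailing empty strings:
        System.out.println(line.split("\t").length);      // prints 2
        // A negative limit keeps them:
        System.out.println(line.split("\t", -1).length);  // prints 4
    }
}
```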
-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
