Re: Retrieving Input File Name with MRPipeline

David Ortiz Mon, 22 Jun 2015 12:40:29 -0700

I gave it a shot in the following MapFn, but the lookup always seems to return null.

new MapFn<String, Pair<String, String>>() {


   private static final long serialVersionUID = 1L;
   int min = minColumns;
   int max = maxColumns;

   @Override
   public Pair<String, String> map(String input) {
      //int columns = StringUtils.countMatches(input, "\t") + 1;
      int columns = input.split("\t").length;
      if (columns >= min && columns <= max) {
         StringBuilder output = new StringBuilder(input);
         output.append('\t');
         String loc = this.getContext().getConfiguration()
               .get(TaskInputOutputContext.MAP_INPUT_FILE);
         output.append(loc);
         return new Pair<>(output.toString(), null);
      } else {
         return new Pair<>(null, input);
      }
   }

}


I also tried setting crunch.disable.combine.file to true, figuring that
combined input files might interfere with it.  No dice.  Does anything
look suspect in that snippet?
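
A side note on the snippet, separate from the null lookup: the column
count uses input.split("\t").length, while the commented-out line counts
tabs and adds one. The two can disagree, because String.split with no
limit drops trailing empty strings. The sketch below is only an
illustration; the class and method names (ColumnCount,
countWithDefaultSplit, countKeepingTrailing) are hypothetical and not
part of the pipeline above.

```java
public class ColumnCount {

    // Default split: trailing empty strings are removed from the result,
    // so rows that end in empty columns are undercounted.
    static int countWithDefaultSplit(String line) {
        return line.split("\t").length;
    }

    // A negative limit keeps trailing empty strings, so the result
    // equals (number of tabs + 1).
    static int countKeepingTrailing(String line) {
        return line.split("\t", -1).length;
    }

    public static void main(String[] args) {
        String line = "a\tb\t\t"; // four columns, the last two empty
        System.out.println(countWithDefaultSplit(line)); // prints 2
        System.out.println(countKeepingTrailing(line));  // prints 4
    }
}
```

If rows with empty trailing columns should count toward min and max,
the -1 limit may be the behavior intended by the commented-out line.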


Thanks,

    Dave


On Mon, Jun 22, 2015 at 2:41 PM Micah Whitacre <[email protected]> wrote:

> The DoFn should give you access to the TaskInputOutputContext[1] which
> should contain that information.  I believe the context then should hold
> the file as a config like "MAP_INPUT_FILE".  I haven't really tested this
> out so definitely verify.
>
>
> [1] -
> https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/TaskInputOutputContext.html
>
> On Mon, Jun 22, 2015 at 1:28 PM, David Ortiz <[email protected]> wrote:
>
>> Hello,
>>
>>       Is there a way in my crunch pipeline that I can retrieve the file
>> name of the input file for my MapFn?  This function is definitely applied
>> as a Mapper, so I think it should be possible, just having some difficulty
>> working through the exact method of doing so.
>>
>> Thanks,
>>       Dave
>>
>
>
