The InputSplit on the MapContext implements the InputSupplier interface, which allows you to get the underlying FileSplit that the map task is processing. So you have to do a bunch of casting, but you can get at it.
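[Editor's note: the cast chain described above might be sketched roughly like this. The stub types below stand in for the real Hadoop classes (MapContext, InputSplit, and FileSplit) purely so the pattern compiles on its own; in an actual job you would cast to the org.apache.hadoop.mapreduce types instead, and the split may be a wrapper rather than a FileSplit, so the instanceof check matters.]

```java
// Stand-ins for the Hadoop types, so the cast chain is visible in isolation.
class InputSplit {}                       // stub for o.a.h.mapreduce.InputSplit

class FileSplit extends InputSplit {      // stub for lib.input.FileSplit
    private final String path;
    FileSplit(String path) { this.path = path; }
    String getPath() { return path; }     // the real FileSplit returns a Path
}

interface InputSupplier {                 // the interface exposing the split
    InputSplit getInputSplit();
}

class MapContextStub implements InputSupplier {   // stub for MapContext
    public InputSplit getInputSplit() {
        return new FileSplit("hdfs:///data/part-00000");
    }
}

public class CastChain {
    // Walk the chain: context -> InputSplit -> FileSplit -> file path.
    static String inputFile(Object context) {
        InputSplit split = ((InputSupplier) context).getInputSplit();
        if (split instanceof FileSplit) {
            return ((FileSplit) split).getPath();
        }
        return null;  // e.g. a combined/wrapped split with no single file
    }

    public static void main(String[] args) {
        System.out.println(inputFile(new MapContextStub()));
    }
}
```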
On Monday, June 22, 2015, David Ortiz <[email protected]> wrote:

> Gave it a shot in the following MapFn, but it seems to always return null.
>
>     new MapFn<String, Pair<String, String>>() {
>
>         private static final long serialVersionUID = 1L;
>         int min = minColumns;
>         int max = maxColumns;
>
>         @Override
>         public Pair<String, String> map(String input) {
>             //int columns = StringUtils.countMatches(input, "\t") + 1;
>             int columns = input.split("\t").length;
>             if (columns >= min && columns <= max) {
>                 StringBuilder output = new StringBuilder(input);
>                 output.append('\t');
>                 String loc = this.getContext().getConfiguration()
>                         .get(TaskInputOutputContext.MAP_INPUT_FILE);
>                 output.append(loc);
>                 return new Pair<>(output.toString(), null);
>             } else {
>                 return new Pair<>(null, input);
>             }
>         }
>     }
>
> Also tried setting crunch.disable.combine.file to true, figuring that
> combine files might mess with it. No dice. Does anything look suspect in
> that snippet?
>
> Thanks,
> Dave
>
> On Mon, Jun 22, 2015 at 2:41 PM Micah Whitacre <[email protected]> wrote:
>
>> The DoFn should give you access to the TaskInputOutputContext [1], which
>> should contain that information. I believe the context then should hold
>> the file as a config like "MAP_INPUT_FILE". I haven't really tested this
>> out, so definitely verify.
>>
>> [1] - https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/TaskInputOutputContext.html
>>
>> On Mon, Jun 22, 2015 at 1:28 PM, David Ortiz <[email protected]> wrote:
>>
>>> Hello,
>>>
>>> Is there a way in my Crunch pipeline that I can retrieve the file name
>>> of the input file for my MapFn? This function is definitely applied as
>>> a Mapper, so I think it should be possible; I'm just having some
>>> difficulty working out the exact method of doing so.
>>>
>>> Thanks,
>>> Dave

-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
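[Editor's note: the MAP_INPUT_FILE config Micah mentions corresponds, as far as I know, to the key "mapreduce.map.input.file" in the MRv2 API; my understanding is that it is only populated when the task's split really is a plain FileSplit, which would be consistent with the nulls David sees under Crunch's wrapped/combined splits. A minimal sketch of that lookup, with a stub standing in for org.apache.hadoop.conf.Configuration:]

```java
import java.util.HashMap;
import java.util.Map;

// Stub for o.a.h.conf.Configuration: just a string-keyed property map.
class ConfStub {
    private final Map<String, String> props = new HashMap<>();
    void set(String k, String v) { props.put(k, v); }
    String get(String k) { return props.get(k); }
}

public class MapInputFile {
    // Assumed MRv2 key for the current task's input file; unset when the
    // split is not a plain FileSplit (e.g. Crunch's wrapped splits).
    static final String MAP_INPUT_FILE = "mapreduce.map.input.file";

    static String inputFileOr(ConfStub conf, String fallback) {
        String file = conf.get(MAP_INPUT_FILE);
        return file != null ? file : fallback;
    }

    public static void main(String[] args) {
        ConfStub conf = new ConfStub();   // key unset, as with a wrapped split
        System.out.println(inputFileOr(conf, "unknown"));
        conf.set(MAP_INPUT_FILE, "hdfs:///logs/2015-06-22.tsv");
        System.out.println(inputFileOr(conf, "unknown"));
    }
}
```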
