That doesn't work, since casting CombineFileSplit to FileSplit throws a ClassCastException:
Error: java.lang.ClassCastException: org.apache.hadoop.mapreduce.lib.input.CombineFileSplit cannot be cast to org.apache.hadoop.mapreduce.lib.input.FileSplit
    at com.tagged.ramblas.archive.ArchiveCrunchFns$FromEvent2RedhshiftWithPartitionInfoFn.process(ArchiveCrunchFns.java:205)
    a

On Thu, Nov 10, 2016 at 10:30 AM, David Ortiz <[email protected]> wrote:

> Looks like what I was doing was:
>
> String loc = ((FileSplit) ((Supplier<InputSplit>) ((MapContext)
>     this.getContext()).getInputSplit()).get()).getPath().toString();
>
> I believe this was occurring on combined files, but it’s been a while, so I
> am not 100% sure.
>
> *From:* Marcin Michalski [mailto:[email protected]]
> *Sent:* Thursday, November 10, 2016 1:22 PM
> *To:* [email protected]
> *Subject:* Re: Figuring out to which CombineFileSplit the input record of
> DoFn process each record belongs to
>
> I have taken a look but don't see anything that would give me access to
> the input record's dir location. Could you point me in the right direction?
>
> On Nov 9, 2016, at 7:50 PM, David Ortiz <[email protected]> wrote:
>
> I can look up the exact methods in the morning, but in short, the DoFn
> does have a way to grab the TaskContext object when running using
> MapReduce. From there you can get the split.
>
> *Sent from my Verizon Wireless 4G LTE DROID*
>
> On Nov 9, 2016 9:56 PM, Marcin Michalski <[email protected]> wrote:
>
> Hi, is it possible to tie each record passed to a DoFn's process method to
> the single input split within a CombineFileSplit? I basically want to get some
> info from the input HDFS directory of the input split (/data/*20161109/11*)
> and use it to enhance each record that is being read by the process method. I
> was able to hack around the access issue of CrunchInputSplit by using reflection,
> but then I am not sure how to tie each input record to one input split,
> since my job reads multiple files from different directories that have
> date/hour information that I need.
>
> @Override
> public void process(GenericData.Record eventRecord,
>         Emitter<Pair<String, GenericData.Record>> pairEmitter) {
>     if (getContext() instanceof MapContext) {
>         InputSplit inputSplit = ((MapContext) getContext()).getInputSplit();
>         Class<? extends InputSplit> splitClass = inputSplit.getClass();
>
>         try {
>             // CrunchInputSplit is not accessible, so call its get() via reflection
>             Method getInputSplitMethod = splitClass.getDeclaredMethod("get");
>             getInputSplitMethod.setAccessible(true);
>             CombineFileSplit fileSplit =
>                     (CombineFileSplit) getInputSplitMethod.invoke(inputSplit);
>
>             System.out.println("number of input files: " + fileSplit.getPaths().length);
>             int index = 0;
>             for (Path p : fileSplit.getPaths()) {
>                 System.out.println("split length: " + fileSplit.getLength(index)
>                         + " partition: " + getPartitionDt(fileSplit.getPath(index)));
>                 index++;
>             }
>         } catch (Exception e) {
>             System.out.println("we have a problem");
>             e.printStackTrace();
>         }
>     }
> }
>
> ...now I want to output a pair of partition info (YYYYMMDDHH) and some
> modified avro record. Any idea how I can get the directory information of
> the input split that is being processed by each call of the process method?
>
> ...
>
> emit(Pair.of(partition, some_avro_record));
>
> I know that I could disable the combined input file format, but I don't
> want to do that.
>
> Thanks!
>
> --
>
> Marcin Michalski | Big Data Engineer
> [email protected] <[email protected]> | (917) 478-9422 (c)
> <http://www.ifwe.co/>
> Tagged, Inc. is now if(we). Learn more at ifwe.co
>
> *This email is intended only for the use of the individual(s) to whom it
> is addressed. If you have received this communication in error, please
> immediately notify the sender and delete the original email.*
>
> *Disclaimer*
>
> The information contained in this communication from the sender is
> confidential. It is intended solely for use by the recipient and others
> authorized to receive it.
> If you are not the recipient, you are hereby
> notified that any disclosure, copying, distribution or taking action in
> relation of the contents of this information is strictly prohibited and may
> be unlawful.
>
> This email has been scanned for viruses and malware, and may have been
> automatically archived by *Mimecast Ltd*, an innovator in Software as a
> Service (SaaS) for business. Providing a *safer* and *more useful* place
> for your human generated data. Specializing in; Security, archiving and
> compliance. To find out more Click Here
> <http://www.mimecast.com/products/>.

--
Marcin Michalski | Big Data Engineer
[email protected] <[email protected]> | (917) 478-9422 (c)
<http://www.ifwe.co/>
Tagged, Inc. is now if(we). Learn more at ifwe.co
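
P.S. For the path-parsing half of the question only, here is a minimal sketch of what a helper like getPartitionDt might look like. It is an assumption, not the code from the job: the class name PartitionFromPath, the method partitionOf, and the directory layout (.../<yyyyMMdd>/<HH>/<file>, as in /data/20161109/11) are all hypothetical, and it takes plain strings rather than Hadoop Path objects so that it runs without Hadoop on the classpath. It does not address which file of a CombineFileSplit the current record came from; that still needs to come from whatever is reading the records.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PartitionFromPath {
    // Assumed layout (hypothetical): .../<yyyyMMdd>/<HH>/<file> or .../<yyyyMMdd>/<HH>
    private static final Pattern DATE_HOUR =
            Pattern.compile("/(\\d{8})/(\\d{2})(?:/|$)");

    /**
     * Returns the partition as yyyyMMddHH, or empty if the path
     * does not contain a date directory followed by an hour directory.
     */
    static Optional<String> partitionOf(String path) {
        Matcher m = DATE_HOUR.matcher(path);
        return m.find() ? Optional.of(m.group(1) + m.group(2)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Stand-ins for the strings you would get from Path.toString()
        String[] paths = {
            "/data/20161109/11/part-0.avro",
            "/data/20161109/12",
            "/data/other/file.avro"
        };
        for (String p : paths) {
            System.out.println(p + " -> " + partitionOf(p).orElse("unknown"));
        }
    }
}
```

Returning Optional rather than throwing keeps a single unexpected path from failing the whole task; whether that is the right trade-off depends on how strict the job needs to be about its input layout.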
