Any Source<T> can be used as the input to an MR/Spark job via Pipeline.read, but a ReadableSource<T> can also be read into the local client. I'm assuming you're hitting an error trying to use your formattedFile source with a MemPipeline job? MemPipeline requires ReadableSources since everything it does runs client-side, while MRPipeline and SparkPipeline are happy to use regular Sources, like the one returned by formattedFile.
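To make the distinction concrete, here's a rough, self-contained sketch of the idea. It is a toy model, not Crunch's actual code: the SourceModel class, its nested Source/ReadableSource stand-ins, FormattedFileSource, ListSource, readInMemory, and planClusterRead are all hypothetical names made up for illustration. The only point is that a client-side pipeline needs a source it can iterate locally, while a cluster pipeline just needs something to describe in its job plan.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: these types are hypothetical stand-ins, not the Crunch classes.
public class SourceModel {

    /** Stand-in for a source that a cluster job knows how to consume. */
    interface Source<T> {
        String describe();
    }

    /** Stand-in for a source that can also be iterated in the client JVM. */
    interface ReadableSource<T> extends Source<T> {
        Iterable<T> read();
    }

    /** A plain source: it has a location, but no client-side read path. */
    static final class FormattedFileSource<T> implements Source<T> {
        private final String path;

        FormattedFileSource(String path) {
            this.path = path;
        }

        public String describe() {
            return "formatted(" + path + ")";
        }
    }

    /** A readable source backed by an in-memory list. */
    static final class ListSource<T> implements ReadableSource<T> {
        private final List<T> items;

        ListSource(List<T> items) {
            this.items = items;
        }

        public String describe() {
            return "list";
        }

        public Iterable<T> read() {
            return items;
        }
    }

    /** Models an in-memory pipeline: it must pull the records itself. */
    static <T> List<T> readInMemory(Source<T> source) {
        if (!(source instanceof ReadableSource)) {
            throw new IllegalStateException(
                "not readable client-side: " + source.describe());
        }
        List<T> out = new ArrayList<>();
        for (T item : ((ReadableSource<T>) source).read()) {
            out.add(item);
        }
        return out;
    }

    /** Models a cluster pipeline: it only records the source in a plan. */
    static <T> String planClusterRead(Source<T> source) {
        return "read " + source.describe();
    }

    public static void main(String[] args) {
        Source<String> plain = new FormattedFileSource<String>("/data/in");
        // The cluster-style read accepts any Source.
        System.out.println(planClusterRead(plain));
        try {
            // The in-memory read rejects a Source that is not readable.
            readInMemory(plain);
        } catch (IllegalStateException e) {
            System.out.println("in-memory read failed: " + e.getMessage());
        }
    }
}
```

In this model the fix for the in-memory case is to hand over something that implements the readable interface (ListSource here); whether formattedFile could do the same in Crunch is the open question.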
The next question you would ask is "why doesn't formattedFile return a ReadableSource<T>?", and it's a good one. I don't remember if there's a good reason for it or if I was just being lazy. Will take a look and report back.

J

On Tue, May 5, 2015 at 8:38 AM, Jeff Quinn <[email protected]> wrote:

> Hello,
>
> I would like to know the byte offset (absolute offset, not relative to
> split) for each record inside of my Crunch pipeline.
>
> My planned approach is to use a custom `InputFormat` class.
>
> I have tried using `From#formattedFile` to apply a custom
> `InputFormat` class, however the returned class does not implement
> `ReadableSource`, and thus cannot be used as a parameter for
> `Pipeline#read`.
>
> What is the purpose of the `From#formattedFile` method if the Source
> it returns cannot actually be read? Is using a custom `InputFormat`
> class possible or recommended?
>
> Thanks,
>
> Jeff Quinn
> Data Engineer
> Nuna

--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
