On Tue, May 5, 2015 at 9:08 AM, Jeff Quinn <[email protected]> wrote:
> Josh,
>
> Thanks so much for your response, you’re correct I hit the error while
> using the MemPipeline. The difference between Source and ReadableSource
> makes much more sense to me now.
>
> It sounds like I just need to implement ReadableSource and override the
> #read and #asReadable methods with behavior that is equivalent to how my
> `InputFormat` would act. Then I should be able to use my `InputFormat` in
> my test suite with MemPipeline, and in my real pipeline I can rest assured
> those methods will never be called.
>

That will work, but I still think the right thing to do is to make those
formattedFile impls support ReadableSource. And there are definitely places
in the MRPipeline and MemPipeline where ReadableSources would be useful
w/formattedFiles (e.g., mapside joins) that we don't support right now.

>
> Best,
>
> Jeff
>
> On May 4, 2015, at 11:53 PM, Josh Wills <[email protected]> wrote:
>
> Any Source<T> can be used as the input to an MR/Spark job via
> Pipeline.read, but a ReadableSource<T> can read data into the local client
> as well-- I'm assuming you're hitting an error trying to use your
> formattedFile source w/a MemPipeline job? MemPipeline requires
> ReadableSources since everything it does runs client-side, while MRPipeline
> and SparkPipeline are happy to use regular Sources, like the one returned
> by formattedFile.
>
> The next question you would ask is "why doesn't formattedFile return a
> ReadableSource<T>?" -- and it's a good one. I don't remember if there's a
> good reason for it or if I was just being lazy. Will take a look and report
> back.
>
> J
>
> On Tue, May 5, 2015 at 8:38 AM, Jeff Quinn <[email protected]> wrote:
>
>> Hello,
>>
>> I would like to know the byte offset (absolute offset, not relative to
>> split) for each record inside of my crunch pipeline.
>>
>> My planned approach is to use a custom `InputFormat` class.
>>
>> I have tried using `From#formattedFile` to apply a custom
>> `InputFormat` class, however the returned class does not implement
>> `ReadableSource`, and thus cannot be used as a parameter for
>> `Pipeline#read`.
>>
>> What is the purpose of the `From#formattedFile` method if the Source
>> class it returns cannot actually be read? Is using a custom
>> `InputFormat` class possible or recommended?
>>
>> Thanks,
>>
>> Jeff Quinn
>> Data Engineer
>> Nuna
>>
>> *DISCLAIMER:* The contents of this email, including any attachments, may
>> contain information that is confidential, proprietary in nature, protected
>> health information (PHI), or otherwise protected by law from disclosure,
>> and is solely for the use of the intended recipient(s). If you are not the
>> intended recipient, you are hereby notified that any use, disclosure or
>> copying of this email, including any attachments, is unauthorized and
>> strictly prohibited. If you have received this email in error, please
>> notify the sender of this email. Please delete this and all copies of this
>> email from your system. Any opinions either expressed or implied in this
>> email and all attachments, are those of its author only, and do not
>> necessarily reflect those of Nuna Health, Inc.
>
>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com/>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>
--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
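
On the byte-offset question that started the thread, here is a minimal plain-Java sketch of the bookkeeping a line-oriented record reader would have to do to report each record's absolute offset. It uses only the JDK and deliberately has no Crunch or Hadoop dependency. `LineOffsets`, `Record`, and `read` are hypothetical names made up for illustration and are not part of either library. It also assumes `\n`-terminated UTF-8 records and a single reader over the whole file, so it does not deal with splits. Hadoop's stock `TextInputFormat` is documented as keying each line by its position in the file, so it may be worth checking whether that already covers the use case before writing a custom `InputFormat`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not a Crunch or Hadoop class.
public class LineOffsets {

    /** One record: the absolute byte offset of the line start, plus its text. */
    public static final class Record {
        public final long offset;
        public final String line;

        Record(long offset, String line) {
            this.offset = offset;
            this.line = line;
        }
    }

    /** Reads '\n'-delimited UTF-8 lines, recording where each line starts. */
    public static List<Record> read(InputStream in) throws IOException {
        List<Record> records = new ArrayList<>();
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        long pos = 0;        // number of bytes consumed so far
        long lineStart = 0;  // offset of the first byte of the current line
        int b;
        while ((b = in.read()) != -1) {
            pos++;
            if (b == '\n') {
                records.add(new Record(lineStart,
                        new String(buf.toByteArray(), StandardCharsets.UTF_8)));
                buf.reset();
                lineStart = pos; // the next line starts right after the newline
            } else {
                buf.write(b);
            }
        }
        // Emit a final line that has no trailing newline.
        if (buf.size() > 0) {
            records.add(new Record(lineStart,
                    new String(buf.toByteArray(), StandardCharsets.UTF_8)));
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "alpha\nbeta\ngamma\n".getBytes(StandardCharsets.UTF_8);
        for (Record r : read(new ByteArrayInputStream(data))) {
            System.out.println(r.offset + "\t" + r.line);
        }
    }
}
```

With the sample input in `main`, the three lines start at offsets 0, 6, and 11. A split-aware reader would additionally need to add the split's start position to `lineStart`, which is the part a custom `InputFormat` would have to get right.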
