https://issues.apache.org/jira/browse/CRUNCH-517
Patch is up. I also fixed that stupid crunch-spark compile error on hadoop1. I
so cannot wait to get rid of hadoop1. :)

J

On Tue, May 5, 2015 at 8:21 AM, Jeff Quinn <[email protected]> wrote:

> Great. I would definitely agree, that sounds ideal.
>
> Thanks,
>
> Jeff
>
> On May 5, 2015, at 12:14 AM, Josh Wills <[email protected]> wrote:
>
> On Tue, May 5, 2015 at 9:08 AM, Jeff Quinn <[email protected]> wrote:
>
>> Josh,
>>
>> Thanks so much for your response, you’re correct I hit the error while
>> using the MemPipeline. The difference between Source and ReadableSource
>> makes much more sense to me now.
>>
>> It sounds like I just need to implement ReadableSource and override the
>> #read and #asReadable methods with behavior that is equivalent to how my
>> `InputFormat` would act. Then I should be able to use my `InputFormat` in
>> my test suite with MemPipeline, and in my real pipeline I can rest assured
>> those methods will never be called.
>>
>
> That will work, but I still think the right thing to do is to make those
> formattedFile impls support ReadableSource. And there are definitely places
> in the MRPipeline and MemPipeline where ReadableSources would be useful
> w/formattedFiles (e.g., mapside joins) that we don't support right now.
>
>>
>> Best,
>>
>> Jeff
>>
>> On May 4, 2015, at 11:53 PM, Josh Wills <[email protected]> wrote:
>>
>> Any Source<T> can be used as the input to an MR/Spark job via
>> Pipeline.read, but a ReadableSource<T> can read data into the local client
>> as well-- I'm assuming you're hitting an error trying to use your
>> formattedFile source w/a MemPipeline job? MemPipeline requires
>> ReadableSources since everything it does runs client-side, while MRPipeline
>> and SparkPipeline are happy to use regular Sources, like the one returned
>> by formattedFile.
>>
>> The next question you would ask is "why doesn't formattedFile return a
>> ReadableSource<T>?" -- and it's a good one.
>> I don't remember if there's a good reason for it or if I was just being
>> lazy. Will take a look and report back.
>>
>> J
>>
>> On Tue, May 5, 2015 at 8:38 AM, Jeff Quinn <[email protected]> wrote:
>>
>>> Hello,
>>>
>>> I would like to know the byte offset (absolute offset, not relative to
>>> split) for each record inside of my crunch pipeline.
>>>
>>> My planned approach is to use a custom `InputFormat` class.
>>>
>>> I have tried using `From#formattedFile` to apply a custom
>>> `InputFormat` class, however the returned class does not implement
>>> `ReadableSource`, and thus cannot be used as a parameter for
>>> `Pipeline#read`.
>>>
>>> What is the purpose of the `From#formattedFile` method if the Source
>>> class it returns cannot actually be read? Is using a custom
>>> `InputFormat` class possible or recommended?
>>>
>>> Thanks,
>>>
>>> Jeff Quinn
>>> Data Engineer
>>> Nuna
>>>
>>> *DISCLAIMER:* The contents of this email, including any attachments,
>>> may contain information that is confidential, proprietary in nature,
>>> protected health information (PHI), or otherwise protected by law from
>>> disclosure, and is solely for the use of the intended recipient(s). If you
>>> are not the intended recipient, you are hereby notified that any use,
>>> disclosure or copying of this email, including any attachments, is
>>> unauthorized and strictly prohibited. If you have received this email in
>>> error, please notify the sender of this email. Please delete this and all
>>> copies of this email from your system. Any opinions either expressed or
>>> implied in this email and all attachments, are those of its author only,
>>> and do not necessarily reflect those of Nuna Health, Inc.
--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
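
For readers skimming the archive, the Source vs. ReadableSource distinction described in this thread can be modeled with a small self-contained Java sketch. It deliberately does not use the Crunch API: SketchSource, SketchReadableSource, SketchMemRunner, ClusterOnlySource and InMemorySource are hypothetical stand-ins invented for illustration, and throwing IllegalStateException for a non-readable source is an assumption about how a client-side runner might behave, not a statement about what MemPipeline actually does.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a source that only a cluster job knows how to read.
// This is NOT org.apache.crunch.Source; it is a simplified model.
interface SketchSource<T> {
    String name();
}

// Hypothetical stand-in for a source that can also be read into the local client.
// The real Crunch interface has different method signatures.
interface SketchReadableSource<T> extends SketchSource<T> {
    Iterable<T> read();
}

// A source with no client-side read path, analogous to what the thread says
// formattedFile returned at the time.
final class ClusterOnlySource implements SketchSource<String> {
    private final String name;
    ClusterOnlySource(String name) { this.name = name; }
    public String name() { return name; }
}

// A source that can hand its records straight to the client.
final class InMemorySource implements SketchReadableSource<String> {
    private final List<String> records;
    InMemorySource(List<String> records) { this.records = records; }
    public String name() { return "in-memory"; }
    public Iterable<String> read() { return records; }
}

// Hypothetical client-side runner: everything runs locally, so it can only
// accept sources that expose a local read path.
final class SketchMemRunner {
    static <T> Iterable<T> read(SketchSource<T> source) {
        if (source instanceof SketchReadableSource) {
            return ((SketchReadableSource<T>) source).read();
        }
        // Assumed behavior for this sketch: reject sources without a local read path.
        throw new IllegalStateException("Source " + source.name() + " is not readable client-side");
    }
}

final class SourceSketch {
    public static void main(String[] args) {
        // Readable source: records come back on the client.
        for (String record : SketchMemRunner.read(new InMemorySource(Arrays.asList("a", "b")))) {
            System.out.println(record);
        }
        // Plain source: the client-side runner refuses it.
        try {
            SketchMemRunner.read(new ClusterOnlySource("formatted-file"));
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this model, Jeff's proposed workaround corresponds to writing a class like InMemorySource for his own format, while the fix Josh describes corresponds to making the library's own source readable so that no wrapper is needed.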
