Re: Byte Offset for Records

Josh Wills Tue, 05 May 2015 12:58:18 -0700

https://issues.apache.org/jira/browse/CRUNCH-517


Patch is up. I also fixed that stupid crunch-spark compile error on
hadoop1. I so cannot wait to get rid of hadoop1. :)

J

On Tue, May 5, 2015 at 8:21 AM, Jeff Quinn <[email protected]> wrote:

> Great. I would definitely agree, that sounds ideal.
>
> Thanks,
>
> Jeff
>
>
>
> On May 5, 2015, at 12:14 AM, Josh Wills <[email protected]> wrote:
>
>
> On Tue, May 5, 2015 at 9:08 AM, Jeff Quinn <[email protected]> wrote:
>
>> Josh,
>>
>> Thanks so much for your response. You’re correct: I hit the error while
>> using the MemPipeline. The difference between Source and ReadableSource
>> makes much more sense to me now.
>>
>> It sounds like I just need to implement ReadableSource and override the
>> #read and #asReadable methods with behavior that is equivalent to how my
>> `InputFormat` would act. Then I should be able to use my `InputFormat` in
>> my test suite with MemPipeline, and in my real pipeline I can rest assured
>> those methods will never be called.
>>
>
> That will work, but I still think the right thing to do is to make those
> formattedFile impls support ReadableSource. And there are definitely places
> in the MRPipeline and MemPipeline where ReadableSources would be useful
> w/formattedFiles (e.g., mapside joins) that we don't support right now.
>
>
>>
>> Best,
>>
>> Jeff
>>
>> On May 4, 2015, at 11:53 PM, Josh Wills <[email protected]> wrote:
>>
>> Any Source<T> can be used as the input to an MR/Spark job via
>> Pipeline.read, but a ReadableSource<T> can read data into the local client
>> as well-- I'm assuming you're hitting an error trying to use your
>> formattedFile source w/a MemPipeline job? MemPipeline requires
>> ReadableSources since everything it does runs client-side, while MRPipeline
>> and SparkPipeline are happy to use regular Sources, like the one returned
>> by formattedFile.
>>
>> The next question you would ask is "why doesn't formattedFile return a
>> ReadableSource<T>?" -- and it's a good one. I don't remember if there's a
>> good reason for it or if I was just being lazy. Will take a look and report
>> back.
>>
>> J
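
The distinction drawn in the message above can be illustrated with a small stand-alone model. This is only a sketch, not Crunch's actual classes: the names `SourceSketch`, `Source`, `ReadableSource`, `plain`, `readable`, and `readInMemory` are hypothetical stand-ins, and the error thrown for a non-readable source is an assumption about how a client-side pipeline might refuse it.

```java
import java.util.Arrays;
import java.util.List;

// Simplified, hypothetical model of the Source / ReadableSource split.
// None of these types are the real Crunch API.
final class SourceSketch {
    // A plain source only describes an input; a cluster job knows how to consume it.
    interface Source<T> {
        String name();
    }

    // A readable source can additionally hand its records to the local client.
    interface ReadableSource<T> extends Source<T> {
        Iterable<T> read();
    }

    // Builds a source that cannot be read client-side (like a custom-format input).
    static <T> Source<T> plain(String name) {
        return () -> name;
    }

    // Builds a source backed by an in-memory list, readable on the client.
    static <T> ReadableSource<T> readable(String name, List<T> records) {
        return new ReadableSource<T>() {
            @Override public String name() { return name; }
            @Override public Iterable<T> read() { return records; }
        };
    }

    // Imitates an in-memory pipeline: everything runs client-side,
    // so only readable sources are accepted.
    static <T> Iterable<T> readInMemory(Source<T> source) {
        if (source instanceof ReadableSource) {
            return ((ReadableSource<T>) source).read();
        }
        throw new IllegalStateException("Source " + source.name() + " is not readable");
    }

    public static void main(String[] args) {
        Source<String> ok = readable("in-memory", Arrays.asList("a", "b"));
        System.out.println(readInMemory(ok));
        try {
            readInMemory(SourceSketch.<String>plain("custom-format"));
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this model a cluster-backed pipeline would accept either kind of source, which is why the same input can work in a real job yet fail in an in-memory test.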
>>
>> On Tue, May 5, 2015 at 8:38 AM, Jeff Quinn <[email protected]> wrote:
>>
>>> Hello,
>>>
>>> I would like to know the byte offset (absolute offset, not relative to
>>> split) for each record inside of my crunch pipeline.
>>>
>>> My planned approach is to use a custom `InputFormat` class.
>>>
>>> I have tried using `From#formattedFile` to apply a custom
>>> `InputFormat` class; however, the returned class does not implement
>>> `ReadableSource`, and thus cannot be used as a parameter for
>>> `Pipeline#read`.
>>>
>>> What is the purpose of the `From#formattedFile` method if the Source
>>> it returns cannot actually be read? Is using a custom
>>> `InputFormat` class possible or recommended?
>>>
>>> Thanks,
>>>
>>> Jeff Quinn
>>> Data Engineer
>>> Nuna
>>>
>>> *DISCLAIMER:* The contents of this email, including any attachments,
>>> may contain information that is confidential, proprietary in nature,
>>> protected health information (PHI), or otherwise protected by law from
>>> disclosure, and is solely for the use of the intended recipient(s). If you
>>> are not the intended recipient, you are hereby notified that any use,
>>> disclosure or copying of this email, including any attachments, is
>>> unauthorized and strictly prohibited. If you have received this email in
>>> error, please notify the sender of this email. Please delete this and all
>>> copies of this email from your system. Any opinions either expressed or
>>> implied in this email and all attachments, are those of its author only,
>>> and do not necessarily reflect those of Nuna Health, Inc.
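
The absolute offset asked about in the original question is the split's starting position in the file plus the number of bytes consumed within the split before the record begins. A minimal sketch of that arithmetic, assuming newline-delimited records and splits that begin on a record boundary; the class `ByteOffsets` and method `recordOffsets` are hypothetical illustrations, not part of Crunch or Hadoop:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: computes absolute record offsets for one split.
final class ByteOffsets {
    // split      - the bytes of one split (assumed to start at a record boundary)
    // splitStart - the split's absolute starting position in the file
    static List<Long> recordOffsets(byte[] split, long splitStart) {
        List<Long> offsets = new ArrayList<>();
        boolean atRecordStart = true;
        for (int i = 0; i < split.length; i++) {
            if (atRecordStart) {
                // absolute offset = split start + position relative to the split
                offsets.add(splitStart + i);
                atRecordStart = false;
            }
            if (split[i] == '\n') {
                atRecordStart = true; // the next byte, if any, begins a new record
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        byte[] file = "ab\ncde\nf".getBytes(StandardCharsets.UTF_8);
        System.out.println(recordOffsets(file, 0L)); // [0, 3, 7]

        // Treat bytes 3..end as a second split; offsets stay absolute.
        byte[] second = Arrays.copyOfRange(file, 3, file.length);
        System.out.println(recordOffsets(second, 3L)); // [3, 7]
    }
}
```

A record reader inside a custom `InputFormat` would typically know its split's start position, so the same addition applies there.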
>>
>>
>>
>>
>> --
>> Director of Data Science
>> Cloudera <http://www.cloudera.com/>
>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>
>>
>



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
