Re: How to map each line to (line number, line)?

Aureliano Buendia Mon, 30 Dec 2013 09:37:03 -0800

On Mon, Dec 30, 2013 at 5:19 PM, Aaron Davidson <[email protected]> wrote:


>> It shouldn't be specific to text files, the same should happen with
>> binary files.
>
>
> What is the expected notion of a line/line number in a binary file?
>

An element, and the index of that element, in an array.
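
To make that concrete, here is a rough sketch of what I mean. It assumes the binary file is a sequence of fixed-width records; the 4-byte little-endian integer layout and the name `indexed_records` are only assumptions for illustration, not anything Spark provides. The "line number" of a record is simply its position in that sequence.

```python
import struct

RECORD_FORMAT = "<i"  # assumed layout: one 4-byte little-endian int per record
RECORD_SIZE = struct.calcsize(RECORD_FORMAT)

def indexed_records(data):
    """Yield (index, value) for each complete fixed-width record in `data`."""
    # Step through the bytes one record at a time; a trailing partial
    # record (fewer than RECORD_SIZE bytes) is ignored.
    for offset in range(0, len(data) - RECORD_SIZE + 1, RECORD_SIZE):
        (value,) = struct.unpack_from(RECORD_FORMAT, data, offset)
        yield offset // RECORD_SIZE, value

# Three records packed back to back, standing in for a binary file.
blob = struct.pack("<3i", 10, 20, 30)
print(list(indexed_records(blob)))  # [(0, 10), (1, 20), (2, 30)]
```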


>
>
> On Mon, Dec 30, 2013 at 8:27 AM, Aureliano Buendia 
> <[email protected]> wrote:
>
>>
>>
>>
>> On Mon, Dec 30, 2013 at 4:24 PM, Michael (Bach) Bui 
>> <[email protected]> wrote:
>>
>>> Note that Spark uses the HDFS API to access the file.
>>> The HDFS API has KeyValueTextInputFormat, which addresses Aureliano’s
>>> requirement.
>>>
>>
>> It shouldn't be specific to text files, the same should happen with
>> binary files.
>>
>>
>>>
>>> I am just not sure if KeyValueTextInputFormat has been pulled into the
>>> latest version of Spark yet.
>>> Without that, it may be messy to make sure that the partition boundary
>>> is a newline character.
>>>
>>> I think this usage pattern is important. If it is not yet available, I
>>> can try to pull it in.
>>>
>>
>> I agree. It'd be super useful to have this feature.
>>
>>
>>>
>>> --------------------------------------------
>>> Michael (Bach) Bui, PhD,
>>> Senior Staff Architect, ADATAO Inc.
>>> www.adatao.com
>>>
>>>
>>>
>>>
>>> On Dec 30, 2013, at 6:28 AM, Aureliano Buendia <[email protected]>
>>> wrote:
>>>
>>> Hi,
>>>
>>> When reading a simple text file in Spark, what's the best way of mapping
>>> each line to (line number, line)? RDD doesn't seem to have an equivalent
>>> of zipWithIndex.
>>>
>>>
>>>
>>
>
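
In the meantime, one possible workaround is a two-pass approach: count the elements in each partition, turn the counts into starting offsets, then number the elements within each partition from its offset. Below is a minimal sketch of that idea in plain Python, with partitions simulated as ordinary lists. The function name `zip_with_index` is hypothetical and this is not Spark code; on a real RDD I would expect both passes to be expressed with something like mapPartitionsWithIndex, but I have not tested that.

```python
from itertools import accumulate

def zip_with_index(partitions):
    """Pair every element with its global index across ordered partitions."""
    # Pass 1: count the elements in each partition.
    counts = [len(part) for part in partitions]
    # A partition's starting offset is the total size of all earlier partitions.
    offsets = [0] + list(accumulate(counts))[:-1]
    # Pass 2: number elements locally and shift by the partition's offset.
    return [
        [(offset + i, item) for i, item in enumerate(part)]
        for part, offset in zip(partitions, offsets)
    ]

# Lines of a text file as they might be split over four partitions
# (one of them empty).
parts = [["a", "b"], ["c"], [], ["d", "e"]]
print(zip_with_index(parts))
# [[(0, 'a'), (1, 'b')], [(2, 'c')], [], [(3, 'd'), (4, 'e')]]
```

This only gives meaningful line numbers if the partitions preserve the order of the file, and it costs an extra pass over the data to obtain the counts.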
