>
> It shouldn't be specific to text files, the same should happen with binary
> files


What is the expected notion of a line/line number in a binary file?
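
For the text-file case asked about in the quoted thread below, one possible
approach is a two-pass scheme: count the lines in each partition, turn the
counts into per-partition starting offsets, then number each partition's lines
from its offset. The following is only a sketch of that idea. It uses plain
Python lists as a stand-in for partitions, and the names partition_offsets and
number_lines are hypothetical helpers, not part of any Spark API.

```python
from itertools import accumulate


def partition_offsets(counts):
    """Return the global index of the first line in each partition.

    `counts` holds the number of lines per partition, in partition order.
    """
    # Running total shifted by one: partition k starts after all lines
    # in partitions 0..k-1.
    return list(accumulate([0] + list(counts)))[:-1]


def number_lines(partitions):
    """Pair every line with its global 0-based line number.

    `partitions` is a list of lists standing in for the partitions of a
    distributed dataset (an assumption of this sketch; no Spark involved).
    """
    # Pass 1: only the per-partition counts need to be gathered centrally.
    counts = [len(part) for part in partitions]
    offsets = partition_offsets(counts)
    # Pass 2: each partition numbers its own lines independently.
    return [
        [(offset + i, line) for i, line in enumerate(part)]
        for offset, part in zip(offsets, partitions)
    ]
```

In Spark terms, pass 1 would roughly correspond to a per-partition count and
pass 2 to a per-partition map that knows its partition index (for example via
mapPartitionsWithIndex). This relies on partitions being read in file order
and on every record being a whole line, which is the partition-boundary
concern raised in the quoted message.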


On Mon, Dec 30, 2013 at 8:27 AM, Aureliano Buendia <[email protected]> wrote:

>
>
>
> On Mon, Dec 30, 2013 at 4:24 PM, Michael (Bach) Bui <[email protected]> wrote:
>
>> Note that Spark uses the HDFS API to access the file.
>> The HDFS API has KeyValueTextInputFormat, which addresses Aureliano’s
>> requirement.
>>
>
> It shouldn't be specific to text files, the same should happen with binary
> files.
>
>
>>
>> I am just not sure if KeyValueTextInputFormat has been pulled into the
>> latest version of Spark yet.
>> Without that, it may be messy to make sure that each partition boundary
>> falls on a newline character.
>>
>> I think this usage pattern is important. If it is not yet available, I
>> can try to pull it in.
>>
>
> I agree. It'd be super useful to have this feature.
>
>
>>
>> --------------------------------------------
>> Michael (Bach) Bui, PhD,
>> Senior Staff Architect, ADATAO Inc.
>> www.adatao.com
>>
>>
>>
>>
>> On Dec 30, 2013, at 6:28 AM, Aureliano Buendia <[email protected]>
>> wrote:
>>
>> Hi,
>>
>> When reading a simple text file in Spark, what's the best way of mapping
>> each line to (line number, line)? RDD doesn't seem to have an equivalent of
>> zipWithIndex.
>>
>>
>>
>
