On Mon, Dec 30, 2013 at 5:19 PM, Aaron Davidson <[email protected]> wrote:
>> It shouldn't be specific to text files, the same should happen with binary
>> files
>
> What is the expected notion of a line/line number in a binary file?

Element / index of the element in an array.

> On Mon, Dec 30, 2013 at 8:27 AM, Aureliano Buendia <[email protected]> wrote:
>
>> On Mon, Dec 30, 2013 at 4:24 PM, Michael (Bach) Bui <[email protected]> wrote:
>>
>>> Note that Spark uses the HDFS API to access the file.
>>> The HDFS API has KeyValueTextInputFormat, which addresses Aureliano’s
>>> requirement.
>>
>> It shouldn't be specific to text files, the same should happen with
>> binary files.
>>
>>> I am just not sure if KeyValueTextInputFormat has been pulled into the
>>> latest version of Spark yet.
>>> Without that, it may be messy to make sure that the partition boundary
>>> is a newline character.
>>>
>>> I think this usage pattern is important; if it is not yet available, I
>>> can try to pull it in.
>>
>> I agree. It'd be super useful to have this feature.
>>
>>> --------------------------------------------
>>> Michael (Bach) Bui, PhD,
>>> Senior Staff Architect, ADATAO Inc.
>>> www.adatao.com
>>>
>>> On Dec 30, 2013, at 6:28 AM, Aureliano Buendia <[email protected]>
>>> wrote:
>>>
>>> Hi,
>>>
>>> When reading a simple text file in Spark, what's the best way of mapping
>>> each line to (line number, line)? RDD doesn't seem to have an equivalent of
>>> zipWithIndex.
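
To make the "index of element" idea concrete, here is a rough sketch of one way a global index could be assigned across partitions: count each partition, take a running sum to get each partition's starting offset, then number the elements locally. Plain Python lists stand in for RDD partitions, and `zip_with_index` and `partitions` are hypothetical names used only for illustration, not Spark API. In Spark the second pass would presumably map onto something like `mapPartitionsWithIndex`, but that mapping is an assumption and is not shown here.

```python
from itertools import accumulate


def zip_with_index(partitions):
    """Pair every element with its global index, partition by partition."""
    # Pass 1: count the elements in each partition.
    counts = [len(part) for part in partitions]
    # Running sum gives the global index of each partition's first element.
    offsets = [0] + list(accumulate(counts))[:-1]
    # Pass 2: number elements locally and shift by the partition's offset.
    return [
        [(offset + i, item) for i, item in enumerate(part)]
        for offset, part in zip(offsets, partitions)
    ]


# Hypothetical data: three "partitions" holding the lines of a text file.
partitions = [["a", "b"], ["c"], ["d", "e", "f"]]
indexed = zip_with_index(partitions)
```

Nothing in this depends on the elements being lines of text, so the same numbering would apply to records of a binary file, provided the input is already split into elements on record boundaries.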
