> > It shouldn't be specific to text files, the same should happen with binary
> > files.
What is the expected notion of a line/line number in a binary file?

On Mon, Dec 30, 2013 at 8:27 AM, Aureliano Buendia <[email protected]> wrote:

> On Mon, Dec 30, 2013 at 4:24 PM, Michael (Bach) Bui <[email protected]> wrote:
>
>> Note that Spark uses the HDFS API to access the file.
>> The HDFS API has KeyValueTextInputFormat, which addresses Aureliano’s
>> requirement.
>
> It shouldn't be specific to text files, the same should happen with binary
> files.
>
>> I am just not sure if KeyValueTextInputFormat has been pulled into the
>> latest version of Spark yet.
>> Without that, it may be messy to make sure that the partition boundary is
>> a newline character.
>>
>> I think this usage pattern is important. If it is not yet available, I
>> can try to pull it in.
>
> I agree. It'd be super useful to have this feature.
>
>> --------------------------------------------
>> Michael (Bach) Bui, PhD,
>> Senior Staff Architect, ADATAO Inc.
>> www.adatao.com
>>
>> On Dec 30, 2013, at 6:28 AM, Aureliano Buendia <[email protected]>
>> wrote:
>>
>> Hi,
>>
>> When reading a simple text file in Spark, what's the best way of mapping
>> each line to (line number, line)? RDD doesn't seem to have an equivalent of
>> zipWithIndex.
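
For the question quoted above, one possible way to get (line number, line) without a built-in zipWithIndex is a two-pass scheme: first count the lines in each partition, then give each partition a starting offset equal to the total count of all partitions before it. The sketch below is plain Python over lists that stand in for partitions. It is not Spark code, and the names `partition_offsets` and `number_lines` are made up for illustration. In Spark the first pass would presumably be a per-partition count collected on the driver and the second a per-partition map, but that mapping is an assumption here.

```python
from itertools import accumulate


def partition_offsets(partitions):
    """First pass: return the starting line number of each partition."""
    counts = [len(part) for part in partitions]
    # Exclusive prefix sum: partition i starts after all earlier partitions.
    return [0] + list(accumulate(counts))[:-1]


def number_lines(partitions):
    """Second pass: pair every line with its global, zero-based line number."""
    offsets = partition_offsets(partitions)
    return [
        [(offset + i, line) for i, line in enumerate(part)]
        for offset, part in zip(offsets, partitions)
    ]


# Hypothetical input: a file split into four partitions, one of them empty.
partitions = [["a", "b"], [], ["c"], ["d", "e", "f"]]
numbered = number_lines(partitions)
```

Note that this only numbers whole records. It assumes every partition already starts and ends on a line boundary, which is the concern raised in the thread, and it does not say what a "line" would mean for a binary file.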
