This might be overkill for your needs, but the scodec parser combinator library (https://github.com/scodec/scodec) might be useful for building a parser for a binary layout like the one you describe; a rough sketch is below.
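Untested, and only a sketch against the layout in your message. It assumes big-endian signed values, a scodec version where Decoder has map/flatMap and decode returns Attempt, and made-up names for fields whose meaning I'm guessing at:

import java.nio.file.{Files, Paths}

import scodec.{Attempt, DecodeResult, Decoder}
import scodec.bits.BitVector
import scodec.codecs._

// Illustrative names; adjust to whatever the blocks actually mean.
case class Graph(
    magic: Int,                // leading INT
    numVertices: Int,          // A
    numEdges: Long,            // B
    degrees: Vector[Int],      // A INTs
    vertexAttrs: Vector[Int],  // A SHORTINTs
    edgeSrc: Vector[Int],      // B INTs (guess: source vertex ids)
    edgeDst: Vector[Int],      // B INTs (guess: destination vertex ids)
    edgeAttr1: Vector[Int],    // B SHORTINTs
    edgeAttr2: Vector[Int])    // B SHORTINTs

// vectorOfN(provide(n), c) decodes exactly n values with codec c.
// Swap int32/int16 for int32L/int16L if the file is little-endian.
val graphDecoder: Decoder[Graph] = for {
  magic   <- int32
  a       <- int32
  b       <- int64
  n        = b.toInt           // assumes B fits in an Int
  degrees <- vectorOfN(provide(a), int32)
  vattrs  <- vectorOfN(provide(a), int16)
  src     <- vectorOfN(provide(n), int32)
  dst     <- vectorOfN(provide(n), int32)
  attr1   <- vectorOfN(provide(n), int16)
  attr2   <- vectorOfN(provide(n), int16)
} yield Graph(magic, a, b, degrees, vattrs, src, dst, attr1, attr2)

// Decode the whole file in one shot (this happens on a single machine).
val bits = BitVector(Files.readAllBytes(Paths.get("graph.bin")))
graphDecoder.decode(bits) match {
  case Attempt.Successful(DecodeResult(graph, _)) =>
    println(s"${graph.numVertices} vertices, ${graph.numEdges} edges")
  case Attempt.Failure(err) =>
    println(s"decode failed: $err")
}

This only gets you a clean in-memory structure; you would still parallelize the vertex and edge vectors afterwards to build the two RDDs you need.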
Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com

On Thu, Apr 2, 2015 at 6:53 PM, java8964 <java8...@hotmail.com> wrote:
> I think implementing your own InputFormat and using
> SparkContext.hadoopFile() is the best option for your case.
>
> Yong
>
> ------------------------------
> From: kvi...@vt.edu
> Date: Thu, 2 Apr 2015 17:31:30 -0400
> Subject: Re: Reading a large file (binary) into RDD
> To: freeman.jer...@gmail.com
> CC: user@spark.apache.org
>
> The file has a specific structure. I outline it below.
>
> The input file is basically a representation of a graph.
>
> INT
> INT (A)
> LONG (B)
> A INTs (Degrees)
> A SHORTINTs (Vertex_Attribute)
> B INTs
> B INTs
> B SHORTINTs
> B SHORTINTs
>
> A - number of vertices
> B - number of edges (note that the INTs/SHORTINTs associated with these
> are edge attributes)
>
> After reading in the file, I need to create two RDDs (one with vertices
> and the other with edges).
>
> On Thu, Apr 2, 2015 at 4:46 PM, Jeremy Freeman <freeman.jer...@gmail.com>
> wrote:
>
> Hm, that will indeed be trickier, because this method assumes all records
> are the same byte size. Is the file an arbitrary sequence of mixed types,
> or is there structure, e.g. short, long, short, long, etc.?
>
> If you could post a gist with an example of the kind of file and how it
> should look once read in, that would be useful!
>
> -------------------------
> jeremyfreeman.net
> @thefreemanlab
>
> On Apr 2, 2015, at 2:09 PM, Vijayasarathy Kannan <kvi...@vt.edu> wrote:
>
> Thanks for the reply. Unfortunately, in my case, the binary file is a mix
> of short and long integers. Is there any other way that could be of use
> here?
>
> My current method has a large overhead (much more than the actual
> computation time). Also, I run short of memory at the driver when it has
> to read the entire file.
>
> On Thu, Apr 2, 2015 at 1:44 PM, Jeremy Freeman <freeman.jer...@gmail.com>
> wrote:
>
> If it’s a flat binary file and each record is the same length (in bytes),
> you can use Spark’s binaryRecords method (defined on the SparkContext),
> which loads records from one or more large flat binary files into an RDD.
> Here’s an example in Python to show how it works:
>
> # write data from an array
> from numpy import random
> dat = random.randn(100, 5)
> f = open('test.bin', 'wb')
> f.write(dat)
> f.close()
>
> # load the data back in
> from numpy import frombuffer
>
> nrecords = 5
> bytesize = 8
> recordsize = nrecords * bytesize
> data = sc.binaryRecords('test.bin', recordsize)
> parsed = data.map(lambda v: frombuffer(buffer(v, 0, recordsize), 'float'))
>
> # these should be equal
> parsed.first()
> dat[0,:]
>
> Does that help?
>
> -------------------------
> jeremyfreeman.net
> @thefreemanlab
>
> On Apr 2, 2015, at 1:33 PM, Vijayasarathy Kannan <kvi...@vt.edu> wrote:
>
> What are some efficient ways to read a large file into RDDs?
>
> For example, could several executors each read a specific/unique portion
> of the file and construct RDDs? Is this possible in Spark?
>
> Currently, I am doing a line-by-line read of the file at the driver and
> constructing the RDD.
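
A custom InputFormat, as Yong suggests, is what gives you real parallelism while reading, but if you want something quicker to try first, sc.binaryFiles hands each (unsplit) file to a single task as a PortableDataStream, so one executor rather than the driver can parse the header and arrays, and you can fan the result out into the two RDDs afterwards. A rough, untested sketch; the path, field names, and the (src, dst) reading of the two B-INT blocks are guesses:

// In spark-shell (or any driver program), with `sc` already available.
// One element per file: (path, PortableDataStream); the file is not split.
val raw = sc.binaryFiles("hdfs:///data/graph.bin")

// Parse the header and the variable-length sections inside a single task.
val parsed = raw.map { case (_, pds) =>
  val in = pds.open()            // java.io.DataInputStream, big-endian reads
  try {
    val magic   = in.readInt()
    val a       = in.readInt()   // number of vertices
    val b       = in.readLong()  // number of edges
    val degrees = Array.fill(a)(in.readInt())
    val vattrs  = Array.fill(a)(in.readShort())
    val n       = b.toInt        // assumes B fits in an Int
    val src     = Array.fill(n)(in.readInt())
    val dst     = Array.fill(n)(in.readInt())
    val eattr1  = Array.fill(n)(in.readShort())
    val eattr2  = Array.fill(n)(in.readShort())
    (degrees, vattrs, src, dst, eattr1, eattr2)
  } finally in.close()
}
parsed.cache()  // reused below for both RDDs

// Vertex RDD: (vertexId, (degree, attribute)), assuming ids are 0..A-1.
val vertices = parsed.flatMap { case (deg, vat, _, _, _, _) =>
  deg.indices.map(i => (i.toLong, (deg(i), vat(i))))
}

// Edge RDD: (src, dst, attr1, attr2); the column meanings are a guess.
val edges = parsed.flatMap { case (_, _, src, dst, a1, a2) =>
  src.indices.map(i => (src(i), dst(i), a1(i), a2(i)))
}

The obvious limitation is that the whole file is still read and parsed by a single task; it mainly saves you from pulling everything through the driver.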