Thanks everyone for the input. I guess I will try a custom InputFormat implementation, but I have no idea where to start. Are there any code examples of this that might help?
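In case it helps, below is a rough, untested sketch (Scala, Hadoop's new MapReduce API) of the kind of InputFormat Yong suggested. The class names (WholeBinaryFileInputFormat, WholeBinaryFileRecordReader) are just illustrative, not from any existing library. The idea is to mark the file as non-splittable and hand its entire contents to a single record reader as one byte array, since the header has to be read before the vertex/edge sections make sense.

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{BytesWritable, NullWritable}
import org.apache.hadoop.mapreduce.{InputSplit, JobContext, RecordReader, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, FileSplit}

// Emits one (NullWritable, BytesWritable) record per file: the file's entire contents.
class WholeBinaryFileInputFormat extends FileInputFormat[NullWritable, BytesWritable] {

  // the graph file has a header followed by variable-length sections,
  // so don't let Hadoop split it at arbitrary byte offsets
  override def isSplitable(context: JobContext, file: Path): Boolean = false

  override def createRecordReader(split: InputSplit, context: TaskAttemptContext)
      : RecordReader[NullWritable, BytesWritable] = new WholeBinaryFileRecordReader
}

class WholeBinaryFileRecordReader extends RecordReader[NullWritable, BytesWritable] {
  private var split: FileSplit = _
  private var context: TaskAttemptContext = _
  private var value: BytesWritable = _
  private var processed = false

  override def initialize(inputSplit: InputSplit, taskContext: TaskAttemptContext): Unit = {
    split = inputSplit.asInstanceOf[FileSplit]
    context = taskContext
  }

  override def nextKeyValue(): Boolean = {
    if (processed) return false
    val path = split.getPath
    val fs = path.getFileSystem(context.getConfiguration)
    val in = fs.open(path)
    try {
      // assumes a single input file fits comfortably in one executor's memory
      val bytes = new Array[Byte](split.getLength.toInt)
      in.readFully(bytes)
      value = new BytesWritable(bytes)
    } finally {
      in.close()
    }
    processed = true
    true
  }

  override def getCurrentKey: NullWritable = NullWritable.get()
  override def getCurrentValue: BytesWritable = value
  override def getProgress: Float = if (processed) 1.0f else 0.0f
  override def close(): Unit = ()
}

and then something like:

val raw = sc.newAPIHadoopFile(
  "hdfs:///path/to/graph.bin",          // illustrative path
  classOf[WholeBinaryFileInputFormat],
  classOf[NullWritable],
  classOf[BytesWritable])
val fileBytes = raw.map { case (_, bw) => bw.copyBytes() }

Note that Spark's built-in sc.binaryFiles (available since 1.2), which returns (filename, PortableDataStream) pairs, does something very similar out of the box, so it may be worth trying that before writing a custom format.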
On Fri, Apr 3, 2015 at 9:15 AM, Dean Wampler <deanwamp...@gmail.com> wrote:

> This might be overkill for your needs, but the scodec parser combinator
> library might be useful for creating a parser.
>
> https://github.com/scodec/scodec
>
> Dean Wampler, Ph.D.
> Author: Programming Scala, 2nd Edition
> <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
> Typesafe <http://typesafe.com>
> @deanwampler <http://twitter.com/deanwampler>
> http://polyglotprogramming.com
>
> On Thu, Apr 2, 2015 at 6:53 PM, java8964 <java8...@hotmail.com> wrote:
>
>> I think implementing your own InputFormat and using
>> SparkContext.hadoopFile() is the best option for your case.
>>
>> Yong
>>
>> ------------------------------
>> From: kvi...@vt.edu
>> Date: Thu, 2 Apr 2015 17:31:30 -0400
>> Subject: Re: Reading a large file (binary) into RDD
>> To: freeman.jer...@gmail.com
>> CC: user@spark.apache.org
>>
>> The file has a specific structure. I outline it below.
>>
>> The input file is basically a representation of a graph.
>>
>> INT
>> INT (A)
>> LONG (B)
>> A INTs (Degrees)
>> A SHORTINTs (Vertex_Attribute)
>> B INTs
>> B INTs
>> B SHORTINTs
>> B SHORTINTs
>>
>> A - number of vertices
>> B - number of edges (note that the INTs/SHORTINTs associated with this
>> are edge attributes)
>>
>> After reading in the file, I need to create two RDDs (one with vertices
>> and the other with edges).
>>
>> On Thu, Apr 2, 2015 at 4:46 PM, Jeremy Freeman <freeman.jer...@gmail.com> wrote:
>>
>> Hm, that will indeed be trickier, because this method assumes records are
>> the same byte size. Is the file an arbitrary sequence of mixed types, or is
>> there structure, e.g. short, long, short, long, etc.?
>>
>> If you could post a gist with an example of the kind of file and how it
>> should look once read in, that would be useful!
>>
>> -------------------------
>> jeremyfreeman.net
>> @thefreemanlab
>>
>> On Apr 2, 2015, at 2:09 PM, Vijayasarathy Kannan <kvi...@vt.edu> wrote:
>>
>> Thanks for the reply. Unfortunately, in my case, the binary file is a mix
>> of short and long integers. Is there any other way that could be of use here?
>>
>> My current method happens to have a large overhead (much more than the
>> actual computation time). Also, I am short of memory at the driver when it
>> has to read the entire file.
>>
>> On Thu, Apr 2, 2015 at 1:44 PM, Jeremy Freeman <freeman.jer...@gmail.com> wrote:
>>
>> If it's a flat binary file and each record is the same length (in bytes),
>> you can use Spark's binaryRecords method (defined on the SparkContext),
>> which loads records from one or more large flat binary files into an RDD.
>> Here's an example in Python to show how it works:
>>
>> # write data from an array
>> from numpy import random
>> dat = random.randn(100, 5)
>> f = open('test.bin', 'wb')
>> f.write(dat)
>> f.close()
>>
>> # load the data back in
>> from numpy import frombuffer
>>
>> nrecords = 5
>> bytesize = 8
>> recordsize = nrecords * bytesize
>> data = sc.binaryRecords('test.bin', recordsize)
>> parsed = data.map(lambda v: frombuffer(buffer(v, 0, recordsize), 'float'))
>>
>> # these should be equal
>> parsed.first()
>> dat[0, :]
>>
>> Does that help?
>>
>> -------------------------
>> jeremyfreeman.net
>> @thefreemanlab
>>
>> On Apr 2, 2015, at 1:33 PM, Vijayasarathy Kannan <kvi...@vt.edu> wrote:
>>
>> What are some efficient ways to read a large file into RDDs?
>>
>> For example, have several executors read a specific/unique portion of the
>> file and construct RDDs. Is this possible to do in Spark?
>>
>> Currently, I am doing a line-by-line read of the file at the driver and
>> constructing the RDD.
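P.S. Given the layout quoted above (one leading INT, then A = number of vertices and B = number of edges, followed by the degree/attribute/edge sections), a rough, untested sketch of turning the raw bytes into vertex and edge RDDs could look like the following. It assumes big-endian encoding, assumes B fits in an Int for array sizing, and guesses that the two B-length INT sections are source and destination ids; adjust to the real semantics.

import java.nio.ByteBuffer

// bytes: the full file contents, e.g. produced by the InputFormat sketch earlier in this message
def parseGraph(bytes: Array[Byte]) = {
  val buf = ByteBuffer.wrap(bytes)        // ByteBuffer defaults to big-endian
  buf.getInt()                            // first INT (purpose not specified in the thread)
  val numVertices = buf.getInt()          // A
  val numEdges = buf.getLong().toInt      // B (assumed to fit in an Int)

  val degrees     = Array.fill(numVertices)(buf.getInt())
  val vertexAttrs = Array.fill(numVertices)(buf.getShort())
  val srcIds      = Array.fill(numEdges)(buf.getInt())    // guess: source vertex ids
  val dstIds      = Array.fill(numEdges)(buf.getInt())    // guess: destination vertex ids
  val edgeAttr1   = Array.fill(numEdges)(buf.getShort())
  val edgeAttr2   = Array.fill(numEdges)(buf.getShort())

  val vertices = (0 until numVertices).map(i => (i.toLong, (degrees(i), vertexAttrs(i))))
  val edges    = (0 until numEdges).map(i => (srcIds(i), dstIds(i), edgeAttr1(i), edgeAttr2(i)))
  (vertices, edges)
}

// e.g.
// val (v, e) = parseGraph(fileBytes.first())
// val vertexRDD = sc.parallelize(v)
// val edgeRDD   = sc.parallelize(e)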