Thanks everyone for the input. I guess I will try a custom InputFormat implementation, but I have no idea where to start. Are there any code examples of this that might help?
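In case it helps, below is a rough, untested sketch (Scala, Hadoop's new MapReduce API) of the kind of InputFormat Yong suggested. The class names (WholeBinaryFileInputFormat, WholeBinaryFileRecordReader) are just illustrative, not from any existing library. The idea is to mark the file as non-splittable and hand its entire contents to a single record reader as one byte array, since the header has to be read before the vertex/edge sections make sense.

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{BytesWritable, NullWritable}
import org.apache.hadoop.mapreduce.{InputSplit, JobContext, RecordReader, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, FileSplit}

// Emits one (NullWritable, BytesWritable) record per file: the file's entire contents.
class WholeBinaryFileInputFormat extends FileInputFormat[NullWritable, BytesWritable] {

  // the graph file has a header followed by variable-length sections,
  // so don't let Hadoop split it at arbitrary byte offsets
  override def isSplitable(context: JobContext, file: Path): Boolean = false

  override def createRecordReader(split: InputSplit, context: TaskAttemptContext)
      : RecordReader[NullWritable, BytesWritable] = new WholeBinaryFileRecordReader
}

class WholeBinaryFileRecordReader extends RecordReader[NullWritable, BytesWritable] {
  private var split: FileSplit = _
  private var context: TaskAttemptContext = _
  private var value: BytesWritable = _
  private var processed = false

  override def initialize(inputSplit: InputSplit, taskContext: TaskAttemptContext): Unit = {
    split = inputSplit.asInstanceOf[FileSplit]
    context = taskContext
  }

  override def nextKeyValue(): Boolean = {
    if (processed) return false
    val path = split.getPath
    val fs = path.getFileSystem(context.getConfiguration)
    val in = fs.open(path)
    try {
      // assumes a single input file fits comfortably in one executor's memory
      val bytes = new Array[Byte](split.getLength.toInt)
      in.readFully(bytes)
      value = new BytesWritable(bytes)
    } finally {
      in.close()
    }
    processed = true
    true
  }

  override def getCurrentKey: NullWritable = NullWritable.get()
  override def getCurrentValue: BytesWritable = value
  override def getProgress: Float = if (processed) 1.0f else 0.0f
  override def close(): Unit = ()
}

and then something like:

val raw = sc.newAPIHadoopFile(
  "hdfs:///path/to/graph.bin",          // illustrative path
  classOf[WholeBinaryFileInputFormat],
  classOf[NullWritable],
  classOf[BytesWritable])
val fileBytes = raw.map { case (_, bw) => bw.copyBytes() }

Note that Spark's built-in sc.binaryFiles (available since 1.2), which returns (filename, PortableDataStream) pairs, does something very similar out of the box, so it may be worth trying that before writing a custom format.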
On Fri, Apr 3, 2015 at 9:15 AM, Dean Wampler <deanwamp...@gmail.com> wrote:

> This might be overkill for your needs, but the scodec parser combinator
> library might be useful for creating a parser.
>
> https://github.com/scodec/scodec
>
> Dean Wampler, Ph.D.
> Author: Programming Scala, 2nd Edition
> <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
> Typesafe <http://typesafe.com>
> @deanwampler <http://twitter.com/deanwampler>
> http://polyglotprogramming.com
>
> On Thu, Apr 2, 2015 at 6:53 PM, java8964 <java8...@hotmail.com> wrote:
>
>> I think implementing your own InputFormat and using
>> SparkContext.hadoopFile() is the best option for your case.
>>
>> Yong
>>
>> ------------------------------
>> From: kvi...@vt.edu
>> Date: Thu, 2 Apr 2015 17:31:30 -0400
>> Subject: Re: Reading a large file (binary) into RDD
>> To: freeman.jer...@gmail.com
>> CC: user@spark.apache.org
>>
>> The file has a specific structure. I outline it below.
>>
>> The input file is basically a representation of a graph.
>>
>> INT
>> INT (A)
>> LONG (B)
>> A INTs (Degrees)
>> A SHORTINTs (Vertex_Attribute)
>> B INTs
>> B INTs
>> B SHORTINTs
>> B SHORTINTs
>>
>> A - number of vertices
>> B - number of edges (note that the INTs/SHORTINTs associated with this
>> are edge attributes)
>>
>> After reading in the file, I need to create two RDDs (one with vertices
>> and the other with edges).
>>
>> On Thu, Apr 2, 2015 at 4:46 PM, Jeremy Freeman <freeman.jer...@gmail.com> wrote:
>>
>> Hm, that will indeed be trickier, because this method assumes records are
>> the same byte size. Is the file an arbitrary sequence of mixed types, or is
>> there structure, e.g. short, long, short, long, etc.?
>>
>> If you could post a gist with an example of the kind of file and how it
>> should look once read in, that would be useful!
>>
>> -------------------------
>> jeremyfreeman.net
>> @thefreemanlab
>>
>> On Apr 2, 2015, at 2:09 PM, Vijayasarathy Kannan <kvi...@vt.edu> wrote:
>>
>> Thanks for the reply. Unfortunately, in my case, the binary file is a mix
>> of short and long integers. Is there any other way that could be of use here?
>>
>> My current method happens to have a large overhead (much more than the
>> actual computation time). Also, I am short of memory at the driver when it
>> has to read the entire file.
>>
>> On Thu, Apr 2, 2015 at 1:44 PM, Jeremy Freeman <freeman.jer...@gmail.com> wrote:
>>
>> If it's a flat binary file and each record is the same length (in bytes),
>> you can use Spark's binaryRecords method (defined on the SparkContext),
>> which loads records from one or more large flat binary files into an RDD.
>> Here's an example in Python to show how it works:
>>
>> # write data from an array
>> from numpy import random
>> dat = random.randn(100, 5)
>> f = open('test.bin', 'wb')
>> f.write(dat)
>> f.close()
>>
>> # load the data back in
>> from numpy import frombuffer
>>
>> nrecords = 5
>> bytesize = 8
>> recordsize = nrecords * bytesize
>> data = sc.binaryRecords('test.bin', recordsize)
>> parsed = data.map(lambda v: frombuffer(buffer(v, 0, recordsize), 'float'))
>>
>> # these should be equal
>> parsed.first()
>> dat[0, :]
>>
>> Does that help?
>>
>> -------------------------
>> jeremyfreeman.net
>> @thefreemanlab
>>
>> On Apr 2, 2015, at 1:33 PM, Vijayasarathy Kannan <kvi...@vt.edu> wrote:
>>
>> What are some efficient ways to read a large file into RDDs?
>>
>> For example, have several executors read a specific/unique portion of the
>> file and construct RDDs. Is this possible to do in Spark?
>>
>> Currently, I am doing a line-by-line read of the file at the driver and
>> constructing the RDD.
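P.S. Given the layout quoted above (one leading INT, then A = number of vertices and B = number of edges, followed by the degree/attribute/edge sections), a rough, untested sketch of turning the raw bytes into vertex and edge RDDs could look like the following. It assumes big-endian encoding, assumes B fits in an Int for array sizing, and guesses that the two B-length INT sections are source and destination ids; adjust to the real semantics.

import java.nio.ByteBuffer

// bytes: the full file contents, e.g. produced by the InputFormat sketch earlier in this message
def parseGraph(bytes: Array[Byte]) = {
  val buf = ByteBuffer.wrap(bytes)        // ByteBuffer defaults to big-endian
  buf.getInt()                            // first INT (purpose not specified in the thread)
  val numVertices = buf.getInt()          // A
  val numEdges = buf.getLong().toInt      // B (assumed to fit in an Int)

  val degrees     = Array.fill(numVertices)(buf.getInt())
  val vertexAttrs = Array.fill(numVertices)(buf.getShort())
  val srcIds      = Array.fill(numEdges)(buf.getInt())    // guess: source vertex ids
  val dstIds      = Array.fill(numEdges)(buf.getInt())    // guess: destination vertex ids
  val edgeAttr1   = Array.fill(numEdges)(buf.getShort())
  val edgeAttr2   = Array.fill(numEdges)(buf.getShort())

  val vertices = (0 until numVertices).map(i => (i.toLong, (degrees(i), vertexAttrs(i))))
  val edges    = (0 until numEdges).map(i => (srcIds(i), dstIds(i), edgeAttr1(i), edgeAttr2(i)))
  (vertices, edges)
}

// e.g.
// val (v, e) = parseGraph(fileBytes.first())
// val vertexRDD = sc.parallelize(v)
// val edgeRDD   = sc.parallelize(e)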