This might be overkill for your needs, but the scodec parser combinator library (https://github.com/scodec/scodec) might be useful for building a parser for a binary layout like the one you describe; a rough sketch is below.
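Untested, and only a sketch against the layout in your message. It assumes big-endian signed values, a scodec version where Decoder has map/flatMap and decode returns Attempt, and made-up names for fields whose meaning I'm guessing at:

import java.nio.file.{Files, Paths}

import scodec.{Attempt, DecodeResult, Decoder}
import scodec.bits.BitVector
import scodec.codecs._

// Illustrative names; adjust to whatever the blocks actually mean.
case class Graph(
    magic: Int,                // leading INT
    numVertices: Int,          // A
    numEdges: Long,            // B
    degrees: Vector[Int],      // A INTs
    vertexAttrs: Vector[Int],  // A SHORTINTs
    edgeSrc: Vector[Int],      // B INTs (guess: source vertex ids)
    edgeDst: Vector[Int],      // B INTs (guess: destination vertex ids)
    edgeAttr1: Vector[Int],    // B SHORTINTs
    edgeAttr2: Vector[Int])    // B SHORTINTs

// vectorOfN(provide(n), c) decodes exactly n values with codec c.
// Swap int32/int16 for int32L/int16L if the file is little-endian.
val graphDecoder: Decoder[Graph] = for {
  magic   <- int32
  a       <- int32
  b       <- int64
  n        = b.toInt           // assumes B fits in an Int
  degrees <- vectorOfN(provide(a), int32)
  vattrs  <- vectorOfN(provide(a), int16)
  src     <- vectorOfN(provide(n), int32)
  dst     <- vectorOfN(provide(n), int32)
  attr1   <- vectorOfN(provide(n), int16)
  attr2   <- vectorOfN(provide(n), int16)
} yield Graph(magic, a, b, degrees, vattrs, src, dst, attr1, attr2)

// Decode the whole file in one shot (this happens on a single machine).
val bits = BitVector(Files.readAllBytes(Paths.get("graph.bin")))
graphDecoder.decode(bits) match {
  case Attempt.Successful(DecodeResult(graph, _)) =>
    println(s"${graph.numVertices} vertices, ${graph.numEdges} edges")
  case Attempt.Failure(err) =>
    println(s"decode failed: $err")
}

This only gets you a clean in-memory structure; you would still parallelize the vertex and edge vectors afterwards to build the two RDDs you need.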
Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com

On Thu, Apr 2, 2015 at 6:53 PM, java8964 <java8...@hotmail.com> wrote:
> I think implementing your own InputFormat and using
> SparkContext.hadoopFile() is the best option for your case.
>
> Yong
>
> ------------------------------
> From: kvi...@vt.edu
> Date: Thu, 2 Apr 2015 17:31:30 -0400
> Subject: Re: Reading a large file (binary) into RDD
> To: freeman.jer...@gmail.com
> CC: user@spark.apache.org
>
> The file has a specific structure. I outline it below.
>
> The input file is basically a representation of a graph.
>
> INT
> INT (A)
> LONG (B)
> A INTs (Degrees)
> A SHORTINTs (Vertex_Attribute)
> B INTs
> B INTs
> B SHORTINTs
> B SHORTINTs
>
> A - number of vertices
> B - number of edges (note that the INTs/SHORTINTs associated with these
> are edge attributes)
>
> After reading in the file, I need to create two RDDs (one with vertices
> and the other with edges).
>
> On Thu, Apr 2, 2015 at 4:46 PM, Jeremy Freeman <freeman.jer...@gmail.com>
> wrote:
>
> Hm, that will indeed be trickier, because this method assumes all records
> are the same byte size. Is the file an arbitrary sequence of mixed types,
> or is there structure, e.g. short, long, short, long, etc.?
>
> If you could post a gist with an example of the kind of file and how it
> should look once read in, that would be useful!
>
> -------------------------
> jeremyfreeman.net
> @thefreemanlab
>
> On Apr 2, 2015, at 2:09 PM, Vijayasarathy Kannan <kvi...@vt.edu> wrote:
>
> Thanks for the reply. Unfortunately, in my case, the binary file is a mix
> of short and long integers. Is there any other way that could be of use
> here?
>
> My current method has a large overhead (much more than the actual
> computation time). Also, I run short of memory at the driver when it has
> to read the entire file.
>
> On Thu, Apr 2, 2015 at 1:44 PM, Jeremy Freeman <freeman.jer...@gmail.com>
> wrote:
>
> If it’s a flat binary file and each record is the same length (in bytes),
> you can use Spark’s binaryRecords method (defined on the SparkContext),
> which loads records from one or more large flat binary files into an RDD.
> Here’s an example in Python to show how it works:
>
> # write data from an array
> from numpy import random
> dat = random.randn(100, 5)
> f = open('test.bin', 'wb')
> f.write(dat)
> f.close()
>
> # load the data back in
> from numpy import frombuffer
>
> nrecords = 5
> bytesize = 8
> recordsize = nrecords * bytesize
> data = sc.binaryRecords('test.bin', recordsize)
> parsed = data.map(lambda v: frombuffer(buffer(v, 0, recordsize), 'float'))
>
> # these should be equal
> parsed.first()
> dat[0,:]
>
> Does that help?
>
> -------------------------
> jeremyfreeman.net
> @thefreemanlab
>
> On Apr 2, 2015, at 1:33 PM, Vijayasarathy Kannan <kvi...@vt.edu> wrote:
>
> What are some efficient ways to read a large file into RDDs?
>
> For example, could several executors each read a specific/unique portion
> of the file and construct RDDs? Is this possible in Spark?
>
> Currently, I am doing a line-by-line read of the file at the driver and
> constructing the RDD.
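
A custom InputFormat, as Yong suggests, is what gives you real parallelism while reading, but if you want something quicker to try first, sc.binaryFiles hands each (unsplit) file to a single task as a PortableDataStream, so one executor rather than the driver can parse the header and arrays, and you can fan the result out into the two RDDs afterwards. A rough, untested sketch; the path, field names, and the (src, dst) reading of the two B-INT blocks are guesses:

// In spark-shell (or any driver program), with `sc` already available.
// One element per file: (path, PortableDataStream); the file is not split.
val raw = sc.binaryFiles("hdfs:///data/graph.bin")

// Parse the header and the variable-length sections inside a single task.
val parsed = raw.map { case (_, pds) =>
  val in = pds.open()            // java.io.DataInputStream, big-endian reads
  try {
    val magic   = in.readInt()
    val a       = in.readInt()   // number of vertices
    val b       = in.readLong()  // number of edges
    val degrees = Array.fill(a)(in.readInt())
    val vattrs  = Array.fill(a)(in.readShort())
    val n       = b.toInt        // assumes B fits in an Int
    val src     = Array.fill(n)(in.readInt())
    val dst     = Array.fill(n)(in.readInt())
    val eattr1  = Array.fill(n)(in.readShort())
    val eattr2  = Array.fill(n)(in.readShort())
    (degrees, vattrs, src, dst, eattr1, eattr2)
  } finally in.close()
}
parsed.cache()  // reused below for both RDDs

// Vertex RDD: (vertexId, (degree, attribute)), assuming ids are 0..A-1.
val vertices = parsed.flatMap { case (deg, vat, _, _, _, _) =>
  deg.indices.map(i => (i.toLong, (deg(i), vat(i))))
}

// Edge RDD: (src, dst, attr1, attr2); the column meanings are a guess.
val edges = parsed.flatMap { case (_, _, src, dst, a1, a2) =>
  src.indices.map(i => (src(i), dst(i), a1(i), a2(i)))
}

The obvious limitation is that the whole file is still read and parsed by a single task; it mainly saves you from pulling everything through the driver.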