Hadoop's TextInputFormat is a good place to start.
It is not really that hard. You just need to implement the logic to identify
the record delimiter, and think of a logical way to represent the <Key, Value>
pair for your RecordReader.
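For a rough picture of the moving parts, a sketch along these lines might help
(the class names, the fixed 12-byte record length, and the choice of the newer
mapreduce API are placeholder assumptions, not tested code):

import org.apache.hadoop.fs.FSDataInputStream
import org.apache.hadoop.io.{BytesWritable, LongWritable}
import org.apache.hadoop.mapreduce.{InputSplit, RecordReader, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, FileSplit}

// Hands out fixed-size binary records; for variable-length or unaligned records
// you would also need to override isSplitable or align reads to record boundaries.
class GraphInputFormat extends FileInputFormat[LongWritable, BytesWritable] {
  override def createRecordReader(split: InputSplit,
      context: TaskAttemptContext): RecordReader[LongWritable, BytesWritable] =
    new GraphRecordReader
}

class GraphRecordReader extends RecordReader[LongWritable, BytesWritable] {
  private val recordLen = 12          // assumed fixed record length in bytes
  private var in: FSDataInputStream = _
  private var start = 0L
  private var pos = 0L
  private var end = 0L
  private val key = new LongWritable
  private val value = new BytesWritable

  override def initialize(split: InputSplit, context: TaskAttemptContext): Unit = {
    val fileSplit = split.asInstanceOf[FileSplit]
    val path = fileSplit.getPath
    val fs = path.getFileSystem(context.getConfiguration)
    in = fs.open(path)
    start = fileSplit.getStart
    end = start + fileSplit.getLength
    pos = start
    in.seek(start)
  }

  // Read one fixed-size record per call; the key is the record's byte offset.
  override def nextKeyValue(): Boolean = {
    if (pos >= end) return false
    val buf = new Array[Byte](recordLen)
    in.readFully(buf)
    key.set(pos)
    value.set(buf, 0, recordLen)
    pos += recordLen
    true
  }

  override def getCurrentKey: LongWritable = key
  override def getCurrentValue: BytesWritable = value
  override def getProgress: Float =
    if (end == start) 1.0f else math.min(1.0f, (pos - start).toFloat / (end - start))
  override def close(): Unit = if (in != null) in.close()
}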
Yong

From: kvi...@vt.edu
Date: Fri, 3 Apr 2015 11:41:13 -0400
Subject: Re: Reading a large file (binary) into RDD
To: deanwamp...@gmail.com
CC: java8...@hotmail.com; user@spark.apache.org

Thanks everyone for the inputs.
I guess I will try out a custom implementation of InputFormat. But I have no 
idea where to start. Are there any code examples of this that might help?
On Fri, Apr 3, 2015 at 9:15 AM, Dean Wampler <deanwamp...@gmail.com> wrote:
This might be overkill for your needs, but the scodec parser combinator library 
might be useful for creating a parser.
https://github.com/scodec/scodec
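For example, a tiny, untested sketch (the record layout here is invented purely
for illustration):

import scodec._
import scodec.bits._
import scodec.codecs._

// Hypothetical record: a 32-bit id followed by a 64-bit value.
val record: Codec[(Int, Long)] = int32 ~ int64

val bits = hex"0000002a0000000000000007".bits
record.decode(bits)  // => Attempt.Successful(DecodeResult((42, 7), ...))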
Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition (O'Reilly)
Typesafe
@deanwampler
http://polyglotprogramming.com

On Thu, Apr 2, 2015 at 6:53 PM, java8964 <java8...@hotmail.com> wrote:



I think implementing your own InputFormat and using SparkContext.hadoopFile() 
is the best option for your case.
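Roughly, the Spark-side wiring might look like this (GraphInputFormat and the
path are placeholders; newAPIHadoopFile is the variant for the newer mapreduce
API, hadoopFile is its equivalent for the old mapred API):

import org.apache.hadoop.io.{BytesWritable, LongWritable}

// GraphInputFormat is a hypothetical FileInputFormat[LongWritable, BytesWritable].
val records = sc.newAPIHadoopFile(
  "hdfs:///data/graph.bin",            // placeholder path
  classOf[GraphInputFormat],
  classOf[LongWritable],
  classOf[BytesWritable])

// Hadoop reuses Writable instances, so copy the bytes out before caching.
val raw = records.map { case (_, v) => v.copyBytes() }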
Yong

From: kvi...@vt.edu
Date: Thu, 2 Apr 2015 17:31:30 -0400
Subject: Re: Reading a large file (binary) into RDD
To: freeman.jer...@gmail.com
CC: user@spark.apache.org

The file has a specific structure. I outline it below.
The input file is basically a representation of a graph.

INT
INT           (A)
LONG          (B)
A INTs        (Degrees)
A SHORTINTs   (Vertex_Attribute)
B INTs
B INTs
B SHORTINTs
B SHORTINTs

A - number of vertices
B - number of edges (note that the INTs/SHORTINTs associated with this are edge
attributes)
After reading in the file, I need to create two RDDs (one with vertices and the 
other with edges)
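For reference, a rough, untested sketch of how that layout could be decoded from
a byte array with java.nio.ByteBuffer (field names, big-endian byte order, and
the meaning of the two edge INT arrays are assumptions):

import java.nio.{ByteBuffer, ByteOrder}

def parseGraph(bytes: Array[Byte]): Unit = {
  val buf = ByteBuffer.wrap(bytes).order(ByteOrder.BIG_ENDIAN)
  val header      = buf.getInt()                     // leading INT
  val a           = buf.getInt()                     // A: number of vertices
  val b           = buf.getLong().toInt              // B: number of edges (assumes it fits an Int)
  val degrees     = Array.fill(a)(buf.getInt())      // A INTs  (Degrees)
  val vertexAttrs = Array.fill(a)(buf.getShort())    // A SHORTINTs (Vertex_Attribute)
  val edgeInts1   = Array.fill(b)(buf.getInt())      // B INTs (e.g. edge endpoints)
  val edgeInts2   = Array.fill(b)(buf.getInt())      // B INTs
  val edgeAttrs1  = Array.fill(b)(buf.getShort())    // B SHORTINTs (edge attributes)
  val edgeAttrs2  = Array.fill(b)(buf.getShort())    // B SHORTINTs
  // the vertex and edge arrays could then feed sc.parallelize(...) to build the two RDDs
}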
On Thu, Apr 2, 2015 at 4:46 PM, Jeremy Freeman <freeman.jer...@gmail.com> wrote:
Hm, that will indeed be trickier because this method assumes records are the 
same byte size. Is the file an arbitrary sequence of mixed types, or is there 
structure, e.g. short, long, short, long, etc.? 
If you could post a gist with an example of the kind of file and how it should 
look once read in that would be useful!


-------------------------
jeremyfreeman.net
@thefreemanlab



On Apr 2, 2015, at 2:09 PM, Vijayasarathy Kannan <kvi...@vt.edu> wrote:
Thanks for the reply. Unfortunately, in my case, the binary file is a mix of
short and long integers. Is there any other way that could be of use here?
My current method happens to have a large overhead (much more than actual 
computation time). Also, I am short of memory at the driver when it has to read 
the entire file.
On Thu, Apr 2, 2015 at 1:44 PM, Jeremy Freeman <freeman.jer...@gmail.com> wrote:
If it’s a flat binary file and each record is the same length (in bytes), you 
can use Spark’s binaryRecords method (defined on the SparkContext), which loads 
records from one or more large flat binary files into an RDD. Here’s an example 
in python to show how it works:
# write data from an array
from numpy import random
dat = random.randn(100,5)
f = open('test.bin', 'w')
f.write(dat)
f.close()

# load the data back in
from numpy import frombuffer
nrecords = 5
bytesize = 8
recordsize = nrecords * bytesize
data = sc.binaryRecords('test.bin', recordsize)
parsed = data.map(lambda v: frombuffer(buffer(v, 0, recordsize), 'float'))

# these should be equal
parsed.first()
dat[0,:]
Does that help?
-------------------------
jeremyfreeman.net
@thefreemanlab


On Apr 2, 2015, at 1:33 PM, Vijayasarathy Kannan <kvi...@vt.edu> wrote:
What are some efficient ways to read a large file into RDDs?
For example, having several executors each read a specific/unique portion of the
file and construct RDDs. Is this possible to do in Spark?
Currently, I am doing a line-by-line read of the file at the driver and 
constructing the RDD.