Have you been able to confirm this behaviour since posting? Have you tried
this out on multiple workers and viewed their memory consumption? 

I'm new to Spark and don't have a cluster to play with at present, and want
to do similar loading from NFS files. 

My understanding is that a call like SparkContext.textFile("filename.csv", 5)
in this example will create 5 partitions, which means 5 workers could read
the same CSV file simultaneously, but each would read a different byte range
of the file (i.e. no worker reads the entire file, just roughly 1/5th of it).
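To make that concrete, here is a plain-Python sketch (not Spark's actual implementation) of the idea behind Hadoop-style input splits, which textFile() builds on: divide the file into N byte ranges, and have each reader skip forward to the next newline at its start offset (unless it starts at byte 0), then read whole lines until it passes the end of its range. The helper names and the record layout are my own, just for illustration.

```python
# Hypothetical sketch of byte-range splits over a newline-delimited file.
# Convention: a reader owns every line whose FIRST byte falls in [start, end).
# A line straddling a boundary is read entirely by the range it starts in.
import io

def read_split(data: bytes, start: int, end: int) -> list[bytes]:
    """Return the complete lines whose first byte falls in [start, end)."""
    f = io.BytesIO(data)
    if start > 0:
        f.seek(start - 1)
        f.readline()          # skip to the start of the next full line
    lines = []
    while f.tell() < end:
        line = f.readline()
        if not line:
            break
        lines.append(line.rstrip(b"\n"))
    return lines

def split_ranges(size: int, n: int) -> list[tuple[int, int]]:
    """Divide `size` bytes into n contiguous, roughly equal ranges."""
    step = (size + n - 1) // n
    return [(i * step, min((i + 1) * step, size)) for i in range(n)]

data = b"\n".join(b"row%03d" % i for i in range(20)) + b"\n"
parts = [read_split(data, s, e) for s, e in split_ranges(len(data), 5)]

# Every line is read by exactly one "worker", and together they cover the file.
assert sum(parts, []) == data.rstrip(b"\n").split(b"\n")
```

The newline-skipping step is the key trick: it lets readers start at arbitrary byte offsets without ever splitting a record between two partitions.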


dbakumar wrote
> I am new to Spark and trying to understand RDDs. I have a 30GB file (CSV,
> NFS-mounted), one master node, and three worker nodes. Does each Spark
> worker load the full 30GB file, or does Spark allocate partitions
> automatically so that each worker loads only its allocated partition into
> memory?

I am also wondering how best to group the data once it is loaded because, in
my case, I will want the RDD partitioned by a business key, which, AFAIK,
will require a reshuffle.
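For intuition about why partitioning by a business key forces a shuffle, here is a simplified plain-Python sketch (not Spark code; the function and data are made up for illustration) of hash partitioning: every record with the same key is assigned to the same partition index, so records must physically move to whichever worker owns that partition.

```python
# Simplified model of hash partitioning by a business key. Spark's
# HashPartitioner does something analogous on (key, value) RDDs; this is
# only a standalone sketch of the assignment rule, not Spark's internals.

def hash_partition(records, key_fn, num_partitions):
    """Assign each record to a partition by hashing its business key."""
    partitions = [[] for _ in range(num_partitions)]
    for rec in records:
        idx = hash(key_fn(rec)) % num_partitions
        partitions[idx].append(rec)
    return partitions

rows = [("acme", 1), ("globex", 2), ("acme", 3), ("initech", 4)]
parts = hash_partition(rows, key_fn=lambda r: r[0], num_partitions=3)

# All rows for a given key end up co-located in exactly one partition.
for key in {"acme", "globex", "initech"}:
    holders = [i for i, p in enumerate(parts) if any(r[0] == key for r in p)]
    assert len(holders) == 1
```

The cost is the data movement itself: after loading, rows for one key may start out scattered across all input splits, so grouping them requires sending each row across the network to its key's partition.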

See my question:
http://stackoverflow.com/questions/28415258/apache-spark-loading-csv-files-from-nfs-and-partitioning-the-data



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-nature-of-file-split-tp21445p21574.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]