Have you been able to confirm this behaviour since posting? Have you tried
this out on multiple workers and viewed their memory consumption? 

I'm new to Spark and don't have a cluster to play with at present, and want
to do similar loading from NFS files. 

My understanding is that a call like SparkContext.textFile("filename.csv", 5)
in this example will create 5 partitions, which means 5 workers could read
the same CSV file simultaneously, but each would read a different byte range
of the file (i.e. no worker reads the entire file, just roughly 1/5th of it).
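To make that concrete, here is a plain-Python sketch (not Spark's actual implementation) of the idea behind Hadoop-style input splits, which textFile() builds on: divide the file into N byte ranges, and have each reader skip forward to the next newline at its start offset (unless it starts at byte 0), then read whole lines until it passes the end of its range. The helper names and the record layout are my own, just for illustration.

```python
# Hypothetical sketch of byte-range splits over a newline-delimited file.
# Convention: a reader owns every line whose FIRST byte falls in [start, end).
# A line straddling a boundary is read entirely by the range it starts in.
import io

def read_split(data: bytes, start: int, end: int) -> list[bytes]:
    """Return the complete lines whose first byte falls in [start, end)."""
    f = io.BytesIO(data)
    if start > 0:
        f.seek(start - 1)
        f.readline()          # skip to the start of the next full line
    lines = []
    while f.tell() < end:
        line = f.readline()
        if not line:
            break
        lines.append(line.rstrip(b"\n"))
    return lines

def split_ranges(size: int, n: int) -> list[tuple[int, int]]:
    """Divide `size` bytes into n contiguous, roughly equal ranges."""
    step = (size + n - 1) // n
    return [(i * step, min((i + 1) * step, size)) for i in range(n)]

data = b"\n".join(b"row%03d" % i for i in range(20)) + b"\n"
parts = [read_split(data, s, e) for s, e in split_ranges(len(data), 5)]

# Every line is read by exactly one "worker", and together they cover the file.
assert sum(parts, []) == data.rstrip(b"\n").split(b"\n")
```

The newline-skipping step is the key trick: it lets readers start at arbitrary byte offsets without ever splitting a record between two partitions.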


dbakumar wrote
> I am new to Spark and trying to understand RDDs. I have a 30GB file (CSV,
> NFS-mounted), one master node, and three worker nodes. Does each Spark
> worker load the full 30GB file, or does Spark allocate partitions
> automatically so that each worker loads only its allocated partition into
> memory?

I am also wondering how best to group the data once it is loaded because, in
my case, I will want the RDD partitioned by a business key, which, AFAIK,
will require a reshuffle.
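For intuition about why partitioning by a business key forces a shuffle, here is a simplified plain-Python sketch (not Spark code; the function and data are made up for illustration) of hash partitioning: every record with the same key is assigned to the same partition index, so records must physically move to whichever worker owns that partition.

```python
# Simplified model of hash partitioning by a business key. Spark's
# HashPartitioner does something analogous on (key, value) RDDs; this is
# only a standalone sketch of the assignment rule, not Spark's internals.

def hash_partition(records, key_fn, num_partitions):
    """Assign each record to a partition by hashing its business key."""
    partitions = [[] for _ in range(num_partitions)]
    for rec in records:
        idx = hash(key_fn(rec)) % num_partitions
        partitions[idx].append(rec)
    return partitions

rows = [("acme", 1), ("globex", 2), ("acme", 3), ("initech", 4)]
parts = hash_partition(rows, key_fn=lambda r: r[0], num_partitions=3)

# All rows for a given key end up co-located in exactly one partition.
for key in {"acme", "globex", "initech"}:
    holders = [i for i, p in enumerate(parts) if any(r[0] == key for r in p)]
    assert len(holders) == 1
```

The cost is the data movement itself: after loading, rows for one key may start out scattered across all input splits, so grouping them requires sending each row across the network to its key's partition.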

See my question:
http://stackoverflow.com/questions/28415258/apache-spark-loading-csv-files-from-nfs-and-partitioning-the-data



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-nature-of-file-split-tp21445p21574.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]