What does Spark cache() actually do?

2014-05-16 Thread PengWeiPRC
Hi there, I was wondering if someone could explain to me how the cache() function works in Spark in these cases: (1) If I have a huge file, say 1TB, which cannot be entirely stored in memory, what will happen if I try to create an RDD of this huge file and cache it? (2) If it works in Spark, it can
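A minimal sketch of the behavior being asked about, assuming a running SparkContext `sc` and a hypothetical HDFS path: cache() is shorthand for persist(StorageLevel.MEMORY_ONLY), so Spark keeps only the partitions that fit in memory and recomputes the rest from lineage on each access, rather than failing on a dataset larger than RAM.

```scala
import org.apache.spark.storage.StorageLevel

// Assumes a running SparkContext `sc`; the path is hypothetical.
val lines = sc.textFile("hdfs:///data/huge-file")

// cache() == persist(StorageLevel.MEMORY_ONLY): partitions that fit
// in memory are kept; the rest are dropped and recomputed from the
// RDD's lineage every time they are needed.
lines.cache()

// Alternative: MEMORY_AND_DISK spills partitions that do not fit in
// memory to local disk instead of recomputing them. (An RDD's storage
// level can only be set once, so this would replace cache() above,
// not follow it.)
// lines.persist(StorageLevel.MEMORY_AND_DISK)
```

Whether recomputation (MEMORY_ONLY) or spilling (MEMORY_AND_DISK) is cheaper depends on how expensive the lineage is to re-execute versus local disk I/O.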

Re: How to handle this situation: Huge File Shared by All maps and Each Computer Has one copy?

2014-05-01 Thread PengWeiPRC
Thanks, Rustagi. Yes, the global data is read-only and persists from the beginning to the end of the whole Spark job. Actually, it is not only identical for one Map/Reduce task, but is used by many of my map/reduce tasks. That's why I intend to put the data on each node of my cluster, and hope

How to handle this situation: Huge File Shared by All maps and Each Computer Has one copy?

2014-04-30 Thread PengWeiPRC
Hi there, I was wondering if somebody could give me some suggestions about how to handle this situation: I have a Spark program which first reads a 6GB file (not an RDD) locally, and then does the map/reduce tasks. This 6GB file contains information that will be shared by all the map tasks.
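One standard way to share read-only data with every map task is a broadcast variable, which ships one copy to each executor rather than one copy per task. A minimal sketch, assuming a running SparkContext `sc`, hypothetical paths, and a simple tab-separated key/value format for the shared file:

```scala
// Assumes a running SparkContext `sc`; paths and parsing are hypothetical.
// Load the shared read-only data once on the driver...
val shared: Map[String, String] =
  scala.io.Source.fromFile("/local/path/shared-data.txt")
    .getLines()
    .map { line =>
      val Array(k, v) = line.split("\t", 2)
      (k, v)
    }
    .toMap

// ...then broadcast it so each worker node receives a single copy.
val sharedBc = sc.broadcast(shared)

// Every map task reads the same node-local copy via .value.
val result = sc.textFile("hdfs:///data/input")
  .map(line => sharedBc.value.getOrElse(line, "missing"))
```

For data as large as 6GB, a broadcast variable may strain driver and executor memory; an alternative is to pre-place the file on every node (e.g. with SparkContext.addFile) and have each task read it from local disk.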