Hi there,
I was wondering if someone could explain to me how the cache() function works
in Spark in the following cases:
(1) If I have a huge file, say 1 TB, that cannot be stored entirely in
memory, what will happen if I try to create an RDD from this huge file and
cache it?
(2) If it works in Spark, it can
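For case (1), as far as I know, cache() on an RDD is just shorthand for
persist(StorageLevel.MEMORY_ONLY): partitions that do not fit in memory are
simply not stored and are recomputed from the input whenever they are needed,
so the job does not crash. If you would rather spill the overflow to local
disk, you can request MEMORY_AND_DISK explicitly. A minimal sketch, with the
HDFS paths as placeholders:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    object CacheHugeFile {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("cache-demo"))

        // cache() == persist(StorageLevel.MEMORY_ONLY): partitions that
        // do not fit in memory are skipped and recomputed from the source
        // file on each use, so caching a 1 TB input does not crash the job.
        val cached = sc.textFile("hdfs:///data/huge-1tb-file").cache()
        println(cached.count())

        // To spill the partitions that do not fit to local disk instead
        // of recomputing them, ask for MEMORY_AND_DISK explicitly.
        val spilled = sc.textFile("hdfs:///data/huge-1tb-file")
          .persist(StorageLevel.MEMORY_AND_DISK)
        println(spilled.count())

        sc.stop()
      }
    }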
Thanks, Rustagi. Yes, the global data is read-only and stays unchanged from
the beginning to the end of the whole Spark job. In fact, it is not just
identical within one map/reduce task; it is shared by many of my map/reduce
tasks. That's why I intend to put the data onto each node of my cluster, and
hope
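One way to realize "put the data into each node" without re-reading the file
per task is a per-JVM lazy singleton: copy the file to the same local path on
every worker beforehand, and let each executor load it once on first use.
This is only a sketch under that assumption; the path and the tab-separated
parse are placeholders:

    import scala.io.Source

    // Loaded at most once per executor JVM, the first time any task on
    // that executor touches it. Assumes the file has already been copied
    // to the same local path on every worker node.
    object NodeLocalData {
      lazy val lookup: Map[String, String] =
        Source.fromFile("/local/path/global-data.txt")   // placeholder path
          .getLines()
          .map(_.split("\t", 2))                         // placeholder parse
          .collect { case Array(k, v) => k -> v }
          .toMap
    }

    // Every map task then reads the shared data through the singleton:
    //   rdd.map(key => NodeLocalData.lookup.getOrElse(key, "missing"))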
Hi there,
I was wondering if somebody could give me some suggestions on how to
handle the following situation:
I have a Spark program that first reads a 6 GB file locally (not as an RDD)
and then runs the map/reduce tasks. This 6 GB file contains information that
is shared by all the map tasks.
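A common suggestion for this pattern is to parse the file once on the driver
and hand it to the tasks through a broadcast variable, so Spark ships one
read-only copy per worker node rather than one per task; whether a full 6 GB
structure is practical to broadcast depends on the memory available on your
executors. A sketch of the idea, with the side-file path, the tab-separated
format, and the input path all as placeholders:

    import org.apache.spark.{SparkConf, SparkContext}
    import scala.io.Source

    object ShareSideFile {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("share-side-file"))

        // Parse the side file once on the driver (path and tab-separated
        // format stand in for the real 6 GB file).
        val sideData: Map[String, String] =
          Source.fromFile("/local/path/side-file.txt")
            .getLines()
            .map(_.split("\t", 2))
            .collect { case Array(k, v) => k -> v }
            .toMap

        // Broadcast ships one read-only copy to each worker node instead
        // of serializing the data into every task closure.
        val bcSide = sc.broadcast(sideData)

        val enriched = sc.textFile("hdfs:///data/input")  // placeholder input
          .map(key => key -> bcSide.value.getOrElse(key, "missing"))

        enriched.take(5).foreach(println)
        sc.stop()
      }
    }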