I'm trying to find complete documentation on the internal architecture of
Apache Spark, but I haven't had any luck so far.

For example, I'm trying to understand the following. Assume that we have a
1 TB text file on HDFS (3 nodes in the cluster, replication factor of 1).
The file will be split into 128 MB chunks, and each chunk will be stored on
only one node. We run Spark Workers on those nodes. I know that Spark tries
to work with the data stored in HDFS on the same node (to avoid network
I/O). Say I want to do a word count on this 1 TB text file.
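
Just to make the scenario concrete, I imagine the job would look roughly
like this minimal Scala sketch (the HDFS paths are only placeholders, and I
assume textFile gives one partition per 128 MB HDFS block):

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("WordCount"))

        sc.textFile("hdfs:///data/big.txt")   // placeholder path; one partition per HDFS block (I assume)
          .flatMap(_.split("\\s+"))           // split each line into words
          .map(word => (word, 1L))            // pair each word with a count of 1
          .reduceByKey(_ + _)                 // this step shuffles data between Workers
          .saveAsTextFile("hdfs:///data/wordcount-out")

        sc.stop()
      }
    }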

Here are my questions:

   1. Will Spark load a chunk (128 MB) into RAM, count the words, then drop
   it from memory, and process the chunks sequentially like that? What
   happens if there is not enough available RAM?
   2. When will Spark use non-local data from HDFS?
   3. What if I need to do a more complex task, where the results of each
   iteration on each Worker have to be transferred to all other Workers
   (shuffling?)? Do I need to write them to HDFS myself and then read them
   back? For example, I can't understand how K-means clustering or gradient
   descent works on Spark (a rough sketch of what I imagine is below, after
   this list).
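
Here is a rough sketch (just my guess, not taken from any docs) of how I
imagine iterative gradient descent could look in Spark: only the small
weight vector moves between iterations, via broadcast and reduce, with
nothing written to HDFS by hand. All names, paths, and the input format are
hypothetical:

    import org.apache.spark.{SparkConf, SparkContext}

    object GradientDescentSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("GD-sketch"))

        // Hypothetical format: each line is a label followed by feature values.
        val points = sc.textFile("hdfs:///data/points.txt")
          .map { line =>
            val cols = line.split("\\s+").map(_.toDouble)
            (cols.tail, cols.head)              // (features, label)
          }
          .cache()                              // keep the big dataset in RAM across iterations

        val numFeatures = points.first()._1.length
        val n = points.count().toDouble
        val stepSize = 0.1
        var weights = Array.fill(numFeatures)(0.0)

        for (_ <- 1 to 20) {
          val bw = sc.broadcast(weights)        // ship the small current weights to every Worker
          val gradient = points
            .map { case (x, y) =>
              val pred = x.zip(bw.value).map { case (xi, wi) => xi * wi }.sum
              x.map(_ * (pred - y))             // per-point gradient of the squared loss
            }
            .reduce((a, b) => a.zip(b).map { case (ai, bi) => ai + bi })
          weights = weights.zip(gradient).map { case (w, g) => w - stepSize * g / n }
        }

        println(weights.mkString(" "))
        sc.stop()
      }
    }

Is something like this (broadcast of the small per-iteration result, plus a
reduce back to the driver) how MLlib actually does it, or does it go
through HDFS?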

I would appreciate any link to an Apache Spark architecture guide.

-- 
Best regards,
Vitalii Duk
