I'm trying to understand the conceptual difference between these two
configurations in terms of performance (using a Spark standalone cluster).

Case 1:

1 Node
60 cores
240G of memory
50G of data on the local file system

Case 2:

6 Nodes
10 cores per node
40G of memory per node
50G of data on HDFS
Nodes are connected by a 10G network
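
For concreteness, this is roughly how I'd create the context in each case
(a Scala sketch; the host names and app name are made up, and the
memory/core settings just mirror the numbers above):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._

    // Case 1: one worker offering all 60 cores / 240G
    val conf1 = new SparkConf()
      .setAppName("perf-compare")
      .setMaster("spark://single-node:7077")
      .set("spark.executor.memory", "240g")
      .set("spark.cores.max", "60")

    // Case 2: six workers, each offering 10 cores / 40G
    val conf2 = new SparkConf()
      .setAppName("perf-compare")
      .setMaster("spark://master-node:7077")
      .set("spark.executor.memory", "40g")   // per node
      .set("spark.cores.max", "60")          // 6 nodes x 10 cores each

    val sc = new SparkContext(conf2)         // or conf1 for Case 1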

I just wanted to validate my understanding.

1. Reads in Case 1 will be slower than in Case 2 because, in Case 2,
all 6 nodes can read the data in parallel from HDFS. However, if I switch
Case 1 over to HDFS as well, my read speeds should be conceptually the
same as in Case 2. Correct? (Small read sketch below to show what I mean.)

2. Once the data is loaded, Case 1 will execute operations faster because
there is no network overhead and all shuffle operations stay local.
(Shuffle sketch below.)

3. Obviously, Case 1 is bad from a fault-tolerance point of view because
the single node is a single point of failure.
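
To illustrate (1): the only change on the read side is the URI scheme
(the paths below are hypothetical):

    // Case 1 as described: a single machine scans the local file, so
    // read parallelism is bounded by that one node's disks
    val localRdd = sc.textFile("file:///data/input.csv")

    // Case 2 (or Case 1 switched over to HDFS): input splits map to
    // HDFS blocks, so tasks on all 6 nodes can read their block-local
    // data in parallel
    val hdfsRdd = sc.textFile("hdfs://namenode:8020/data/input.csv")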
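
And for (2), a wide operation that forces a shuffle (keying on the first
CSV field is just an assumption for illustration):

    // Map-side combine happens first, then the partial counts are
    // shuffled between reducers
    val counts = hdfsRdd
      .map(line => (line.split(",")(0), 1L))
      .reduceByKey(_ + _)

    // In Case 1 the shuffled blocks never leave the machine; in Case 2
    // they cross the 10G network between the 6 nodes
    counts.count()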

Thanks
-Soumya
