I'm trying to understand the conceptual difference between these two configurations in terms of performance (using a Spark standalone cluster):
Case 1:

- 1 node
- 60 cores
- 240 GB of memory
- 50 GB of data on the local file system

Case 2:

- 6 nodes
- 10 cores per node
- 40 GB of memory per node
- 50 GB of data on HDFS
- nodes connected by a 10 Gb network

I just want to validate my understanding:

1. Reads in case 1 will be slower than in case 2 because in case 2 all 6 nodes can read the data from HDFS in parallel. However, if I change the file system in case 1 to HDFS, my read speeds should be conceptually the same as in case 2 (see the sketch below). Correct?
2. Once the data is loaded, case 1 will execute operations faster because there is no network overhead and all shuffle operations are local.
3. Obviously, case 1 is worse from a fault-tolerance point of view because the single node is a single point of failure.

Thanks
-Soumya
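Edit: to make question 1 concrete, here is a minimal sketch of how I'd inspect the read parallelism in both cases. The paths and file name are made up, and the partition counts will depend on the block/split size configured:

```scala
import org.apache.spark.sql.SparkSession

object ReadParallelismCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ReadParallelismCheck")
      .getOrCreate()

    // Case 1: local file system. Spark still splits the file, so all
    // 60 cores can process splits in parallel, but every split is
    // ultimately served by one machine's disks.
    val local = spark.read.textFile("file:///data/input-50g.txt")
    println(s"local FS partitions: ${local.rdd.getNumPartitions}")

    // Case 2: HDFS. Splits map to HDFS blocks spread across the 6
    // datanodes, so the disk reads themselves happen on 6 machines.
    val hdfs = spark.read.textFile("hdfs:///data/input-50g.txt")
    println(s"HDFS partitions: ${hdfs.rdd.getNumPartitions}")

    spark.stop()
  }
}
```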