Hi everyone, We have an HDFS setup with a NameNode and three DataNodes, all on EC2 mediums. One of our data partitions mainly holds files fed in from a few Flume instances rolling hourly. Right now this amounts to around three 16 MB files per hour, although our traffic is projected to double in the next few weeks.
Our Mesos cluster likewise consists of a master and three slave nodes on EC2 mediums. Scheduled Spark jobs are launched from the master across the cluster. My question is: for reading on the order of three hours of data at this size, what should the expected Spark performance be? For a simple count over the thousands of entries serialized in these sequence files, we are seeing query times of around 180-200 seconds. While this is certainly faster than Hadoop, we were under the impression that response times would be significantly faster than this. Has anyone tested Spark + HDFS on instances smaller than the XLs?
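For reference, the count query in question is roughly of this shape (the HDFS path, Writable key/value types, and app name below are illustrative assumptions, not our exact code):

```scala
// Hypothetical sketch of the hourly count job (paths/types are assumptions).
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.io.{LongWritable, Text}

object HourlyCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("HourlyCount")
    val sc = new SparkContext(conf)

    // Glob over ~3 hours of hourly-rolled Flume output on HDFS.
    val events = sc.sequenceFile[LongWritable, Text](
      "hdfs://namenode:9000/flume/events/{00,01,02}/*")

    // The "simple count query" described above.
    println(s"count = ${events.count()}")
    sc.stop()
  }
}
```

So nothing exotic: one `sequenceFile` load followed by a single `count()` action.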
