Hi everyone, We have an HDFS setup with a NameNode and three DataNodes, all on EC2 mediums. One of our data partitions mainly holds files fed in from a few Flume instances rolling hourly. Right now this amounts to around three 16 MB files per hour, although our traffic is projected to double in the next few weeks.
Our Mesos cluster likewise consists of a master and three slave nodes on EC2 mediums. Scheduled Spark jobs are launched from the master across the cluster. My question is: for reading on the order of three hours of data at this size, what should the expected Spark performance be? For a simple count over the thousands of entries serialized in these sequence files, we are seeing query times of around 180-200 seconds. While this is certainly faster than Hadoop, we were under the impression that response times would be significantly faster than this. Has anyone tested Spark + HDFS on instances smaller than the XLs?
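For reference, the count query in question is roughly of this shape (the HDFS path, Writable key/value types, and app name below are illustrative assumptions, not our exact code):

```scala
// Hypothetical sketch of the hourly count job (paths/types are assumptions).
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.io.{LongWritable, Text}

object HourlyCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("HourlyCount")
    val sc = new SparkContext(conf)

    // Glob over ~3 hours of hourly-rolled Flume output on HDFS.
    val events = sc.sequenceFile[LongWritable, Text](
      "hdfs://namenode:9000/flume/events/{00,01,02}/*")

    // The "simple count query" described above.
    println(s"count = ${events.count()}")
    sc.stop()
  }
}
```

So nothing exotic: one `sequenceFile` load followed by a single `count()` action.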
