We have a similar setup using 3 Large EC2 nodes.  Flume pushes about 64 MB of 
logs to S3 roughly every 2 minutes, and Spark is able to read a single 64 MB 
file from S3 and process it in about 30 seconds (doing multiple maps and a 
reduce by key).  

When we first started out, though, we saw very long processing times, on the 
order of 6 minutes for a 64 MB file.  It turned out to be caused by one of our 
map closures referencing a singleton object that was created outside of the 
closure.  
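
For illustration only, here is a minimal plain-Python sketch (no Spark 
involved) of one common way to avoid sharing an outside object: build the 
helper inside the function that runs on each partition.  LineParser, 
process_partition and run are hypothetical names, the tab-separated record 
format is an assumption, and the thread pool merely stands in for worker 
threads.  In Spark itself the analogous place to construct the helper would be 
inside mapPartitions.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class LineParser:
    """Hypothetical stand-in for a helper object that a map closure needs."""

    instances_created = 0
    _count_lock = threading.Lock()

    def __init__(self):
        # Count constructions so we can see how many helpers were built.
        with LineParser._count_lock:
            LineParser.instances_created += 1

    def parse(self, line):
        # Assumed record format: "key<TAB>integer".
        key, _, value = line.partition("\t")
        return key, int(value)


def process_partition(lines):
    # The helper is created here, inside the per-partition function,
    # instead of being a shared object referenced from outside it.
    parser = LineParser()
    return [parser.parse(line) for line in lines]


def run(partitions, workers=2):
    # The thread pool only imitates parallel workers; results come back
    # in the same order as the input partitions.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_partition, partitions))
```

Each partition gets its own helper, so nothing is shared between the workers 
processing different partitions.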

I don't know if that's the case here, but the first thing I would try is 
running the job locally and using something like VisualVM to see how many 
threads it's using.

--Russell

On Oct 1, 2013, at 10:54 AM, Gary Malouf <[email protected]> wrote:

> Hi everyone,
> 
> We have an HDFS setup of a namenode and three datanodes, all on EC2 mediums.  
> One of our data partitions basically has files that are fed from a few Flume 
> instances rolling hourly.  This equates to around 3 16 MB files right now, 
> although our traffic even now is projected to double in the next few weeks.
> 
> Our Mesos cluster consists of a Master and three slave nodes on EC2 mediums 
> as well.  Spark scheduled jobs are launched from the master across the 
> cluster.  
> 
> My question is, for grabbing on the order of 3 hours of data this size, what 
> would the expected Spark performance be?  For a simple count query of our 
> thousands of data entries serialized in these sequence files, we are seeing 
> query times of around 180-200 seconds.  While this is surely faster than 
> Hadoop, we were under the impression that the response times would be 
> significantly faster than this.
> 
> Has anyone tested Spark+HDFS on instances smaller than the XL's?
> 
> 
