Re: Spark performance on smallerish data sets: EC2 Mediums

Russell Cardullo Tue, 01 Oct 2013 20:47:18 -0700

No, we're reading from S3 rather than HDFS.


On Oct 1, 2013, at 8:42 PM, Gary Malouf <[email protected]> wrote:

> Hi Russell,
> 
> I don't think I made it clear that my setup has HDFS on separate nodes from
> Spark.  It sounds like your setup has them together, right?
> 
> 
> On Tue, Oct 1, 2013 at 11:23 PM, Russell Cardullo <[email protected]> 
> wrote:
> We have a similar setup using 3 Large EC2 nodes.  We get 64 MB of logs from
> Flume pushed to S3 roughly every 2 minutes, and Spark can read a single
> 64 MB file from S3 and process it in about 30 seconds (doing multiple
> maps and a reduce by key).
> 
> When we first started out, though, we saw very long processing times, on the
> order of 6 minutes for a 64 MB file.  It turned out to be caused by one of
> our map closures referencing a singleton object that was created
> outside of the filter closure.
> 
> I don't know if that's the case here, but the first thing I would try is
> running the job locally and using something like VisualVM to see how many
> threads it's using.
> 
> --Russell
> 
> On Oct 1, 2013, at 10:54 AM, Gary Malouf <[email protected]> wrote:
> 
> > Hi everyone,
> >
> > We have an HDFS setup of a namenode and three datanodes, all on EC2
> > mediums.  One of our data partitions has files that are fed from
> > a few Flume instances rolling hourly.  This equates to around three 16 MB
> > files right now, although our traffic is projected to double in the
> > next few weeks.
> >
> > Our Mesos cluster consists of a Master and three slave nodes on EC2 mediums 
> > as well.  Spark scheduled jobs are launched from the master across the 
> > cluster.
> >
> > My question is, for grabbing on the order of 3 hours of data this size, 
> > what would the expected Spark performance be?  For a simple count query of 
> > our thousands of data entries serialized in these sequence files, we are
> > seeing query times of around 180-200 seconds.  While this is surely faster 
> > than Hadoop, we were under the impression that the response times would be 
> > significantly faster than this.
> >
> > Has anyone tested Spark+HDFS on instances smaller than the XL's?
> >
> >
> 
>
