Hi Russell, I don't think I made it clear that my setup has HDFS on separate nodes from Spark. It sounds like your setup has them together, right?
On Tue, Oct 1, 2013 at 11:23 PM, Russell Cardullo <[email protected]> wrote:

> We have a similar setup using 3 Large EC2 nodes. We get 64 MB of logs from
> Flume roughly every 2 minutes pushed to S3, and are able to have Spark read
> a single 64 MB file from S3 and process it in about 30 seconds (doing
> multiple maps and a reduce by key).
>
> When we first started out, though, we saw very long processing times, on
> the order of 6 minutes for a 64 MB file. It turned out to be caused by one
> of our map closures that was referencing a singleton object that was
> created outside of the filter closure.
>
> I don't know if that's the case here, but the first thing I would check is
> to run the job locally and use something like VisualVM to see how many
> threads it's using.
>
> --Russell
>
> On Oct 1, 2013, at 10:54 AM, Gary Malouf <[email protected]> wrote:
>
> > Hi everyone,
> >
> > We have an HDFS setup of a namenode and three datanodes, all on EC2
> > mediums. One of our data partitions basically has files that are fed from
> > a few Flume instances rolling hourly. This equates to around three 16 MB
> > files right now, although our traffic is projected to double in the
> > next few weeks.
> >
> > Our Mesos cluster consists of a master and three slave nodes, on EC2
> > mediums as well. Spark scheduled jobs are launched from the master across
> > the cluster.
> >
> > My question is, for grabbing on the order of 3 hours of data this size,
> > what would the expected Spark performance be? For a simple count query of
> > our thousands of data entries serialized in these sequence files, we are
> > seeing query times of around 180-200 seconds. While this is surely faster
> > than Hadoop, we were under the impression that the response times would be
> > significantly faster than this.
> >
> > Has anyone tested Spark+HDFS on instances smaller than the XLs?
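
For a rough sense of scale, the figures quoted above can be turned into an effective throughput. This is only a back-of-the-envelope sketch: it assumes "three 16 MB files" means three files per hourly roll (so 3 hours is about 9 files), and the variable names are made up for illustration. The two jobs also do different work (a count versus multiple maps and a reduce by key), so the comparison is indicative at best.

```python
# Back-of-the-envelope throughput from the numbers quoted in the thread.
# ASSUMPTION: "three 16 MB files" is per hourly roll, so 3 hours ~ 9 files.
# If it instead means 3 files in total, the HDFS input is only 48 MB.

MB_PER_FILE = 16
FILES_PER_HOUR = 3  # assumed interpretation, see note above
HOURS = 3


def throughput_mb_per_s(total_mb, seconds):
    """Effective end-to-end throughput in MB/s."""
    return total_mb / seconds


# HDFS + Spark on EC2 mediums: count query taking 180-200 seconds.
hdfs_total_mb = MB_PER_FILE * FILES_PER_HOUR * HOURS
hdfs_best = throughput_mb_per_s(hdfs_total_mb, 180)
hdfs_worst = throughput_mb_per_s(hdfs_total_mb, 200)

# S3 + Spark on 3 Large nodes: one 64 MB file in about 30 seconds.
s3_rate = throughput_mb_per_s(64, 30)

print(f"HDFS input: {hdfs_total_mb} MB")
print(f"HDFS job:   {hdfs_worst:.2f} to {hdfs_best:.2f} MB/s")
print(f"S3 job:     {s3_rate:.2f} MB/s")
```

Under that assumption, both jobs move data at a few MB/s or less, far below what disk or network should allow on these instances, which suggests that fixed per-job overhead or a serialization or closure problem dominates the runtime, not raw I/O.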
