I bring this up because the performance we are seeing is dreadful. Based on CPU usage, the bottleneck appears to be CPU on the node running the Spark shell. We have upgraded this node from an EC2 medium to an xl and are seeing slightly better performance, but it is still not great.
My understanding of Spark was that most of the work should be done on the slaves, with only the results going back to the shell at the end when we do a take. From what we are seeing, the client is doing much more work than expected.

On Wed, Nov 13, 2013 at 10:40 PM, Gary Malouf <[email protected]> wrote:
> Hi,
>
> We have an HDFS setup of a namenode and three datanodes, all on EC2
> larges. One of our data partitions basically has files that are fed from a
> few Flume instances rolling *hourly*. This equates to around three 4-8 MB
> files per hour right now.
>
> Our Mesos cluster consists of a master and the three slave nodes, colocated
> on these EC2 larges as well (slaves -> datanodes, Mesos master ->
> namenode). Spark jobs are launched ad hoc from the Spark shell today.
>
> The data is serialized protobuf messages in sequence files. Our
> operations typically consist of deserializing the data, grabbing a few
> primitive fields out of each message, and doing some maps/reduces.
>
> For grabbing on the order of two days of data this size, what would the
> expected Spark performance be? We are seeing simple maps and 'takes' on
> this data take on the order of 15 minutes.
>
> Thanks,
>
> Gary
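For reference, the kind of job described above looks roughly like this from the Spark shell. This is only a sketch: the HDFS path, the `Event` protobuf class, and its `getUserId`/`getTimestamp` accessors are hypothetical stand-ins for our actual schema.

```scala
// Protobuf payloads stored as BytesWritable values in HDFS sequence files.
import org.apache.hadoop.io.{BytesWritable, LongWritable}

val raw = sc.sequenceFile[LongWritable, BytesWritable](
  "hdfs://namenode/flume/events/2013/11/*")

// Deserialize on the slaves and pull out a couple of primitive fields.
// BytesWritable's backing array may be padded, so copy only the valid bytes.
val fields = raw.map { case (_, bytes) =>
  val payload = java.util.Arrays.copyOf(bytes.getBytes, bytes.getLength)
  val msg = Event.parseFrom(payload)  // hypothetical generated protobuf class
  (msg.getUserId, msg.getTimestamp)
}

// take(10) should only compute as many partitions as needed and ship
// ten results back to the shell; the map work runs on the slaves.
fields.take(10)
```

If a job shaped like this is pegging the driver's CPU, the work may not be distributing as assumed, so it is worth confirming in the Mesos/Spark UI which hosts the tasks actually run on.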
