Hi Shay and Timothy, We're very aware of this issue, and we want to improve both the documentation and the out-of-the-box behavior for these cases. Right now the closest thing is the tuning guide here: http://spark.incubator.apache.org/docs/latest/tuning.html, but it's just a small step in this direction. But basically because these issues are quite common, the goal should be to eliminate them completely.
Matei On Oct 22, 2013, at 7:22 AM, Shay Seng <[email protected]> wrote: > Hi Matei, > > I've seen several memory tuning queries on this mailing list, and also heard > the same kinds of queries at the spark meetup. In fact the last bullet point > in Josh Carver(?) slides, the guy from Bizo, was "memory tuning is still a > mystery". > > I certainly had lots of issues in when I first started. From memory issues to > gc issues, things seem to run fine until you try something with 500GB of data > etc. > > I was wondering if you could write up a little white paper or some guide > lines on how to set memory values, and what to look at when something goes > wrong? Eg. I would never gave guessed that countByValue happens on a single > machine etc. > > On Oct 21, 2013 6:18 PM, "Matei Zaharia" <[email protected]> wrote: > Hi there, > > The problem is that countByValue happens in only a single reduce task -- this > is probably something we should fix but it's basically not designed for lots > of values. Instead, do the count in parallel as follows: > > val counts = mapped.map(str => (str, 1)).reduceByKey((a, b) => a + b) > > If this still has trouble, you can also increase the level of parallelism of > reduceByKey by passing it a second parameter for the number of tasks (e.g. > 100). > > BTW one other small thing with your code, flatMap should actually work fine > if your function returns an Iterator to Traversable, so there's no need to > call toList and return a Seq in ngrams; you can just return an > Iterator[String]. > > Matei > > On Oct 21, 2013, at 1:05 PM, Timothy Perrigo <[email protected]> wrote: > > > Hi everyone, > > I am very new to Spark, so as a learning exercise I've set up a small > > cluster consisting of 4 EC2 m1.large instances (1 master, 3 slaves), which > > I'm hoping to use to calculate ngram frequencies from text files of various > > sizes (I'm not doing anything with them; I just thought this would be > > slightly more interesting than the usual 'word count' example). Currently, > > I'm trying to work with a 1GB text file, but running into memory issues. > > I'm wondering what parameters I should be setting (in spark-env.sh) in > > order to properly utilize the cluster. Right now, I'd be happy just to > > have the process complete successfully with the 1 gig file, so I'd really > > appreciate any suggestions you all might have. > > > > Here's a summary of the code I'm running through the spark shell on the > > master: > > > > def ngrams(s: String, n: Int = 3): Seq[String] = { > > (s.split("\\s+").sliding(n)).filter(_.length == n).map(_.mkString(" > > ")).map(_.trim).toList > > } > > > > val text = sc.textFile("s3n://my-bucket/my-1gb-text-file") > > > > val mapped = text.filter(_.trim.length > 0).flatMap(ngrams(_, 3)) > > > > So far so good; the problems come during the reduce phase. With small > > files, I was able to issue the following to calculate the most frequently > > occurring trigram: > > > > val topNgram = (mapped countByValue) reduce((a:(String, Long), b:(String, > > Long)) => if (a._2 > b._2) a else b) > > > > With the 1 gig file, though, I've been running into OutOfMemory errors, so > > I decided to split the reduction to several steps, starting with simply > > issuing countByValue of my "mapped" RDD, but I have yet to get it to > > complete successfully. > > > > SPARK_MEM is currently set to 6154m. I also bumped up the > > spark.akka.framesize setting to 500 (though at this point, I was grasping > > at straws; I'm not sure what a "proper" value would be). What properties > > should I be setting for a job of this size on a cluster of 3 m1.large > > slaves? (The cluster was initially configured using the spark-ec2 scripts). > > Also, programmatically, what should I be doing differently? (For example, > > should I be setting the minimum number of splits when reading the text > > file? If so, what would be a good default?). > > > > I apologize for what I'm sure are very naive questions. I think Spark is a > > fantastic project and have enjoyed working with it, but I'm still very much > > a newbie and would appreciate any help you all can provide (as well as any > > 'rules-of-thumb' or best practices I should be following). > > > > Thanks, > > Tim Perrigo >
