Hi everyone,
I am very new to Spark, so as a learning exercise I've set up a small
cluster consisting of 4 EC2 m1.large instances (1 master, 3 slaves), which
I'm hoping to use to calculate ngram frequencies from text files of various
sizes (I'm not doing anything with them; I just thought this would be
slightly more interesting than the usual 'word count' example). Currently,
I'm trying to work with a 1GB text file, but running into memory issues.
I'm wondering what parameters I should be setting (in spark-env.sh) in
order to properly utilize the cluster. Right now, I'd be happy just to
have the process complete successfully with the 1 gig file, so I'd really
appreciate any suggestions you all might have.
Here's a summary of the code I'm running through the spark shell on the
master:
def ngrams(s: String, n: Int = 3): Seq[String] = {
(s.split("\\s+").sliding(n)).filter(_.length == n).map(_.mkString("
")).map(_.trim).toList
}
val text = sc.textFile("s3n://my-bucket/my-1gb-text-file")
val mapped = text.filter(_.trim.length > 0).flatMap(ngrams(_, 3))
So far so good; the problems come during the reduce phase. With small
files, I was able to issue the following to calculate the most frequently
occurring trigram:
val topNgram = (mapped countByValue) reduce((a:(String, Long), b:(String,
Long)) => if (a._2 > b._2) a else b)
With the 1 gig file, though, I've been running into OutOfMemory errors, so
I decided to split the reduction to several steps, starting with simply
issuing countByValue of my "mapped" RDD, but I have yet to get it to
complete successfully.
SPARK_MEM is currently set to 6154m. I also bumped up the
spark.akka.framesize setting to 500 (though at this point, I was grasping
at straws; I'm not sure what a "proper" value would be). What properties
should I be setting for a job of this size on a cluster of 3 m1.large
slaves? (The cluster was initially configured using the spark-ec2 scripts).
Also, programmatically, what should I be doing differently? (For example,
should I be setting the minimum number of splits when reading the text
file? If so, what would be a good default?).
I apologize for what I'm sure are very naive questions. I think Spark is a
fantastic project and have enjoyed working with it, but I'm still very much
a newbie and would appreciate any help you all can provide (as well as any
'rules-of-thumb' or best practices I should be following).
Thanks,
Tim Perrigo