Is it because countByValue or toArray puts too much stress on the driver
when there are many unique words?
To me this is a typical word count problem, which you can solve as follows
(correct me if I am wrong):
val textFile = sc.textFile("file")
val counts = textFile.flatMap(line => line.split(
Hi Zhan,
Thanks for looking into this. I'm actually using the hash map as an example
of the simplest snippet of code that is failing for me. I know that this is
just the word count. In my actual problem I'm using a Trie data structure
to find substring matches.
On Sun, Aug 17, 2014 at 11:35 PM,
Not sure exactly how you use it. My understanding is that in Spark it is
better to keep the load on the driver as low as possible. Is it possible to
broadcast the trie to the executors, do the computation there, and then
aggregate the counters in the reduce phase?
Thanks.
Zhan Zhang
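
A minimal sketch of the broadcast-then-reduce pattern suggested above, assuming
a plain Set[String] as a stand-in for the Trie. The names BroadcastMatchSketch,
findMatches and countMatches are hypothetical and do not come from the thread;
any serializable lookup structure could take the place of the set.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits, needed on Spark 1.x

object BroadcastMatchSketch {
  // Hypothetical stand-in for the Trie lookup: returns every pattern
  // that occurs as a substring of the given line.
  def findMatches(patterns: Set[String], line: String): Seq[String] =
    patterns.toSeq.filter(p => line.contains(p))

  def countMatches(sc: SparkContext,
                   lines: Seq[String],
                   patterns: Set[String]): Map[String, Long] = {
    // Broadcast the read-only lookup structure instead of capturing it
    // in every task closure.
    val shared = sc.broadcast(patterns)
    sc.parallelize(lines)
      .flatMap(line => findMatches(shared.value, line).map(p => (p, 1L)))
      .reduceByKey(_ + _) // counters are merged on the executors
      .collect()          // one (pattern, count) pair per matched pattern
      .toMap
  }

  def main(args: Array[String]): Unit = {
    // local[2] is an assumption for trying the sketch outside a cluster.
    val conf = new SparkConf().setAppName("broadcast-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val counts = countMatches(
        sc, Seq("spark driver", "spark executor"), Set("spark", "exec"))
      counts.foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

With this shape the driver only receives the aggregated counters, not the
sampled records, so the driver-side toArray and foreach are no longer needed.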
On Aug 18,
Hi Xiangrui,
I actually tried branch-1.1 and master and it resulted in the job being
stuck at the TaskSetManager:
14/08/16 06:55:48 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
14/08/16 06:55:48 INFO scheduler.TaskSetManager: Starting task 1.0:0 as TID 2 on executor 8:
Hi Xiangrui,
I wasn't setting spark.driver.memory. I'll try that and report back.
In terms of this running on the cluster, my assumption was that calling foreach
on an array (I converted samples using toArray) would mean counts is propagated
locally. The object would then be serialized to
Setting spark.driver.memory has no effect. The job still hangs while trying
to compute result.count when I sample more than 35%, regardless of the value
I set for spark.driver.memory.
Here are my settings:
export SPARK_JAVA_OPTS="-Xms5g -Xmx10g -XX:MaxPermSize=10g"
export SPARK_MEM=10g
in