Re: Stalling during large iterative PySpark jobs

2014-01-26 Thread Matei Zaharia
Jeremy, do you happen to have a small test case that reproduces it? Is it with the kmeans example that comes with PySpark?

Matei

On Jan 22, 2014, at 3:03 PM, Jeremy Freeman freeman.jer...@gmail.com wrote:
Thanks for the thoughts Matei! I poked at this some more. I ran top on each of the

GroupByKey implementation.

2014-01-26 Thread Archit Thakur
Hi,

Below is the implementation of groupByKey (v0.8.0):

    def groupByKey(partitioner: Partitioner): RDD[(K, Seq[V])] = {
      def createCombiner(v: V) = ArrayBuffer(v)
      def mergeValue(buf: ArrayBuffer[V], v: V) = buf += v
      val bufs = combineByKey[ArrayBuffer[V]](
        createCombiner _,
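
The snippet above is cut off by the digest, but the pattern it shows (grouping built on combineByKey with an ArrayBuffer combiner) can be exercised directly. Below is a small, hedged sketch in that style, assuming a Spark 0.8.x-era build on the classpath; it is not the continuation of the quoted source.

    import scala.collection.mutable.ArrayBuffer
    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._   // pair-RDD functions such as combineByKey

    // Sketch: grouping expressed through combineByKey, mirroring the
    // createCombiner / mergeValue / mergeCombiners split quoted above.
    object GroupByKeyDemo {
      def main(args: Array[String]) {
        val sc = new SparkContext("local", "GroupByKeyDemo")
        val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
        val grouped = pairs.combineByKey[ArrayBuffer[Int]](
          (v: Int) => ArrayBuffer(v),                                 // createCombiner: start a buffer
          (buf: ArrayBuffer[Int], v: Int) => buf += v,                // mergeValue: append within a partition
          (b1: ArrayBuffer[Int], b2: ArrayBuffer[Int]) => b1 ++= b2)  // mergeCombiners: merge across partitions
        grouped.collect().foreach(println)   // e.g. (a,ArrayBuffer(1, 3)) and (b,ArrayBuffer(2))
        sc.stop()
      }
    }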

Re: ExternalAppendOnlyMap throw no such element

2014-01-26 Thread Patrick Wendell
Hey there,

So one thing you can do is disable the external sorting; this should preserve the behavior exactly as it was in previous releases. It's quite possible that the problem you are having relates to the fact that you have individual records that are 1GB in size. This is a pretty extreme
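
The message does not name the setting, but the external sorting Patrick refers to is the ExternalAppendOnlyMap spilling behavior. Assuming it is controlled by the spark.shuffle.spill property described in the 0.9.x docs (an assumption, not quoted from the thread), turning it off would look roughly like this:

    // Hedged sketch: disable shuffle spilling before the SparkContext is created,
    // restoring the pre-0.9 in-memory combine behavior. The property name
    // spark.shuffle.spill is assumed here, not taken from the message above.
    System.setProperty("spark.shuffle.spill", "false")
    val sc = new org.apache.spark.SparkContext("local", "NoSpillExample")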

Re: how to set SPARK_WORKER_INSTANCES and SPARK_WORKER_CORES optimally

2014-01-26 Thread Chen Jin
Hi Ankit,

Thanks for the detailed explanation. Since my cluster has 5 machines, each with 8 cores and 48g of memory, I meant the following for the entire cluster: (a) gives us 40 workers with one core per worker, while (b) gives 5 workers, each with eight cores. A follow-up question, since
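
To make the two layouts concrete, here is a sketch of the per-machine conf/spark-env.sh settings for each option; the memory figures are illustrative and not from the thread:

    # Option (a): 8 one-core workers per machine -> 40 workers across the 5-machine cluster
    SPARK_WORKER_INSTANCES=8
    SPARK_WORKER_CORES=1
    SPARK_WORKER_MEMORY=5g    # illustrative slice of the 48g per machine

    # Option (b): 1 eight-core worker per machine -> 5 workers in total
    SPARK_WORKER_INSTANCES=1
    SPARK_WORKER_CORES=8
    SPARK_WORKER_MEMORY=40g   # illustrative; leaves headroom for the OS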

Re: Spark Scheduler

2014-01-26 Thread Sai Prasanna
Tathagata Das,

With respect to HDFS, I think the scheduler will return whichever of the replicated nodes are the preferred locations. But on a standalone Spark system using the native filesystem, if partitions are cached, it's straightforward to return the same. If not cached but replicated across 3
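
As a rough illustration of the mechanism under discussion (a sketch, not code from the thread), a custom RDD advertises data locality by overriding getPreferredLocations, and the scheduler consults it per partition; the hostnames below are placeholders for the replicated or cached copies:

    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD

    // Hedged sketch: an RDD that tells the scheduler which hosts hold each partition.
    class LocalFileRDD(sc: SparkContext, numParts: Int) extends RDD[String](sc, Nil) {
      private case class SimplePartition(index: Int) extends Partition

      override def getPartitions: Array[Partition] =
        (0 until numParts).map(i => SimplePartition(i): Partition).toArray

      override def compute(split: Partition, context: TaskContext): Iterator[String] =
        Iterator("data for partition " + split.index)   // placeholder payload

      // The scheduler prefers to launch each task on one of these hosts.
      override def getPreferredLocations(split: Partition): Seq[String] =
        Seq("node1", "node2", "node3")
    }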

Re: ExternalAppendOnlyMap throw no such element

2014-01-26 Thread guojc
Hi Patrick,

I have created the JIRA https://spark-project.atlassian.net/browse/SPARK-1045. It turns out the situation is related to joining two large RDDs, not to the combine process as previously thought.

Best Regards,
Jiacheng Guo

On Mon, Jan 27, 2014 at 11:07 AM, guojc guoj...@gmail.com
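
For context, the shape of the workload described is roughly a plain shuffle join of two large pair RDDs; the paths and tab-separated layout below are illustrative, not taken from the report:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._   // pair-RDD functions such as join

    // Hedged sketch of the failing pattern: joining two large key-value RDDs,
    // which exercises ExternalAppendOnlyMap when spilling is enabled.
    object LargeJoinDemo {
      def main(args: Array[String]) {
        val sc = new SparkContext("local", "LargeJoinDemo")
        val left  = sc.textFile("hdfs:///data/left").map(l => (l.split("\t")(0), l))
        val right = sc.textFile("hdfs:///data/right").map(l => (l.split("\t")(0), l))
        val joined = left.join(right)        // (key, (leftLine, rightLine))
        println(joined.count())
        sc.stop()
      }
    }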

Re: Stalling during large iterative PySpark jobs

2014-01-26 Thread Jeremy Freeman
Yup, hitting it with the included PySpark kmeans example (v0.8.1), so the code for reproducing it is simple. But note that I only see it with a fairly large number of nodes (in our setup, 30 or more). So you should see it if you run KMeans with that many nodes, on any fairly large data set with many iterations

Re: How to create RDD over hashmap?

2014-01-26 Thread Manoj Samel
Thanks for all the suggestions; I am able to make progress on it.

Manoj

On Fri, Jan 24, 2014 at 1:54 PM, Tathagata Das tathagata.das1...@gmail.com wrote:
On this note, you can do something smarter than the basic lookup function. You could convert each partition of the key-value pair RDD into a
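
The quoted reply is cut off by the digest, but the idea it points at (a hedged guess, not Tathagata's exact code) is to turn each partition into an in-memory map once, so repeated lookups probe an index instead of rescanning the data:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._

    // Hedged sketch: build one Map per partition with mapPartitions and cache it,
    // so repeated key lookups avoid scanning every record.
    object HashMapLookupDemo {
      def main(args: Array[String]) {
        val sc = new SparkContext("local", "HashMapLookupDemo")
        val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)), 2)

        // One immutable Map per partition, kept as an RDD so it can be cached and reused.
        val indexed = pairs.mapPartitions(iter => Iterator(iter.toMap))
        indexed.cache()

        // Look up a key by probing each partition's map.
        val hits = indexed.flatMap(m => m.get("b").toSeq).collect()
        println(hits.mkString(","))   // prints: 2
        sc.stop()
      }
    }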