Re: Stalling during large iterative PySpark jobs

2014-01-26 Thread Matei Zaharia
Jeremy, do you happen to have a small test case that reproduces it? Is it with the kmeans example that comes with PySpark?

Matei

On Jan 22, 2014, at 3:03 PM, Jeremy Freeman freeman.jer...@gmail.com wrote:
Thanks for the thoughts Matei! I poked at this some more. I ran top on each of the

GroupByKey implementation.

2014-01-26 Thread Archit Thakur
Hi,

Below is the implementation of groupByKey (v0.8.0):

    def groupByKey(partitioner: Partitioner): RDD[(K, Seq[V])] = {
      def createCombiner(v: V) = ArrayBuffer(v)
      def mergeValue(buf: ArrayBuffer[V], v: V) = buf += v
      val bufs = combineByKey[ArrayBuffer[V]](
        createCombiner _,
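
The snippet above is cut off by the digest, but the pattern it shows (grouping built on combineByKey with an ArrayBuffer combiner) can be exercised directly. Below is a small, hedged sketch in that style, assuming a Spark 0.8.x-era build on the classpath; it is not the continuation of the quoted source.

    import scala.collection.mutable.ArrayBuffer
    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._   // pair-RDD functions such as combineByKey

    // Sketch: grouping expressed through combineByKey, mirroring the
    // createCombiner / mergeValue / mergeCombiners split quoted above.
    object GroupByKeyDemo {
      def main(args: Array[String]) {
        val sc = new SparkContext("local", "GroupByKeyDemo")
        val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
        val grouped = pairs.combineByKey[ArrayBuffer[Int]](
          (v: Int) => ArrayBuffer(v),                                 // createCombiner: start a buffer
          (buf: ArrayBuffer[Int], v: Int) => buf += v,                // mergeValue: append within a partition
          (b1: ArrayBuffer[Int], b2: ArrayBuffer[Int]) => b1 ++= b2)  // mergeCombiners: merge across partitions
        grouped.collect().foreach(println)   // e.g. (a,ArrayBuffer(1, 3)) and (b,ArrayBuffer(2))
        sc.stop()
      }
    }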

Re: ExternalAppendOnlyMap throw no such element

2014-01-26 Thread Patrick Wendell
Hey there,

So one thing you can do is disable the external sorting; this should preserve the behavior exactly as it was in previous releases. It's quite possible that the problem you are having relates to the fact that you have individual records that are 1GB in size. This is a pretty extreme
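
The message does not name the setting, but the external sorting Patrick refers to is the ExternalAppendOnlyMap spilling behavior. Assuming it is controlled by the spark.shuffle.spill property described in the 0.9.x docs (an assumption, not quoted from the thread), turning it off would look roughly like this:

    // Hedged sketch: disable shuffle spilling before the SparkContext is created,
    // restoring the pre-0.9 in-memory combine behavior. The property name
    // spark.shuffle.spill is assumed here, not taken from the message above.
    System.setProperty("spark.shuffle.spill", "false")
    val sc = new org.apache.spark.SparkContext("local", "NoSpillExample")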

Re: how to set SPARK_WORKER_INSTANCES and SPARK_WORKER_CORES optimally

2014-01-26 Thread Chen Jin
Hi Ankit,

Thanks for the detailed explanation. Since my cluster has 5 machines, each with 8 cores and 48g of memory, I meant the following for the entire cluster: (a) gives us 40 workers with one core per worker, while (b) gives 5 workers, each with eight cores. A follow-up question, since
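
To make the two layouts concrete, here is a sketch of the per-machine conf/spark-env.sh settings for each option; the memory figures are illustrative and not from the thread:

    # Option (a): 8 one-core workers per machine -> 40 workers across the 5-machine cluster
    SPARK_WORKER_INSTANCES=8
    SPARK_WORKER_CORES=1
    SPARK_WORKER_MEMORY=5g    # illustrative slice of the 48g per machine

    # Option (b): 1 eight-core worker per machine -> 5 workers in total
    SPARK_WORKER_INSTANCES=1
    SPARK_WORKER_CORES=8
    SPARK_WORKER_MEMORY=40g   # illustrative; leaves headroom for the OS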

Re: Spark Scheduler

2014-01-26 Thread Sai Prasanna
Tathagata Das,

With respect to HDFS, I think the scheduler will return whichever of the replicated nodes are the preferred locations. But on a standalone Spark system using the native filesystem, if partitions are cached, it's straightforward to return the same. If not cached but replicated across 3
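
As a rough illustration of the mechanism under discussion (a sketch, not code from the thread), a custom RDD advertises data locality by overriding getPreferredLocations, and the scheduler consults it per partition; the hostnames below are placeholders for the replicated or cached copies:

    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD

    // Hedged sketch: an RDD that tells the scheduler which hosts hold each partition.
    class LocalFileRDD(sc: SparkContext, numParts: Int) extends RDD[String](sc, Nil) {
      private case class SimplePartition(index: Int) extends Partition

      override def getPartitions: Array[Partition] =
        (0 until numParts).map(i => SimplePartition(i): Partition).toArray

      override def compute(split: Partition, context: TaskContext): Iterator[String] =
        Iterator("data for partition " + split.index)   // placeholder payload

      // The scheduler prefers to launch each task on one of these hosts.
      override def getPreferredLocations(split: Partition): Seq[String] =
        Seq("node1", "node2", "node3")
    }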

Re: ExternalAppendOnlyMap throw no such element

2014-01-26 Thread guojc
Hi Patrick,

I have created the JIRA https://spark-project.atlassian.net/browse/SPARK-1045. It turns out the situation is related to joining two large RDDs, not to the combine process as previously thought.

Best Regards,
Jiacheng Guo

On Mon, Jan 27, 2014 at 11:07 AM, guojc guoj...@gmail.com
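
For context, the shape of the workload described is roughly a plain shuffle join of two large pair RDDs; the paths and tab-separated layout below are illustrative, not taken from the report:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._   // pair-RDD functions such as join

    // Hedged sketch of the failing pattern: joining two large key-value RDDs,
    // which exercises ExternalAppendOnlyMap when spilling is enabled.
    object LargeJoinDemo {
      def main(args: Array[String]) {
        val sc = new SparkContext("local", "LargeJoinDemo")
        val left  = sc.textFile("hdfs:///data/left").map(l => (l.split("\t")(0), l))
        val right = sc.textFile("hdfs:///data/right").map(l => (l.split("\t")(0), l))
        val joined = left.join(right)        // (key, (leftLine, rightLine))
        println(joined.count())
        sc.stop()
      }
    }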

Re: Stalling during large iterative PySpark jobs

2014-01-26 Thread Jeremy Freeman
Yup, hitting it with the included PySpark kmeans example (v0.8.1), so the code for reproducing it is simple. But note that I only see it with a fairly large number of nodes (in our setup, 30 or more). So you should see it if you run KMeans with that many nodes, on any fairly large data set with many iterations

Re: How to create RDD over hashmap?

2014-01-26 Thread Manoj Samel
Thanks for all the suggestions; I am able to make progress on it.

Manoj

On Fri, Jan 24, 2014 at 1:54 PM, Tathagata Das tathagata.das1...@gmail.com wrote:
On this note, you can do something smarter than the basic lookup function. You could convert each partition of the key-value pair RDD into a
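
The quoted reply is cut off by the digest, but the idea it points at (a hedged guess, not Tathagata's exact code) is to turn each partition into an in-memory map once, so repeated lookups probe an index instead of rescanning the data:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._

    // Hedged sketch: build one Map per partition with mapPartitions and cache it,
    // so repeated key lookups avoid scanning every record.
    object HashMapLookupDemo {
      def main(args: Array[String]) {
        val sc = new SparkContext("local", "HashMapLookupDemo")
        val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)), 2)

        // One immutable Map per partition, kept as an RDD so it can be cached and reused.
        val indexed = pairs.mapPartitions(iter => Iterator(iter.toMap))
        indexed.cache()

        // Look up a key by probing each partition's map.
        val hits = indexed.flatMap(m => m.get("b").toSeq).collect()
        println(hits.mkString(","))   // prints: 2
        sc.stop()
      }
    }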