Jeremy, do you happen to have a small test case that reproduces it? Is it with
the kmeans example that comes with PySpark?
Matei
On Jan 22, 2014, at 3:03 PM, Jeremy Freeman freeman.jer...@gmail.com wrote:
Thanks for the thoughts, Matei! I poked at this some more. I ran top on each
of the
Hi,
Below is the implementation of groupByKey (v0.8.0).
def groupByKey(partitioner: Partitioner): RDD[(K, Seq[V])] = {
  def createCombiner(v: V) = ArrayBuffer(v)
  def mergeValue(buf: ArrayBuffer[V], v: V) = buf += v
  val bufs = combineByKey[ArrayBuffer[V]](
    createCombiner _, mergeValue _, null, partitioner, mapSideCombine = false)
  bufs.asInstanceOf[RDD[(K, Seq[V])]]
}
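The createCombiner/mergeValue pattern that groupByKey builds on can be sketched in plain Python (no Spark involved; the function and variable names below are mine, chosen to mirror the Scala snippet):

```python
# Plain-Python sketch of the combineByKey pattern groupByKey is built on.
# create_combiner starts a per-key buffer; merge_value appends to it.

def create_combiner(v):
    return [v]  # analogous to ArrayBuffer(v)

def merge_value(buf, v):
    buf.append(v)  # analogous to buf += v
    return buf

def combine_by_key(pairs, create_combiner, merge_value):
    combined = {}
    for k, v in pairs:
        if k not in combined:
            combined[k] = create_combiner(v)
        else:
            combined[k] = merge_value(combined[k], v)
    return combined

pairs = [("a", 1), ("b", 2), ("a", 3)]
print(combine_by_key(pairs, create_combiner, merge_value))
# {'a': [1, 3], 'b': [2]}
```

This is why all values for a key end up buffered together: every value is folded into one in-memory collection per key.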
Hey There,
So one thing you can do is disable the external sorting; this should
preserve the behavior exactly as it was in previous releases.
It's quite possible that the problem you are having relates to the
fact that you have individual records that are 1GB in size. This is a
pretty extreme
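If this is the 0.9-era external-aggregation path, the property that governs it is spark.shuffle.spill. A config sketch (how you actually pass the property depends on your deployment; the launch command here is illustrative):

```shell
# Disable shuffle spilling (external aggregation) to restore the
# pre-0.9 in-memory behavior. Set spark.shuffle.spill=false via
# whatever mechanism your deployment uses for Spark properties.
SPARK_JAVA_OPTS="-Dspark.shuffle.spill=false" ./spark-shell
```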
Hi Ankit,
Thanks for the detailed explanation. Since my cluster has 5 machines, each
with 8 cores and 48 GB of memory, what I meant for the entire cluster is:
(a) gives us 40 workers with one core per worker; (b) gives 5 workers
with eight cores each.
A follow-up question, since
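The two layouts above can be checked with quick arithmetic (plain Python; the numbers are the cluster described in the message, and the even memory split in (a) is my assumption):

```python
machines, cores_per_machine, mem_gb = 5, 8, 48

# (a) one worker per core: 40 single-core workers
workers_a = machines * cores_per_machine
cores_per_worker_a = 1
mem_per_worker_a = mem_gb // cores_per_machine  # 6 GB each, if split evenly

# (b) one worker per machine: 5 workers with 8 cores each
workers_b = machines
cores_per_worker_b = cores_per_machine
mem_per_worker_b = mem_gb  # 48 GB each

print(workers_a, workers_b)  # 40 5
# Total core count is the same either way; only the per-worker
# memory and parallelism granularity differ.
assert workers_a * cores_per_worker_a == workers_b * cores_per_worker_b
```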
Tathagata Das, with respect to HDFS, I think the job scheduler will return
which of the replicated nodes are the preferred locations. But on a
standalone Spark system using the native filesystem, if partitions are
cached, it's straightforward to return the same. If not cached but
replicated across 3
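The preference logic being discussed can be sketched in plain Python (a minimal sketch with hypothetical helper names; Spark's real getPreferredLocations consults the block manager and the storage system rather than taking host lists as arguments):

```python
# Hypothetical sketch: choose preferred locations for a partition.
# If the partition is cached somewhere, prefer the caching hosts;
# otherwise any host holding a filesystem replica is equally good.

def preferred_locations(cached_hosts, replica_hosts):
    if cached_hosts:               # partition already cached
        return list(cached_hosts)
    return list(replica_hosts)     # fall back to the replicas

print(preferred_locations([], ["node1", "node2", "node3"]))
print(preferred_locations(["node2"], ["node1", "node2", "node3"]))
```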
Hi Patrick,
I have created the JIRA:
https://spark-project.atlassian.net/browse/SPARK-1045. It turns out the
issue is related to joining two large RDDs, not to the combine
process as previously thought.
Best Regards,
Jiacheng Guo
On Mon, Jan 27, 2014 at 11:07 AM, guojc guoj...@gmail.com
Yup, hitting it with the included PySpark k-means example (v0.8.1), so the
code for reproducing it is simple. But note that I only get it with a fairly
large number of nodes (in our setup, 30 or more). So you should see it if you
run KMeans with that many nodes on any fairly large data set with many
iterations.
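For reference, the 0.8.x-era invocation of that bundled example looked roughly like this (the script path and argument order are from memory and should be treated as assumptions; the master URL, input file, k, and convergence threshold are placeholders):

```shell
# Assumed Spark 0.8.x layout: the PySpark examples lived under
# python/examples/. Arguments: master, input path, k, convergence dist.
./pyspark python/examples/kmeans.py spark://<master>:7077 kmeans_data.txt 2 0.1
```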
Thanks to all the suggestions, I was able to make progress on it.
Manoj
On Fri, Jan 24, 2014 at 1:54 PM, Tathagata Das
tathagata.das1...@gmail.comwrote:
On this note, you can do something smarter than the basic lookup function.
You could convert each partition of the key-value pair RDD into a
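The per-partition idea can be sketched in plain Python (no Spark; mapPartitions is simulated with an ordinary list comprehension, and the names are mine): building one hash map per partition up front makes each subsequent lookup a dictionary hit rather than a scan.

```python
# Plain-Python analogue of converting each partition of a key-value
# RDD into a hash map for repeated fast lookups.

partitions = [
    [("a", 1), ("b", 2)],   # partition 0
    [("c", 3), ("a", 4)],   # partition 1
]

# "mapPartitions": build one dict per partition, once.
indexed = [dict(part) for part in partitions]

def lookup(key):
    # Check every partition's map; a real RDD with a known partitioner
    # could go straight to the single partition that can hold the key.
    return [d[key] for d in indexed if key in d]

print(lookup("a"))  # [1, 4]
```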