/rdd/PairRDDFunctions.scala#L89
By the way, I should warn you that groupByKey() is not a recommended
operation if you can avoid it, as it has non-obvious performance problems
when running on large data.
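To illustrate the issue, here is a plain-Python sketch (not PySpark itself; the partition contents below are made up): groupByKey() must ship every record across the shuffle, whereas reduceByKey() pre-combines values within each map-side partition first, so far fewer records cross the network.

```python
from collections import defaultdict

# Two hypothetical map-side partitions of (key, value) pairs.
partitions = [
    [("a", 1), ("a", 2), ("b", 3)],
    [("a", 4), ("b", 5), ("b", 6)],
]

# groupByKey-style: every record crosses the shuffle boundary.
records_shuffled_by_group = sum(len(p) for p in partitions)

# reduceByKey-style: values are combined within each partition first,
# so at most one record per distinct key per partition is shuffled.
map_side_combined = []
for part in partitions:
    local = defaultdict(int)
    for k, v in part:
        local[k] += v
    map_side_combined.extend(local.items())
records_shuffled_by_reduce = len(map_side_combined)

print(records_shuffled_by_group, records_shuffled_by_reduce)  # 6 4
```

With skewed keys the gap grows much larger, which is why reduceByKey (or aggregateByKey) is usually preferred when a per-key aggregate is the end goal.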
On Sat, Jul 12, 2014 at 12:20 PM, Guanhua Yan gh...@lanl.gov wrote:
Hi:
I have trouble understanding the default partitioner (hash) in Spark.
Suppose that an RDD with two partitions is created as follows:
x = sc.parallelize([("a", 1), ("b", 4), ("a", 10), ("c", 7)], 2)
Does Spark partition x based on the hash of the key (e.g., "a", "b", "c") by
default?
(1) Assuming this is
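For what it's worth: parallelize() itself just slices the input list into numSlices chunks by position; hash partitioning comes into play on key-based shuffles (e.g., reduceByKey or partitionBy), where Spark's HashPartitioner places a key in partition hash(key) mod numPartitions, adjusted to be non-negative. A minimal Python sketch of that rule (the function name is mine, not Spark's):

```python
def hash_partition(key, num_partitions):
    # Mirrors HashPartitioner's rule: non-negative modulo of the key's
    # hash. Python's % with a positive divisor is already non-negative;
    # the adjustment is kept only to mirror the JVM logic.
    mod = hash(key) % num_partitions
    return mod + num_partitions if mod < 0 else mod

for key in ["a", "b", "c"]:
    print(key, "->", hash_partition(key, 2))
```

Note that the JVM and Python compute different hash codes for the same key, so the partition a key lands in is consistent within one runtime but not across them.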
12:10 AM, Xiangrui Meng men...@gmail.com wrote:
You have a long lineage that causes the StackOverflow error. Try
rdd.checkpoint() and rdd.count() every 20~30 iterations.
checkpoint() can cut the lineage. -Xiangrui
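As a loose plain-Python analogy (not PySpark; all names below are made up for illustration): each iteration wraps the previous computation in another closure, much as each transformation extends an RDD's lineage. Evaluating one very long chain overflows the call stack, while forcing a concrete value every ~20 steps, as the advice above suggests via checkpoint()/count(), keeps every chain short.

```python
def extend_chain(f, n):
    """Wrap f in n more one-step closures, like n lazy transformations."""
    for _ in range(n):
        f = (lambda prev: lambda x: prev(x) + 1)(f)
    return f

identity = lambda x: x

# A 10,000-step chain overflows Python's call stack when evaluated,
# analogous to the JVM StackOverflowError from a very long lineage.
try:
    extend_chain(identity, 10_000)(0)
    overflowed = False
except RecursionError:
    overflowed = True

# "Checkpointing": force a concrete value every 20 steps so no single
# chain ever gets deep -- same total work, bounded depth.
value = 0
for _ in range(500):
    value = extend_chain(identity, 20)(value)

print(overflowed, value)  # True 10000
```

The analogy is imperfect (Spark hits the overflow when serializing the lineage graph, not via Python recursion), but the remedy has the same shape: periodically materialize so no single dependency chain grows without bound.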
On Mon, May 12, 2014 at 3:42 PM, Guanhua Yan gh...@lanl.gov wrote:
Dear Sparkers:
I am using Python Spark (version 0.9.0) to implement an iterative
algorithm. I got some errors, shown at the end of this email; they appear
to be due to a Java StackOverflowError. The same error has been
reproduced on a Mac desktop and a Linux workstation, both running the
Hi all:
Is it possible to develop Spark programs in Python and run them on YARN?
From the Python SparkContext class, it doesn't seem to have such an option.
Thank you,
- Guanhua
===
Guanhua Yan, Ph.D.
Information Sciences Group (CCS-3)
Los Alamos National Laboratory
https://github.com/apache/spark/pull/30
Matei
On Apr 29, 2014, at 9:51 AM, Guanhua Yan gh...@lanl.gov wrote: