Re: Confused by groupByKey() and the default partitioner

2014-07-13 Thread Guanhua Yan
/rdd/PairRDDFunctions.scala#L89 By the way, I should warn you that groupByKey() is best avoided when you can, as it has non-obvious performance issues when running on serious data. On Sat, Jul 12, 2014 at 12:20 PM, Guanhua Yan gh...@lanl.gov wrote: Hi: I have trouble
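The usual advice is to prefer reduceByKey() whenever the per-key result can be built with an associative function: it combines values on the map side before the shuffle, while groupByKey() ships every individual value across the network. A minimal PySpark sketch contrasting the two (the local master string and sample data are illustrative):

    from pyspark import SparkContext

    sc = SparkContext("local[2]", "groupbykey-vs-reducebykey")
    pairs = sc.parallelize([("a", 1), ("b", 4), ("a", 10), ("c", 7)], 2)

    # groupByKey() shuffles every individual value to the reducers,
    # which can exhaust memory for hot keys.
    grouped = pairs.groupByKey().mapValues(list)

    # reduceByKey() pre-aggregates within each partition, so far less
    # data crosses the network.
    summed = pairs.reduceByKey(lambda a, b: a + b)

    print(grouped.collect())  # e.g. [('a', [1, 10]), ('b', [4]), ('c', [7])]
    print(summed.collect())   # e.g. [('a', 11), ('b', 4), ('c', 7)]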

Confused by groupByKey() and the default partitioner

2014-07-12 Thread Guanhua Yan
Hi: I have trouble understanding the default partitioner (hash) in Spark. Suppose that an RDD with two partitions is created as follows: x = sc.parallelize([('a', 1), ('b', 4), ('a', 10), ('c', 7)], 2) Does Spark partition x based on the hash of the key (e.g., 'a', 'b', 'c') by default? (1) Assuming this is
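For what it's worth, parallelize() on its own slices the input list by position rather than hashing keys; the hash partitioner only enters the picture when a shuffle such as groupByKey() runs. A small sketch for checking this empirically (the partition contents shown are illustrative, not guaranteed):

    from pyspark import SparkContext

    sc = SparkContext("local[2]", "partitioner-check")
    x = sc.parallelize([("a", 1), ("b", 4), ("a", 10), ("c", 7)], 2)

    # parallelize() splits the list by position, so the two ('a', ...)
    # pairs can land in different partitions.
    print(x.glom().collect())
    # e.g. [[('a', 1), ('b', 4)], [('a', 10), ('c', 7)]]

    # After the groupByKey() shuffle, all values for a key share one
    # partition, placed there by the default hash partitioner.
    grouped = x.groupByKey(2)
    print(grouped.mapValues(list).glom().collect())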

Re: java.lang.StackOverflowError when calling count()

2014-05-13 Thread Guanhua Yan
12:10 AM, Xiangrui Meng men...@gmail.com wrote: You have a long lineage that causes the StackOverflowError. Try rdd.checkpoint() and rdd.count() every 20~30 iterations. checkpoint() can cut the lineage. -Xiangrui On Mon, May 12, 2014 at 3:42 PM, Guanhua Yan gh...@lanl.gov wrote: Dear Sparkers
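A sketch of the suggested fix inside an iterative loop; the checkpoint directory, iteration count, and checkpoint interval are placeholders to adapt:

    from pyspark import SparkContext

    sc = SparkContext("local[2]", "checkpoint-demo")
    sc.setCheckpointDir("/tmp/spark-checkpoints")  # any persistent dir works

    rdd = sc.parallelize(range(1000), 4)
    for i in range(100):
        rdd = rdd.map(lambda v: v + 1)  # each pass extends the lineage
        if (i + 1) % 25 == 0:
            # Truncate the lineage every ~25 iterations so the recursive
            # traversal of the dependency graph cannot overflow the stack.
            rdd.checkpoint()
            rdd.count()  # checkpoint() is lazy; an action forces it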

java.lang.StackOverflowError when calling count()

2014-05-12 Thread Guanhua Yan
Dear Sparkers: I am using the Python API of Spark 0.9.0 to implement an iterative algorithm. I got the errors shown at the end of this email; they appear to stem from a Java StackOverflowError. The same error has been reproduced on a Mac desktop and a Linux workstation, both running the
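The pattern behind the error, in sketch form: re-assigning an RDD inside a loop chains one stage per pass, so the lineage graph grows with the iteration count, and evaluating it recursively can overflow the JVM stack. (How many iterations it takes to trigger the error depends on the JVM stack size.)

    from pyspark import SparkContext

    sc = SparkContext("local[2]", "lineage-blowup")
    rdd = sc.parallelize(range(1000), 4)

    # Without checkpointing, each map() adds a node to the lineage.
    for _ in range(2000):
        rdd = rdd.map(lambda v: v + 1)

    rdd.count()  # may raise java.lang.StackOverflowError on deep lineage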

Python Spark on YARN

2014-04-29 Thread Guanhua Yan
Hi all: Is it possible to develop Spark programs in Python and run them on YARN? The Python SparkContext class doesn't seem to offer such an option. Thank you, - Guanhua === Guanhua Yan, Ph.D. Information Sciences Group (CCS-3) Los Alamos National Laboratory

Re: Python Spark on YARN

2014-04-29 Thread Guanhua Yan
: https://github.com/apache/spark/pull/30 Matei On Apr 29, 2014, at 9:51 AM, Guanhua Yan gh...@lanl.gov wrote: Hi all: Is it possible to develop Spark programs in Python and run them on YARN? The Python SparkContext class doesn't seem to offer such an option. Thank you, - Guanhua
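Once the PySpark-on-YARN support from the pull request above is in your build, running against YARN is mostly a matter of pointing the master at the cluster. A hedged sketch, assuming yarn-client mode and that HADOOP_CONF_DIR points at your cluster configuration (the exact master string and deployment steps vary by Spark version):

    from pyspark import SparkConf, SparkContext

    # "yarn-client" was the master string in this era; later releases
    # use --master yarn with a separate deploy mode instead.
    conf = SparkConf().setMaster("yarn-client").setAppName("pyspark-on-yarn")
    sc = SparkContext(conf=conf)

    print(sc.parallelize(range(100), 4).sum())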