Re: aggregateByKey vs combineByKey

2014-09-29 Thread David Rowe
> you need to implement three functions: createCombiner,
> mergeValue, mergeCombiners.
>
> Hope this helps!
> Liquan
>
> On Sun, Sep 28, 2014 at 11:59 PM, David Rowe wrote:
>
>> Hi All,
>>
>> After some hair pulling, I've reached the realisation that an operation I …
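A minimal sketch of what those three functions can look like, computing a per-key (sum, count) and then a mean; the RDD name `prices`, its (String, Double) element type, and a live SparkContext `sc` are assumptions for illustration:

    // Assumed setup for illustration: sc is a SparkContext.
    val prices = sc.parallelize(Seq(("a", 1.0), ("a", 3.0), ("b", 2.0)))

    val sumCounts = prices.combineByKey(
      // createCombiner: start a (sum, count) pair the first time a key is seen
      (v: Double) => (v, 1L),
      // mergeValue: fold one more value into a partition-local combiner
      (acc: (Double, Long), v: Double) => (acc._1 + v, acc._2 + 1L),
      // mergeCombiners: merge partial (sum, count) pairs across partitions
      (a: (Double, Long), b: (Double, Long)) => (a._1 + b._1, a._2 + b._2))

    val meanByKey = sumCounts.mapValues { case (sum, count) => sum / count }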

aggregateByKey vs combineByKey

2014-09-29 Thread David Rowe
Hi All,

After some hair pulling, I've reached the realisation that an operation I am currently doing via:

    myRDD.groupByKey.mapValues(func)

should be done more efficiently using aggregateByKey or combineByKey. Both of these methods would do, and they seem very similar to me in terms of their func…
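When func is a fold, the rewrite is mechanical. A minimal sketch, assuming myRDD is an RDD[(K, Double)] and that func sums each key's values (both assumptions for illustration):

    // groupByKey materialises every group before func runs; aggregateByKey
    // combines values map-side, so far less data crosses the shuffle.
    val sums = myRDD.aggregateByKey(0.0)(
      (acc, v) => acc + v,  // seqOp: fold a value into the partition-local accumulator
      (a, b) => a + b)      // combOp: merge accumulators from different partitions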

Re: Where can I find the module diagram of SPARK?

2014-09-23 Thread David Rowe
Hi Andrew,

I can't speak for Theodore, but I would find that incredibly useful.

Dave

On Wed, Sep 24, 2014 at 11:24 AM, Andrew Ash wrote:
> Hi Theodore,
>
> What do you mean by module diagram? A high level architecture diagram of
> how the classes are organized into packages?
>
> Andrew
>
> On …

Re: Issues with partitionBy: FetchFailed

2014-09-22 Thread David Rowe
> … may be different from the previous code,
> I guess some potential bugs may be introduced.
>
> Thanks
> Jerry
>
> *From:* David Rowe [mailto:davidr...@gmail.com]
> *Sent:* Monday, September 22, 2014 7:12 PM
> *To:* Andrew Ash
> *Cc:* Shao, Saisai; …

Re: Issues with partitionBy: FetchFailed

2014-09-22 Thread David Rowe
> I'm seeing the same using Spark SQL on 1.1.0 -- I think there may have
> been a regression in 1.1 because the same SQL query works on the same
> cluster when back on 1.0.2
>
> Thanks!
> Andrew
>
> On Sun, Sep 21, 2014 at 5:15 AM, David Rowe wrote:
>
>> Hi, …

Re: Issues with partitionBy: FetchFailed

2014-09-21 Thread David Rowe
Hi, I've seen this problem before, and I'm not convinced it's GC. When Spark shuffles, it writes a lot of small files to store the data to be sent to other executors (AFAICT). According to what I've read around the place, the intention is that these files be stored in disk buffers, and since sync() …
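For reference, two settings people commonly reached for on Spark 1.1 when a shuffle produced too many small files. This is a hedged sketch, not a verified fix for this particular FetchFailed, and values should be tuned per cluster:

    val conf = new org.apache.spark.SparkConf()
      // Hash shuffle: reuse output files across map tasks on the same core,
      // cutting the total file count
      .set("spark.shuffle.consolidateFiles", "true")
      // Or switch to the (experimental in 1.1) sort-based shuffle, which
      // writes one data file per map task instead of one per reducer
      .set("spark.shuffle.manager", "SORT")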

Re: Computing mean and standard deviation by key

2014-09-12 Thread David Rowe
Oh I see, I think you're trying to do something like (in SQL):

    SELECT order, mean(price) FROM orders GROUP BY order

In this case, I'm not aware of a way to use the DoubleRDDFunctions, since you have a single RDD of pairs where each pair is of type (KeyType, Iterable[Double]). It seems to me that …
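One way out, as a sketch: skip building the Iterable[Double] per key entirely and accumulate a StatCounter per key with aggregateByKey. The RDD name `orders` and its (String, Double) element type are assumptions for illustration:

    import org.apache.spark.util.StatCounter

    // StatCounter tracks count, mean and variance in a single pass, so the
    // per-key Iterable[Double] never needs to be materialised.
    val statsByKey = orders.aggregateByKey(new StatCounter())(
      (stats, price) => stats.merge(price),  // fold one value into the per-key stats
      (s1, s2) => s1.merge(s2))              // merge partial stats across partitions

    val meanAndStdev = statsByKey.mapValues(s => (s.mean, s.stdev))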

Re: Computing mean and standard deviation by key

2014-09-11 Thread David Rowe
I generally call values.stats, e.g.:

    val stats = myPairRdd.values.stats

On Fri, Sep 12, 2014 at 4:46 PM, rzykov wrote:
> Is it possible to use DoubleRDDFunctions
> <https://spark.apache.org/docs/1.0.0/api/java/org/apache/spark/rdd/DoubleRDDFunctions.html>
> for calculating mean and std dev …
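For context: .values on a pair RDD yields an RDD[Double], which picks up DoubleRDDFunctions implicitly, and .stats returns a StatCounter over all values (global statistics, not per key). A small sketch, with the sample data assumed for illustration:

    val myPairRdd = sc.parallelize(Seq(("a", 1.0), ("a", 3.0), ("b", 2.0)))
    val stats = myPairRdd.values.stats  // org.apache.spark.util.StatCounter
    println(s"count=${stats.count} mean=${stats.mean} stdev=${stats.stdev}")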