Re: calculating the mean of SparseVector RDD

2015-01-12 Thread Xiangrui Meng
No, colStats() computes all summary statistics in one pass and stores the values. It is not lazy.

On Mon, Jan 12, 2015 at 4:42 AM, Rok Roskar wrote:
> This was without using Kryo -- if I use kryo, I got errors about buffer
> overflows (see above):
>
> com.esotericsoftware.kryo.KryoException: Buffe

Re: calculating the mean of SparseVector RDD

2015-01-12 Thread Rok Roskar
This was without using Kryo -- if I use Kryo, I get errors about buffer overflows (see above):

com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 5, required: 8

Just calling colStats doesn't actually compute those statistics, does it? It looks like the computation is only carrie

Re: calculating the mean of SparseVector RDD

2015-01-09 Thread Xiangrui Meng
colStats() computes the mean values along with several other summary statistics, which makes it slower. How is the performance if you don't use kryo?

-Xiangrui

On Fri, Jan 9, 2015 at 3:46 AM, Rok Roskar wrote:
> thanks for the suggestion -- however, looks like this is even slower. With
> the smal
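To illustrate why a full summary costs more than a mean alone, here is a minimal pure-Python sketch of a one-pass column summary. It is not MLlib's implementation: `column_summary` is a hypothetical name, the rows are plain dense lists, and the variance uses Welford's update as an assumed technique. The point is only that each element triggers several updates (mean, variance, nonzero count, min, max) instead of a single addition.

```python
def column_summary(rows):
    """One pass over dense rows: per-column mean, variance, nonzeros, min, max."""
    n = 0
    mean = m2 = nnz = lo = hi = None
    for row in rows:
        if n == 0:
            d = len(row)
            mean, m2, nnz = [0.0] * d, [0.0] * d, [0] * d
            lo, hi = list(row), list(row)
        n += 1
        for j, x in enumerate(row):
            # Welford's running update for mean and sum of squared deviations.
            delta = x - mean[j]
            mean[j] += delta / n
            m2[j] += delta * (x - mean[j])
            if x != 0:
                nnz[j] += 1
            lo[j] = min(lo[j], x)
            hi[j] = max(hi[j], x)
    if n == 0:
        raise ValueError("cannot summarize an empty dataset")
    # Sample variance (divide by n - 1); defined as 0.0 for a single row.
    variance = [s / (n - 1) if n > 1 else 0.0 for s in m2]
    return {"count": n, "mean": mean, "variance": variance,
            "num_nonzeros": nnz, "min": lo, "max": hi}


summary = column_summary([[1.0, 0.0], [3.0, 0.0], [5.0, 4.0]])
```

All statistics are available as soon as the single pass finishes, which matches the eager behaviour described in this thread.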

Re: calculating the mean of SparseVector RDD

2015-01-09 Thread Rok Roskar
Thanks for the suggestion -- however, it looks like this is even slower. With the small data set I'm using, my aggregate function takes ~ 9 seconds and the colStats.mean() takes ~ 1 minute. However, I can't get it to run with the Kryo serializer -- I get the error: com.esotericsoftware.kryo.KryoExcep
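For readers following along, the aggregate-style mean mentioned here can be sketched roughly as below. This is not the poster's code, which is not shown in the thread. It is a pure-Python stand-in under stated assumptions: each sparse vector is a hypothetical `(size, {index: value})` tuple, partitions are plain lists, and `seq_op` / `comb_op` mimic the two functions one would pass to an aggregate call.

```python
from functools import reduce


def seq_op(acc, vec):
    """Fold one sparse vector into the running (dense sums, count) pair."""
    sums, count = acc
    _, entries = vec
    for i, v in entries.items():  # touch only the stored (nonzero) entries
        sums[i] += v
    return sums, count + 1


def comb_op(a, b):
    """Merge the partial results of two partitions."""
    sums_a, count_a = a
    sums_b, count_b = b
    return [x + y for x, y in zip(sums_a, sums_b)], count_a + count_b


def sparse_mean(partitions, size):
    """Dense mean of all sparse vectors across the given partitions."""
    # Each partition starts from its own fresh zero accumulator.
    partials = [reduce(seq_op, part, ([0.0] * size, 0)) for part in partitions]
    sums, count = reduce(comb_op, partials, ([0.0] * size, 0))
    if count == 0:
        raise ValueError("cannot take the mean of an empty dataset")
    return [s / count for s in sums]


# Two hypothetical partitions holding four vectors of dimension 4.
partitions = [
    [(4, {0: 1.0, 3: 2.0}), (4, {1: 4.0})],
    [(4, {0: 3.0, 3: 6.0}), (4, {})],
]
mean = sparse_mean(partitions, 4)
```

The accumulator is dense while the inputs stay sparse, so the per-vector cost scales with the number of stored entries rather than with the full dimension.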

Re: calculating the mean of SparseVector RDD

2015-01-07 Thread Xiangrui Meng
There is some serialization overhead. You can try https://github.com/apache/spark/blob/master/python/pyspark/mllib/stat.py#L107 .

-Xiangrui

On Wed, Jan 7, 2015 at 9:42 AM, rok wrote:
> I have an RDD of SparseVectors and I'd like to calculate the means returning
> a dense vector. I've tried doing