No, colStats() computes all summary statistics in one pass and stores
the values. It is not lazy.
On Mon, Jan 12, 2015 at 4:42 AM, Rok Roskar wrote:
> This was without using Kryo -- if I use kryo, I got errors about buffer
> overflows (see above):
>
> com.esotericsoftware.kryo.KryoException: Buffe
This was without using Kryo -- if I use Kryo, I get errors about buffer
overflows (see above):
com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 5,
required: 8
Just calling colStats doesn't actually compute those statistics, does it?
It looks like the computation is only carrie
colStats() computes the mean values along with several other summary
statistics, which makes it slower. How is the performance if you don't
use kryo? -Xiangrui
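
For comparison, a mean-only aggregation of the kind Rok describes could be sketched roughly as below. This is not the code from the thread: `seq_op`, `comb_op`, `column_means`, and the `SV` stand-in are hypothetical names, and `SV` only mimics the `.indices` and `.values` attributes of pyspark.mllib.linalg.SparseVector so the logic can be checked locally without a cluster. Because it tracks only per-column sums and a count, it does less work per vector than colStats(), which also accumulates the other summary statistics.

```python
from collections import namedtuple
from functools import reduce

# Hypothetical local stand-in for pyspark.mllib.linalg.SparseVector;
# the real class also exposes .indices and .values.
SV = namedtuple("SV", ["size", "indices", "values"])


def seq_op(acc, vec):
    """Fold one sparse vector into the (sums, count) accumulator."""
    sums, count = acc
    # Only the stored (non-zero) entries are touched.
    for i, v in zip(vec.indices, vec.values):
        sums[i] += v
    return sums, count + 1


def comb_op(a, b):
    """Merge two per-partition (sums, count) accumulators."""
    sums = [x + y for x, y in zip(a[0], b[0])]
    return sums, a[1] + b[1]


def column_means(acc):
    """Turn the final accumulator into a dense list of column means."""
    sums, count = acc
    return [s / count for s in sums]


dim = 4
vectors = [SV(dim, [0, 2], [1.0, 3.0]), SV(dim, [2, 3], [5.0, 2.0])]

# Simulate two partitions locally, each starting from its own zero value.
part1 = reduce(seq_op, vectors[:1], ([0.0] * dim, 0))
part2 = reduce(seq_op, vectors[1:], ([0.0] * dim, 0))
means = column_means(comb_op(part1, part2))

# Assumed usage with a real RDD of SparseVectors:
#   acc = rdd.aggregate(([0.0] * dim, 0), seq_op, comb_op)
#   means = column_means(acc)
```

The accumulator is a plain Python list here to keep the sketch dependency-free; a NumPy array would be the more natural choice in practice. An empty RDD would leave the count at zero, so the division in `column_means` would need a guard.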
On Fri, Jan 9, 2015 at 3:46 AM, Rok Roskar wrote:
> thanks for the suggestion -- however, looks like this is even slower. With
> the smal
thanks for the suggestion -- however, looks like this is even slower. With
the small data set I'm using, my aggregate function takes ~ 9 seconds and
the colStats.mean() takes ~ 1 minute. However, I can't get it to run with
the Kryo serializer -- I get the error:
com.esotericsoftware.kryo.KryoExcep
There is some serialization overhead. You can try
https://github.com/apache/spark/blob/master/python/pyspark/mllib/stat.py#L107
. -Xiangrui
On Wed, Jan 7, 2015 at 9:42 AM, rok wrote:
> I have an RDD of SparseVectors and I'd like to calculate the means returning
> a dense vector. I've tried doing