How many labels does your dataset have?

-Xiangrui

On Sat, Apr 26, 2014 at 6:03 PM, DB Tsai <dbt...@stanford.edu> wrote:
> Which version of mllib are you using? For Spark 1.0, mllib will
> support sparse feature vector which will improve performance a lot
> when computing the distance between points and centroid.
>
> Sincerely,
>
> DB Tsai
> -------------------------------------------------------
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
>
>
> On Sat, Apr 26, 2014 at 5:49 AM, John King <usedforprinting...@gmail.com>
> wrote:
>> I'm just wondering are the SparkVector calculations really taking into
>> account the sparsity or just converting to dense?
>>
>>
>> On Fri, Apr 25, 2014 at 10:06 PM, John King <usedforprinting...@gmail.com>
>> wrote:
>>>
>>> I've been trying to use the Naive Bayes classifier. Each example in the
>>> dataset is about 2 million features, only about 20-50 of which are non-zero,
>>> so the vectors are very sparse. I keep running out of memory though, even
>>> for about 1000 examples on 30gb RAM while the entire dataset is 4 million
>>> examples. And I would also like to note that I'm using the sparse vector
>>> class.
>>
>>
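
For reference, here is a rough back-of-envelope sketch of why both the dense-versus-sparse question and the label count could matter for memory at the sizes quoted in this thread. It is plain Python arithmetic, not MLlib code. The byte sizes (8-byte doubles, 4-byte integer indices) and the helper names are assumptions made for illustration, JVM object overhead is ignored, and the label count of 100 is hypothetical. The model estimate assumes one dense row of per-feature parameters per label, which is the usual layout for Naive Bayes but is not confirmed anywhere in this thread.

```python
# Back-of-envelope memory estimate for the numbers quoted in the thread.
# Assumptions: 8-byte doubles, 4-byte int indices, no object overhead.
BYTES_PER_DOUBLE = 8
BYTES_PER_INT = 4


def dense_bytes(num_features):
    """Bytes for one dense vector: one double per feature."""
    return num_features * BYTES_PER_DOUBLE


def sparse_bytes(num_nonzero):
    """Bytes for one sparse vector: one (index, value) pair per non-zero."""
    return num_nonzero * (BYTES_PER_INT + BYTES_PER_DOUBLE)


def model_bytes(num_labels, num_features):
    """Bytes for a dense labels-by-features parameter matrix (assumed layout)."""
    return num_labels * dense_bytes(num_features)


num_features = 2_000_000   # from the thread
num_examples = 1_000       # from the thread
nonzeros = 50              # upper end of the 20-50 range in the thread
hypothetical_labels = 100  # illustrative only, not from the thread

dense_total = num_examples * dense_bytes(num_features)
sparse_total = num_examples * sparse_bytes(nonzeros)
model_total = model_bytes(hypothetical_labels, num_features)

print("1000 examples, dense :", dense_total, "bytes")
print("1000 examples, sparse:", sparse_total, "bytes")
print("model, 100 labels    :", model_total, "bytes")
```

Under these assumptions, 1000 examples held densely would need on the order of 16 GB, while the same examples held sparsely would need well under 1 MB. So if the vectors were being densified somewhere, that alone could explain running out of memory. Separately, a dense parameter row per label grows linearly with the number of labels, which is one reason the label count is worth knowing.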