Not sure if this is always ideal for Naive Bayes, but you could also hash the
features into a lower-dimensional space (e.g. reduce them to 50,000 features).
For each feature, simply take MurmurHash3(featureID) % 50000.
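
A rough sketch of what that could look like in Scala (hashFeatures and
numBuckets are illustrative names made up here, not an MLlib API):

import scala.util.hashing.MurmurHash3
import org.apache.spark.mllib.linalg.{SparseVector, Vector, Vectors}

// Hash each original feature index into one of numBuckets slots; values
// of colliding features are summed, the usual hashing-trick behavior.
def hashFeatures(v: SparseVector, numBuckets: Int = 50000): Vector = {
  val hashed = v.indices.zip(v.values)
    .map { case (i, value) =>
      // Mask the sign bit so the modulo is non-negative.
      ((MurmurHash3.stringHash(i.toString) & Int.MaxValue) % numBuckets, value)
    }
    .groupBy(_._1)
    .mapValues(_.map(_._2).sum)
    .toSeq
  Vectors.sparse(numBuckets, hashed)
}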

Matei

On Apr 27, 2014, at 11:24 PM, DB Tsai <dbt...@stanford.edu> wrote:

> A customer asked us a year ago to implement a Naive Bayes that could at 
> least train news20, and we implemented it for them in Hadoop, using the 
> distributed cache to store the model.
> 
> 
> Sincerely,
> 
> DB Tsai
> -------------------------------------------------------
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
> 
> 
> On Sun, Apr 27, 2014 at 11:03 PM, Xiangrui Meng <men...@gmail.com> wrote:
> How big is your problem and how many labels? -Xiangrui
> 
> On Sun, Apr 27, 2014 at 10:28 PM, DB Tsai <dbt...@stanford.edu> wrote:
> > Hi Xiangrui,
> >
> > We also ran into this issue at Alpine Data Labs. We ended up using an LRU
> > cache to store the counts and spilling the least-used counts to the
> > distributed cache in HDFS.
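> >
> > A minimal sketch of the LRU part (illustrative only, using
> > java.util.LinkedHashMap's access-order mode; the spill to HDFS is left
> > as a callback, not the actual Alpine implementation):
> >
> > import java.util.{LinkedHashMap => JLinkedHashMap, Map => JMap}
> >
> > // Keep at most maxEntries counts in memory; the least recently used
> > // entry is handed to onEvict (e.g. a writer to the HDFS-backed cache).
> > class LruCounts(maxEntries: Int, onEvict: (String, Long) => Unit)
> >   extends JLinkedHashMap[String, Long](16, 0.75f, true) { // true = access order
> >
> >   override def removeEldestEntry(eldest: JMap.Entry[String, Long]): Boolean = {
> >     if (size() > maxEntries) { onEvict(eldest.getKey, eldest.getValue); true }
> >     else false
> >   }
> > }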
> >
> >
> > Sincerely,
> >
> > DB Tsai
> > -------------------------------------------------------
> > My Blog: https://www.dbtsai.com
> > LinkedIn: https://www.linkedin.com/in/dbtsai
> >
> >
> > On Sun, Apr 27, 2014 at 7:34 PM, Xiangrui Meng <men...@gmail.com> wrote:
> >>
> >> Even if the features are sparse, the conditional probabilities are stored
> >> in a dense matrix. With 200 labels and 2 million features, you need to
> >> store at least 4e8 doubles (about 3.2 GB) on the driver node. With
> >> multiple partitions, you may need even more memory on the driver. Could
> >> you try reducing the number of partitions and giving the driver more RAM
> >> to see whether that helps? -Xiangrui
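> >>
> >> A rough sketch of the partition change (the partition count and memory
> >> size below are arbitrary placeholders, and `training` stands in for your
> >> RDD[LabeledPoint]):
> >>
> >> import org.apache.spark.mllib.classification.NaiveBayes
> >>
> >> // Fewer partitions means fewer per-partition aggregates shipped to
> >> // the driver during training.
> >> val model = NaiveBayes.train(training.coalesce(8), lambda = 1.0)
> >> // and launch with more driver memory, e.g.:
> >> //   spark-submit --driver-memory 8g ...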
> >>
> >> On Sun, Apr 27, 2014 at 3:33 PM, John King <usedforprinting...@gmail.com>
> >> wrote:
> >> > I'm already using the SparseVector class.
> >> >
> >> > ~200 labels
> >> >
> >> >
> >> > On Sun, Apr 27, 2014 at 12:26 AM, Xiangrui Meng <men...@gmail.com>
> >> > wrote:
> >> >>
> >> >> How many labels does your dataset have? -Xiangrui
> >> >>
> >> >> On Sat, Apr 26, 2014 at 6:03 PM, DB Tsai <dbt...@stanford.edu> wrote:
> >> >> > Which version of MLlib are you using? For Spark 1.0, MLlib will
> >> >> > support sparse feature vectors, which will improve performance a lot
> >> >> > when computing the distance between points and centroids.
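> >> >> >
> >> >> > For example, a vector with 2 million dimensions and only two
> >> >> > non-zeros (just an illustration of the API):
> >> >> >
> >> >> > import org.apache.spark.mllib.linalg.Vectors
> >> >> > import org.apache.spark.mllib.regression.LabeledPoint
> >> >> >
> >> >> > // Only the non-zero entries are stored.
> >> >> > val v = Vectors.sparse(2000000, Array(3, 17), Array(1.0, 2.0))
> >> >> > val example = LabeledPoint(0.0, v)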
> >> >> >
> >> >> > Sincerely,
> >> >> >
> >> >> > DB Tsai
> >> >> > -------------------------------------------------------
> >> >> > My Blog: https://www.dbtsai.com
> >> >> > LinkedIn: https://www.linkedin.com/in/dbtsai
> >> >> >
> >> >> >
> >> >> > On Sat, Apr 26, 2014 at 5:49 AM, John King
> >> >> > <usedforprinting...@gmail.com> wrote:
> >> >> >> I'm just wondering: are the SparseVector calculations really taking
> >> >> >> sparsity into account, or just converting to dense?
> >> >> >>
> >> >> >>
> >> >> >> On Fri, Apr 25, 2014 at 10:06 PM, John King
> >> >> >> <usedforprinting...@gmail.com>
> >> >> >> wrote:
> >> >> >>>
> >> >> >>> I've been trying to use the Naive Bayes classifier. Each example in
> >> >> >>> the dataset has about 2 million features, only about 20-50 of which
> >> >> >>> are non-zero, so the vectors are very sparse. I keep running out of
> >> >> >>> memory, though, even for about 1,000 examples on 30 GB of RAM, while
> >> >> >>> the entire dataset is 4 million examples. I would also like to note
> >> >> >>> that I'm using the sparse vector class.
> >> >> >>
> >> >> >>
> >> >
> >> >
> >
> >
> 
