Thanks a lot, Nick and Olivier. To answer your questions:
> - how many samples?
>
About 1 billion rows.
> - how many features?
>
It will depend on the nature of the analyses. Many of the categorical
variables have taxonomies that can be used to reduce cardinality. Sometimes
I'll want to use these, other times not. I would say somewhere between 100
and 100,000.
> - how many non-zero features per sample?
>
When using many features there will be a lot of zeros; when using few, less
so. Say 10 non-zero features when using 100 features and 90,000 when using
100,000 features.
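
As a rough back-of-the-envelope check of what those figures mean for memory, the sketch below estimates the size of a SciPy CSR matrix. The helper name csr_bytes, the 8-byte float values and the rule for choosing 4- or 8-byte indices are assumptions for illustration, not measured numbers.

```python
def csr_bytes(n_rows, nnz_per_row, value_bytes=8):
    """Approximate in-memory size of a CSR matrix (assumed layout).

    Assumes one value and one column index per non-zero, plus a row
    pointer array of n_rows + 1 entries. Indices are assumed to take
    4 bytes unless the non-zero count overflows a signed 32-bit int.
    """
    nnz = n_rows * nnz_per_row
    index_bytes = 4 if nnz < 2**31 else 8
    return nnz * (value_bytes + index_bytes) + (n_rows + 1) * index_bytes


n_rows = 1_000_000_000  # about 1 billion rows, as stated above

few = csr_bytes(n_rows, 10)       # 10 non-zeros per row
many = csr_bytes(n_rows, 90_000)  # 90,000 non-zeros per row

print("10 non-zeros/row:     %.0f GB" % (few / 1e9))
print("90,000 non-zeros/row: %.2f PB" % (many / 1e15))
```

Under these assumptions the first case comes to roughly 168 GB and the second to roughly 1.44 PB, so only the first is plausible on a single large machine.
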
> - how many target classes? Are they balanced?
>
It's hard to say because I'm working on different problems with the same
data (including unsupervised and semi-supervised problems). The main problem
has a single binary target, and in this case the target variable is
not balanced.
> What type of data? Just floats? Binary indicator for categorical
> features, if so what is the range of cardinality for those categorical
> features?
>
All of the above. There are different categorical variables
with cardinality anywhere between 2 and 200,000.
I would also like to try NMF on the data. Do you think the scikit-learn
implementation could work with 100,000 sparse features on 1 billion rows?
Regards,
Helge
On Fri, Aug 23, 2013 at 12:37 PM, Olivier Grisel wrote:
> 2013/8/23 [email protected] :
> > Good day,
> >
> > Can anyone perhaps give me an idea of how large datasets scikit-learn
> > algorithms typically can handle?
> >
> > I have about 4 TB of structured data. I might be able to normalize that
> > down to say 1 TB if necessary. The tasks would typically be logistic
> > regression, Naive Bayes, k-Means and possibly others.
> >
> > Will scikit-learn algorithms be able to handle this on a fairly powerful
> > hardware setup?
> >
> > At which point does it become necessary to look at distributed ML
> > platforms, e.g. Mahout, instead?
>
> It really depends on what you are doing:
>
> - how many samples?
> - how many features?
> - how many non-zero features per sample?
> - how many target classes? Are they balanced?
>
> What type of data? Just floats? Binary indicator for categorical
> features, if so what is the range of cardinality for those categorical
> features?
>
> If the number of features is fixed and small-ish (let's say less than
> one million), then you can train out-of-core models such as
> SGDClassifier (with loss='log' for getting a logistic regression
> model), MultinomialNB or BernoulliNB (in 0.14 you now have partial_fit
> for those guys too) or MiniBatchKMeans for unsupervised clustering.
>
> You can have a look at this example for instance:
>
> http://scikit-learn.org/dev/auto_examples/applications/plot_out_of_core_classification.html#example-applications-plot-out-of-core-classification-py
>
> However it's very likely that simple models like linear models and
> naive bayes won't benefit from more than a couple of tens of millions
> of samples if the number of possible features is small.
>
> So you should first try to learn models on random subsets of your data
> that fit in memory and measure the predictive accuracy of the model.
> Then try again with a slightly bigger dataset (twice as big, but one
> that still fits in memory) and measure the performance again: if you
> don't see an improvement, it means that you don't need that much data.
>
> If you really need to distribute computation on a cluster you can
> train out-of-core linear models on a Spark cluster with PySpark as
> Nick explained.
>
> --
> Olivier
>
>
> --
> Introducing Performance Central, a new site from SourceForge and
> AppDynamics. Performance Central is your source for news, insights,
> analysis and resources for efficient Application Performance Management.
> Visit us today!
> http://pubads.g.doubleclick.net/gampad/clk?id=48897511&iu=/4140/ostg.clktrk
> ___
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>