To clarify:
You have 2.3M samples.
How many features?
How many active features per sample, on average?
Are the 7k classes multiclass or multilabel?
Have you tried limiting the depth of the forest? Have you tried embedding
your feature space into a smaller vector (pre-trained embeddings, hashing,
LDA, ...)?
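To illustrate the hashing suggestion above, here is a minimal sketch (assuming scikit-learn is installed; the toy documents and `n_features` value are placeholders): `HashingVectorizer` maps text into a fixed-width sparse vector space, so the dimensionality stays bounded no matter how large the vocabulary grows.

```python
# Feature hashing: project an arbitrarily large vocabulary into a
# fixed-size vector space (sketch with toy data, not the real corpus).
from sklearn.feature_extraction.text import HashingVectorizer

docs = ["the cat sat on the mat", "dogs and cats", "text classification example"]

# n_features bounds the output dimensionality regardless of vocabulary size
vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)
X = vectorizer.transform(docs)

print(X.shape)  # sparse matrix of shape (3, 262144)
```

Because the vectorizer is stateless (no vocabulary to fit), it also combines naturally with out-of-core training.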
Ranjana,
have a look at this example
http://scikit-learn.org/stable/auto_examples/applications/plot_out_of_core_classification.html
Since you have a lot of RAM, you may not need to make the whole
classification pipeline out-of-core; a start with your current code
could be to write a generator.
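A minimal sketch of that generator idea (assuming scikit-learn; the file layout, batch size, and toy data are hypothetical stand-ins for the real 10M-document stream): yield minibatches of `(texts, labels)` and train incrementally with `HashingVectorizer` plus `SGDClassifier.partial_fit`, in the spirit of the linked example.

```python
# Out-of-core training sketch: stream minibatches through partial_fit.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

def iter_minibatches(pairs, batch_size):
    """Yield (texts, labels) batches from any iterable of (text, label)."""
    texts, labels = [], []
    for text, label in pairs:
        texts.append(text)
        labels.append(label)
        if len(texts) == batch_size:
            yield texts, labels
            texts, labels = [], []
    if texts:
        yield texts, labels

# toy data standing in for the real document stream
data = [("good movie", "pos"), ("bad film", "neg")] * 50
all_classes = ["pos", "neg"]  # partial_fit needs the full label set up front

vec = HashingVectorizer(n_features=2**18, alternate_sign=False)
clf = SGDClassifier()
for texts, labels in iter_minibatches(data, batch_size=20):
    clf.partial_fit(vec.transform(texts), labels, classes=all_classes)
```

Only one minibatch is ever held in memory, so the same loop works whether the pairs come from a list, a CSV reader, or a database cursor.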
Hi all,
I am doing text classification. I have around 10 million documents to be
classified into around 7k categories.
Below is the code I am using:
# Importing the libraries
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from