Re: [scikit-learn] Text classification of large dataet

2017-12-20 Thread Joel Nothman
To clarify: You have 2.3M samples. How many features? How many active features on average per sample? In 7k classes: multiclass or multilabel? Have you tried limiting the depth of the forest? Have you tried embedding your feature space into a smaller vector (pre-trained embeddings, hashing, LDA,
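
As a rough illustration of the "hashing" option mentioned above, the sketch below maps tokens into a fixed number of buckets so the feature space stays bounded no matter how large the vocabulary grows. The function name `hash_features`, the bucket count, and the use of `zlib.crc32` are assumptions made for this example and are not taken from the thread; scikit-learn's `HashingVectorizer` offers a production version of the same idea.

```python
import zlib
from collections import Counter


def hash_features(tokens, n_buckets=2**18):
    """Count tokens per hash bucket (hypothetical helper, hashing trick)."""
    counts = Counter()
    for token in tokens:
        # crc32 is deterministic across runs, unlike Python's built-in hash()
        index = zlib.crc32(token.encode("utf-8")) % n_buckets
        counts[index] += 1
    return dict(counts)


# Six tokens are folded into at most 16 buckets; collisions are possible.
vec = hash_features("the cat sat on the mat".split(), n_buckets=16)
```

Because no vocabulary has to be stored, the same function can be applied to any batch of documents independently, at the cost of occasional collisions between unrelated tokens.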

Re: [scikit-learn] Text classification of large dataet

2017-12-20 Thread Roman Yurchak
Ranjana, have a look at this example: http://scikit-learn.org/stable/auto_examples/applications/plot_out_of_core_classification.html Since you have a lot of RAM, you may not need to make the whole classification pipeline out-of-core; a start with your current code could be to write a generator
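
A minimal sketch of the kind of generator suggested above, assuming the data can be read as a stream of (text, label) pairs. The name `iter_minibatches`, the batch size, and the toy stream are illustrative assumptions, not code from the thread.

```python
from itertools import islice


def iter_minibatches(records, batch_size):
    """Yield successive lists of at most batch_size items from any iterable."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical stand-in for (text, label) pairs read lazily from disk.
sample_stream = ((f"document {i}", i % 3) for i in range(10))
batch_sizes = [len(batch) for batch in iter_minibatches(sample_stream, 4)]
```

In the linked out-of-core example, each such batch is transformed with a stateless `HashingVectorizer` and passed to a classifier's `partial_fit`, so only one batch needs to be held in memory at a time.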

[scikit-learn] Text classification of large dataet

2017-12-19 Thread Ranjana Girish
Hi all, I am doing text classification. I have around 10 million records to be classified into around 7k categories. Below is the code I am using: *# Importing the libraries* *import pandas as pd* *import nltk* *from nltk.corpus import stopwords* *from nltk.tokenize import word_tokenize* *from