Re: [scikit-learn] Text classification of large dataset

2017-12-20 Thread Joel Nothman
To clarify:
You have 2.3M samples.
How many features?
How many active features on average per sample?
With 7k classes: is the problem multiclass or multilabel?

Have you tried limiting the depth of the forest? Have you tried embedding
your feature space into a smaller vector (pre-trained embeddings, hashing,
LDA, PCA or random projection)?
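
For instance, something along these lines (only a rough sketch, not a
tuned setup; the hashing size, projection dimension, tree depth and
variable names are arbitrary placeholders):

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.random_projection import SparseRandomProjection
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(
    # hashing avoids storing a vocabulary and caps the dimensionality
    HashingVectorizer(n_features=2**18),
    # project into a much smaller space before growing any trees
    SparseRandomProjection(n_components=300, random_state=0),
    # max_depth bounds the size (and memory footprint) of each tree
    RandomForestClassifier(n_estimators=30, max_depth=20, n_jobs=-1),
)
# pipe.fit(texts, labels)   # texts: iterable of str, labels: array-like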
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Text classification of large dataset

2017-12-20 Thread Roman Yurchak

Ranjana,

Have a look at this example:
http://scikit-learn.org/stable/auto_examples/applications/plot_out_of_core_classification.html
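
That example streams mini-batches of documents through a stateless
HashingVectorizer and a classifier's partial_fit. A minimal sketch of
the same idea (the file name, chunk size and column names below are
only assumptions based on your code):

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

path = "trainset.csv"  # hypothetical CSV with the columns from your code

def iter_minibatches(path, chunksize=10000):
    # read the CSV in chunks so the raw text never sits in memory all at once
    for chunk in pd.read_csv(path, encoding="ISO-8859-1", chunksize=chunksize):
        chunk = chunk.dropna(subset=['ProductDescription', 'classpath'])
        yield chunk['ProductDescription'].str.lower(), chunk['classpath']

# HashingVectorizer is stateless: no fit pass, no vocabulary kept in RAM
vectorizer = HashingVectorizer(n_features=2**20)
clf = SGDClassifier(loss='log')

# partial_fit needs the full set of classes on the first call
classes = np.unique(pd.read_csv(path, usecols=['classpath'])['classpath'].dropna())

for texts, labels in iter_minibatches(path):
    X = vectorizer.transform(texts)
    clf.partial_fit(X, labels, classes=classes)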


Since you have a lot of RAM, you may not need to make the whole
classification pipeline out-of-core. A start with your current code
could be to write a generator that loads and pre-processes the text in
chunks, then feed it one document at a time to CountVectorizer.fit (it
accepts an iterable). To reduce memory usage, filtering out very
frequent tokens (e.g. with max_df, rather than only the infrequent
ones) could help too. Make sure you L2-normalize your data before the
classifier. You could use SGDClassifier(loss='log') or
LogisticRegression with the sag or saga solver. The
multi_class="multinomial" parameter might also be worth trying,
particularly since you have so many classes.
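
A rough sketch of the above, assuming a single merged CSV, the column
names from your code, no missing labels, and an arbitrary chunk size:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize
from sklearn.linear_model import LogisticRegression

path = "trainset.csv"  # hypothetical merged file

def iter_texts(path, chunksize=50000):
    # generator: load and pre-process the text in chunks, yield one document at a time
    for chunk in pd.read_csv(path, encoding="ISO-8859-1", chunksize=chunksize):
        # fillna (rather than dropna) keeps rows aligned with the label column
        for doc in chunk['ProductDescription'].fillna('').str.lower():
            yield doc

# max_df drops very frequent tokens; fit and transform accept any iterable of strings
vec = CountVectorizer(max_df=0.5, min_df=5)
vec.fit(iter_texts(path))
X = vec.transform(iter_texts(path))   # second pass over the data

X = normalize(X, norm='l2', copy=False)   # L2-normalize rows before the classifier

y = pd.read_csv(path, usecols=['classpath'])['classpath']
clf = LogisticRegression(solver='saga', multi_class='multinomial')
clf.fit(X, y)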


--
Roman

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


[scikit-learn] Text classification of large dataset

2017-12-19 Thread Ranjana Girish
Hi all,

I am doing text classification. I have around 10 million documents to be
classified into around 7k categories.

Below is the code I am using:

# Importing the libraries
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
import re
from sklearn.feature_extraction.text import CountVectorizer
import random
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import average_precision_score
from sklearn import feature_selection
from scipy.sparse import csr_matrix
from scipy import sparse
import sys
from sklearn import preprocessing
import numpy as np
import pickle

sys.setrecursionlimit(2)

random.seed(2)


trainset1 = pd.read_csv("trainsetgrt500sample10.csv", encoding="ISO-8859-1")
trainset2 = pd.read_csv("trainsetlessequal500.csv", encoding="ISO-8859-1")

dataset = pd.concat([trainset1, trainset2])

dataset = dataset.dropna()

dataset['ProductDescription'] = dataset['ProductDescription'].str.replace('[^a-zA-Z]', ' ')
dataset['ProductDescription'] = dataset['ProductDescription'].str.replace('[\d]', ' ')
dataset['ProductDescription'] = dataset['ProductDescription'].str.lower()

del trainset1
del trainset2

stop = stopwords.words('english')
lemmatizer = WordNetLemmatizer()

dataset['ProductDescription'] = dataset['ProductDescription'].str.replace(r'\b(' + r'|'.join(stop) + r')\b\s*', ' ')
dataset['ProductDescription'] = dataset['ProductDescription'].str.replace('\s\s+', ' ')
dataset['ProductDescription'] = dataset['ProductDescription'].apply(word_tokenize)
ADJ, ADJ_SAT, ADV, NOUN, VERB = 'a', 's', 'r', 'n', 'v'
POS_LIST = [NOUN, VERB, ADJ, ADV]
for tag in POS_LIST:
    dataset['ProductDescription'] = dataset['ProductDescription'].apply(
        lambda x: list(set([lemmatizer.lemmatize(item, tag) for item in x])))
dataset['ProductDescription'] = dataset['ProductDescription'].apply(lambda x: " ".join(x))

countvec = CountVectorizer(min_df=0.8)
documenttermmatrix = countvec.fit_transform(dataset['ProductDescription'])
documenttermmatrix.shape
column = countvec.get_feature_names()
filename1 = 'columnnamessample10mastermerge.sav'
pickle.dump(column, open(filename1, 'wb'))

y_train = dataset['classpath']
y_train = dataset['classpath'].tolist()
labels_train = preprocessing.LabelEncoder()
labels_train.fit(y_train)
y1_train = labels_train.transform(y_train)

del dataset
del countvec
del column


clf = MultinomialNB()
model = clf.fit(documenttermmatrix, y_train)


filename2 = 'modelnaivebayessample10withfs.sav'
pickle.dump(model, open(filename2, 'wb'))


I am using a system with 128 GB RAM.

As I was unable to train on all 10 million documents, I did stratified
sampling, which reduced the training set to 2.3 million.

Still, I was unable to train on the 2.3 million documents.

I got a memory error when I used random forest (n_estimators=30), Naive
Bayes, and SVM.



I am stuck.



Can anyone please tell me whether there is a memory leak in my code, and
how to use a system with 128 GB RAM effectively?


Thanks
Ranjana
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn