[scikit-learn] Help - Renaming features in scikit-learn's CountVectorizer()

2018-03-05 Thread Ranjana Girish
Hi all,

I have a very large pandas dataframe. Below is a sample:

   Id   description
   1    switvch for air conditioner transformer..
   2    control tfrmr...
   3    coling pad.
   4    DRLG machine
   5    hair smothing kit...

For further processing, I will construct a document-term matrix of the above data
using scikit-learn's CountVectorizer:

countvec = CountVectorizer()
documenttermmatrix = countvec.fit_transform(dataset['description'])

I have to correct misspelled words in the description column. Replacing each
wrongly spelled word with its correct spelling directly in such a large dataframe
is taking a very long time.

So I thought of correcting the features instead, using the feature list returned
by the vectorizer:

feature_names = countvec.get_feature_names()

Is it possible to rename the features using this list and then use the renamed
features for the classification step?
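
A minimal sketch of one way this could work (the spelling map and variable names
below are illustrative assumptions, not from the original post): build a mapping
from misspelled feature names to corrected ones, then merge the columns of the
document-term matrix that end up sharing the same corrected name, and use the
merged matrix for classification.

import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

countvec = CountVectorizer()
documenttermmatrix = countvec.fit_transform(dataset['description'])
feature_names = countvec.get_feature_names()   # get_feature_names_out() in newer scikit-learn

# Assumed spelling corrections, for illustration only
corrections = {'switvch': 'switch', 'tfrmr': 'transformer',
               'coling': 'cooling', 'smothing': 'smoothing'}

corrected = [corrections.get(name, name) for name in feature_names]
new_names = sorted(set(corrected))
col_of = {name: j for j, name in enumerate(new_names)}

# 0/1 matrix of shape (n_old_features, n_new_features) that sums merged columns
rows = np.arange(len(corrected))
cols = np.array([col_of[name] for name in corrected])
merge = sparse.csr_matrix((np.ones(len(rows)), (rows, cols)),
                          shape=(len(corrected), len(new_names)))

merged_matrix = documenttermmatrix @ merge   # term counts under the corrected names

merged_matrix (with new_names as its feature names) can then be passed to a
classifier in place of documenttermmatrix.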

Thanks
Ranjana


Re: [scikit-learn] Text classification of large dataset

2017-12-27 Thread Ranjana Girish
Hi all,

Thank you for your suggestions.

But I am still getting a *memory error* while doing feature selection:

fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=20)
documenttermmatrix1 = fs.fit_transform(documenttermmatrix, y1)


*documenttermmatrix* has shape *(1594516, 232832)* and is a *scipy CSR matrix*.

Am I doing anything wrong?

Is there any better way of doing feature selection?
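
One possibility (an assumption about the cause, not something confirmed here):
depending on the scikit-learn version, chi2 binarizes the labels and builds dense
intermediate arrays whose size grows with the number of classes times the number
of features, which gets very large with thousands of classes. A minimal sketch of
a chunked workaround, assuming *y1* holds the encoded labels: because the chi2
score of each feature is computed independently, the features can be scored in
column blocks and the top 20% of columns kept by slicing the sparse matrix
directly.

import numpy as np
from sklearn.feature_selection import chi2

n_features = documenttermmatrix.shape[1]
block = 20000                                    # assumed block size
scores = np.empty(n_features)
for start in range(0, n_features, block):
    stop = min(start + block, n_features)
    # chi2 returns (scores, p-values); only the scores are needed here
    scores[start:stop], _ = chi2(documenttermmatrix[:, start:stop], y1)

k = int(0.20 * n_features)                       # top 20%, matching percentile=20
keep = np.sort(np.argsort(scores)[-k:])
documenttermmatrix1 = documenttermmatrix[:, keep]   # result stays a sparse CSR matrix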


[scikit-learn] Text classification of large dataset

2017-12-19 Thread Ranjana Girish
Hi all,

I am doing text classification. I have around 10 million documents to be
classified into around 7k categories.

Below is the code I am using

# Importing the libraries
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
import re
from sklearn.feature_extraction.text import CountVectorizer
import random
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import average_precision_score
from sklearn import feature_selection
from scipy.sparse import csr_matrix
from scipy import sparse
import sys
from sklearn import preprocessing
import numpy as np
import pickle

sys.setrecursionlimit(2)

random.seed(2)


# Load and concatenate the two training CSVs
trainset1 = pd.read_csv("trainsetgrt500sample10.csv", encoding="ISO-8859-1")
trainset2 = pd.read_csv("trainsetlessequal500.csv", encoding="ISO-8859-1")

dataset = pd.concat([trainset1, trainset2])

dataset = dataset.dropna()

# Keep letters only, drop digits, and lowercase the descriptions
dataset['ProductDescription'] = dataset['ProductDescription'].str.replace('[^a-zA-Z]', ' ')
dataset['ProductDescription'] = dataset['ProductDescription'].str.replace('[\d]', ' ')
dataset['ProductDescription'] = dataset['ProductDescription'].str.lower()

del trainset1
del trainset2

# Remove English stopwords, collapse repeated whitespace, and tokenize
stop = stopwords.words('english')
lemmatizer = WordNetLemmatizer()

dataset['ProductDescription'] = dataset['ProductDescription'].str.replace(r'\b(' + r'|'.join(stop) + r')\b\s*', ' ')
dataset['ProductDescription'] = dataset['ProductDescription'].str.replace('\s\s+', ' ')
dataset['ProductDescription'] = dataset['ProductDescription'].apply(word_tokenize)

# Lemmatize each token once per POS tag (set() also removes duplicate tokens), then rejoin
ADJ, ADJ_SAT, ADV, NOUN, VERB = 'a', 's', 'r', 'n', 'v'
POS_LIST = [NOUN, VERB, ADJ, ADV]
for tag in POS_LIST:
    dataset['ProductDescription'] = dataset['ProductDescription'].apply(
        lambda x: list(set([lemmatizer.lemmatize(item, tag) for item in x])))
dataset['ProductDescription'] = dataset['ProductDescription'].apply(lambda x: " ".join(x))

# min_df=0.8 keeps only terms that appear in at least 80% of all documents
countvec = CountVectorizer(min_df=0.8)
documenttermmatrix = countvec.fit_transform(dataset['ProductDescription'])
documenttermmatrix.shape
column = countvec.get_feature_names()
filename1 = 'columnnamessample10mastermerge.sav'
pickle.dump(column, open(filename1, 'wb'))

# Encode the class labels (note: the encoded y1_train is not used when fitting below)
y_train = dataset['classpath']
y_train = dataset['classpath'].tolist()
labels_train = preprocessing.LabelEncoder()
labels_train.fit(y_train)
y1_train = labels_train.transform(y_train)

del dataset
del countvec
del column


clf = MultinomialNB()
model = clf.fit(documenttermmatrix, y_train)





filename2 = 'modelnaivebayessample10withfs.sav'
pickle.dump(model, open(filename2, 'wb'))


I am using a system with *128 GB RAM*.

As I was unable to train on all 10 million documents, I did *stratified sampling*,
which reduced the training set to 2.3 million.

I was still unable to train on the 2.3 million documents.

I got a *memory error* when I used *random forest (n_estimators=30)*, *Naive
Bayes*, and *SVM*.



I am stuck.



Can anyone please tell me whether there is a memory leak in my code, and how to
use a system with 128 GB RAM effectively?
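
Not part of the original post, but a hedged sketch of an out-of-core alternative
that avoids ever holding the full document-term matrix or vocabulary in memory:
stream the CSV in chunks, hash the (already cleaned) text into a fixed-width
sparse matrix with HashingVectorizer, and update a linear model incrementally with
partial_fit. The file name, column names, chunk size, and the assumption that the
full set of category labels is known up front are all illustrative.

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Stateless vectorizer: no vocabulary is stored, so memory per chunk stays bounded
vec = HashingVectorizer(n_features=2**18)
clf = SGDClassifier()   # linear classifier that supports incremental training

# Assumed: the full array of ~7k category labels is known before training starts
all_classes = np.array(sorted(
    pd.read_csv("trainset.csv", usecols=["classpath"])["classpath"].dropna().unique()))

for chunk in pd.read_csv("trainset.csv", encoding="ISO-8859-1", chunksize=100000):
    chunk = chunk.dropna()
    X = vec.transform(chunk["ProductDescription"].str.lower())
    clf.partial_fit(X, chunk["classpath"], classes=all_classes)

One thing to keep in mind with this many classes: a linear model stores a
coefficient matrix of roughly n_classes x n_features floats, so with ~7k classes
the hash width (2**18 here) largely determines the model's memory footprint; a
larger width reduces hash collisions but grows that matrix proportionally.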


Thanks
Ranjana