[scikit-learn] help-Renaming features in scikit-learn's CountVectorizer()

2018-03-05 Thread Ranjana Girish
Hi all,

I have a very large pandas DataFrame. Below is a sample:

*Id  description*
1    switvch for air conditioner transformer..
2    control tfrmr...
3    coling pad.
4    DRLG machine
5    hair smothing kit...

For further processing, I construct a document-term matrix from the above
data using scikit-learn's CountVectorizer:

*countvec = CountVectorizer()*
*documenttermmatrix = countvec.fit_transform(dataset['description'])*

I have to correct the misspelled words in the description column. Replacing
each wrongly spelled word with the correct spelling row by row takes too
much time on a large dataset.

So I thought of correcting the features instead, using the feature list
that CountVectorizer returns:

*features_names = countvec.get_feature_names()*

*Is it possible to rename features using the above list and then use the
result for classification?*
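
A minimal sketch of one way the renaming could work, assuming a hand-built dictionary of corrections: instead of editing the text row by row, the columns of the document-term matrix whose corrected names coincide are summed with a sparse mapping matrix. The helper merge_features, the corrections dictionary, and the sample documents are hypothetical. The feature names are read from the fitted vocabulary_ attribute so the sketch does not depend on which get_feature_names variant the installed scikit-learn version provides.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer


def merge_features(X, names, corrections):
    """Sum the columns of X whose corrected names coincide (hypothetical helper)."""
    fixed = [corrections.get(name, name) for name in names]
    new_names = sorted(set(fixed))
    column_of = {name: j for j, name in enumerate(new_names)}
    rows = np.arange(len(names))
    cols = np.array([column_of[name] for name in fixed])
    # Mapping matrix: one row per old feature, a single 1 in its new column.
    mapping = sparse.csr_matrix(
        (np.ones(len(names), dtype=X.dtype), (rows, cols)),
        shape=(len(names), len(new_names)),
    )
    # Sparse product keeps the result sparse and adds up merged columns.
    return X @ mapping, new_names


# Toy documents, loosely modelled on the sample above.
docs = ["control tfrmr", "transformer control", "coling pad", "cooling pad"]

countvec = CountVectorizer()
X = countvec.fit_transform(docs)

# Feature names in column order, taken from the fitted vocabulary.
names = sorted(countvec.vocabulary_, key=countvec.vocabulary_.get)

# Assumed corrections; building this dictionary is the manual part.
corrections = {"tfrmr": "transformer", "coling": "cooling"}

X_merged, merged_names = merge_features(X, names, corrections)
print(merged_names)
print(X_merged.toarray())
```

The merged matrix can be passed to a classifier like any other document-term matrix. The same corrections would have to be applied to new data at prediction time, and the approach only covers corrections of whole tokens.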

scikit-learn mailing list

Re: [scikit-learn] Text classification of large dataset

2017-12-27 Thread Ranjana Girish
Hi all,

Thank you for your suggestions.

But I am still getting a *memory error* while doing feature selection:

*fs = feature_selection.SelectPercentile(feature_selection.chi2,
*documenttermmatrix1 = fs.fit_transform(documenttermmatrix,y1)*

*documenttermmatrix* has shape *(1594516, 232832)*
and its type is *scipy CSR matrix*.

Am I doing anything wrong?

Is there any better way of doing feature selection?
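
A back-of-the-envelope estimate may help locate where the memory goes. The sketch below computes the size of a few dense arrays, assuming 8-byte values and roughly 7k categories (the figure from the 2017-12-19 message). Whether the chi2 scoring actually materializes arrays of these shapes depends on the scikit-learn version, so this is a possible explanation, not a diagnosis. The helper dense_gib is hypothetical.

```python
def dense_gib(rows, cols, itemsize=8):
    """Size in GiB of a dense rows x cols array with itemsize-byte values."""
    return rows * cols * itemsize / 2**30


n_samples, n_features = 1594516, 232832  # shape reported above
n_classes = 7000  # assumption: "around 7k" categories

# Per-class feature totals, one row per class.
print(f"classes x features: {dense_gib(n_classes, n_features):.1f} GiB")
# One-hot label indicator, if it is stored densely.
print(f"samples x classes:  {dense_gib(n_samples, n_classes):.1f} GiB")
# The document-term matrix itself, if it were ever converted to dense.
print(f"samples x features: {dense_gib(n_samples, n_features):.1f} GiB")
```

Under these assumptions the first two arrays come to roughly 12 GiB and 83 GiB, and the fully dense matrix runs to terabytes, so any step that converts the input to a dense array cannot fit in 128 GB.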

[scikit-learn] Text classification of large dataset

2017-12-19 Thread Ranjana Girish
Hi all,

I am doing text classification. I have around 10 million records to
classify into around 7k categories.

Below is the code I am using:

*# Importing the libraries*
*import pandas as pd*
*import nltk*
*from nltk.corpus import stopwords*
*from nltk.tokenize import word_tokenize*
*from nltk.stem.wordnet import WordNetLemmatizer*
*from nltk.stem.porter import PorterStemmer*
*import re*
*from sklearn.feature_extraction.text import CountVectorizer*
*import random*
*from sklearn.naive_bayes import MultinomialNB,GaussianNB*
*from sklearn.metrics import accuracy_score*
*from sklearn.metrics import precision_recall_curve*
*from sklearn.metrics import average_precision_score*
*from sklearn import feature_selection*
*from scipy.sparse import csr_matrix*
*from scipy import sparse*
*import sys*
*from sklearn import preprocessing*
*import numpy as np*
*import pickle*



*trainset1=pd.read_csv("trainsetgrt500sample10.csv",encoding =
*trainset2=pd.read_csv("trainsetlessequal500.csv",encoding = "ISO-8859-1")*



' ')*
' ')*

*del trainset1*
*del trainset2  *

*stop = stopwords.words('english')*
*lemmatizer = WordNetLemmatizer()*

+ r'|'.join(stop) + r')\b\s*', ' ')*
*ADJ, ADJ_SAT, ADV, NOUN, VERB = 'a', 's', 'r', 'n', 'v'*
*for tag in POS_LIST:*
*dataset['ProductDescription'] =
dataset['ProductDescription'].apply(lambda x:
list(set([lemmatizer.lemmatize(item,tag) for item in x])))*
*dataset['ProductDescription']=dataset['ProductDescription'].apply(lambda x
: " ".join(x))*

*countvec = CountVectorizer(min_df=0.8)*
*filename1 = 'columnnamessample10mastermerge.sav'*
*pickle.dump(column, open(filename1, 'wb'))*

*labels_train= preprocessing.LabelEncoder()*

*del dataset*
*del countvec*
*del column*

*clf = MultinomialNB()*

*filename2 = 'modelnaivebayessample10withfs.sav'*
*pickle.dump(model, open(filename2, 'wb'))*

I am using a system with *128 GB RAM*.

As I was unable to train on all 10 million records, I did *stratified
sampling*, which reduced the training set to 2.3 million records.
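
A minimal sketch of per-category sampling with pandas, for illustration only. It assumes that categories above a size threshold are down-sampled by a fixed fraction while smaller categories are kept whole. The helper name, the "category" column, the threshold, and the fraction are all hypothetical; only the ProductDescription column name comes from the code above.

```python
import pandas as pd


def sample_large_categories(df, label_col, threshold, frac, seed=0):
    """Down-sample categories larger than threshold; keep the rest whole."""
    # Size of the category each row belongs to.
    sizes = df[label_col].map(df[label_col].value_counts())
    small = df[sizes <= threshold]
    large = df[sizes > threshold]
    # Draw the same fraction from every large category.
    sampled = large.groupby(label_col).sample(frac=frac, random_state=seed)
    return pd.concat([sampled, small], ignore_index=True)


# Toy data: one category with 20 rows, one with 3 rows.
df = pd.DataFrame({
    "ProductDescription": [f"item {i}" for i in range(23)],
    "category": ["a"] * 20 + ["b"] * 3,
})

subset = sample_large_categories(df, "category", threshold=5, frac=0.5)
print(subset["category"].value_counts().to_dict())
```

Sampling within each category keeps rare categories represented, which a plain random sample of the whole table would not guarantee.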

Even so, I was unable to train on the 2.3 million records.

I got a *memory error* when I used *random forest (n_estimators=30)*, *Naive
Bayes* and *SVM*.

*I am stuck.*

*Can anyone please tell me whether there is a memory leak in my code, and how
to use a system with 128 GB RAM effectively?*
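
One commonly suggested way to bound memory is out-of-core training: vectorize and fit one batch at a time, so the full matrix never has to exist. The sketch below is a minimal illustration, not a fix for the code above. It swaps CountVectorizer for HashingVectorizer (which is stateless and needs no fitted vocabulary) and uses MultinomialNB.partial_fit. The toy texts, labels, batch size, and n_features value are assumptions chosen for the example.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB


def iter_batches(texts, labels, batch_size):
    """Yield consecutive (texts, labels) slices of at most batch_size items."""
    for start in range(0, len(texts), batch_size):
        stop = start + batch_size
        yield texts[start:stop], labels[start:stop]


# Toy stand-in for the real data, which would be read from disk in chunks.
texts = [
    "air conditioner switch", "control transformer", "cooling pad",
    "drilling machine", "hair smoothing kit", "transformer switch",
]
labels = [
    "electrical", "electrical", "cooling",
    "machinery", "cosmetics", "electrical",
]

# alternate_sign=False keeps feature values non-negative, as MultinomialNB requires.
vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)
clf = MultinomialNB()

# partial_fit needs the full list of classes on the first call.
all_classes = np.unique(labels)

for text_batch, label_batch in iter_batches(texts, labels, batch_size=2):
    X_batch = vectorizer.transform(text_batch)  # sparse, one batch only
    clf.partial_fit(X_batch, label_batch, classes=all_classes)

predictions = clf.predict(vectorizer.transform(["transformer switch", "cooling pad"]))
print(predictions)
```

The model itself still stores one row of counts per class, so with around 7k categories the number of hashed features has to be chosen with that in mind.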
