Hi Andrew,

Yes, I am vectorizing the new text documents with the same dictionary file used to train the model.
Thanks,
-Hersheeta

On Fri, Oct 24, 2014 at 10:58 PM, Andrew Palumbo <[email protected]> wrote:

> Hello Hersheeta,
>
> Are you vectorizing the new text using the same dictionary as you used to
> train the models? If not, this will likely severely impact the performance
> of the classifier.
>
> > Date: Fri, 24 Oct 2014 21:28:06 +0530
> > Subject: Categorization of documents using clustering and classification
> > From: [email protected]
> > To: [email protected]
> >
> > Hi,
> >
> > I have a collection of crawled text documents on different topics which I
> > want to categorize into pre-decided categories like travel, sports,
> > education, etc.
> > For this I first clustered these documents using k-means clustering
> > and then built a complementary naive Bayes model from the clustered
> > documents.
> > The accuracy and reliability of the model were 83% and 63% respectively.
> > Now the problem is that, on deploying the model, the results recorded are
> > absurd
> > (e.g. a sports document is categorized under the business category).
> > On analyzing the problem, I found that the clusters formed were not clean
> > (they contained unrelated documents), which may have led to the creation
> > of a wrong dictionary file.
> >
> > In order to avoid this, is there any other way to get the input data
> > preprocessed and clustered?
> > or
> > Is there any other alternative approach that could be used for the
> > categorization?
> >
> > Thanks,
> > -Hersheeta
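
For readers following the dictionary question above, here is a minimal sketch of what "vectorizing new text with the same dictionary" means. It is plain Python and not Mahout's API; the helper names (`tokenize`, `build_dictionary`, `vectorize`) and the sample documents are made up for illustration. The point is only that the term-to-index mapping is fixed at training time and reused unchanged for new documents, with unseen terms dropped.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for a real analyzer)."""
    return text.lower().split()


def build_dictionary(train_docs):
    """Map each term seen in the training documents to a fixed column index."""
    vocab = sorted({tok for doc in train_docs for tok in tokenize(doc)})
    return {term: idx for idx, term in enumerate(vocab)}


def vectorize(doc, dictionary):
    """Return a term-frequency vector in the dictionary's column order.

    Terms that are not in the training dictionary are ignored, so the
    vector always has the same length and layout the model was trained on.
    """
    counts = Counter(tokenize(doc))
    vec = [0] * len(dictionary)
    for term, n in counts.items():
        idx = dictionary.get(term)
        if idx is not None:
            vec[idx] = n
    return vec


# Hypothetical training corpus; the dictionary is built once, from it alone.
train_docs = ["football match score", "flight hotel booking"]
dictionary = build_dictionary(train_docs)

# A new document is vectorized with that same dictionary.
# "update" was never seen in training, so it does not get a column.
new_vec = vectorize("football score update", dictionary)
```

If a new dictionary were built from the incoming documents instead, the same column index could refer to a different term than it did during training, and the classifier's learned weights would no longer line up with the features.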
