Hi Andrew,
Yes, I am vectorizing the new text documents with the same dictionary file
used to train the model.
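
In case it helps to pin down what I mean by "the same dictionary", here is a
minimal sketch. It is plain Python, not Mahout code, and the function names
(build_dictionary, vectorize) and the toy documents are made up purely for
illustration. The assumption is a simple term-frequency encoding: the
dictionary is built once from the training documents, and new documents are
mapped through that same dictionary, so unseen terms are dropped rather than
given new indices.

```python
from collections import Counter


def build_dictionary(training_docs):
    """Map each distinct training term to a fixed column index."""
    terms = sorted({term for doc in training_docs for term in doc.lower().split()})
    return {term: index for index, term in enumerate(terms)}


def vectorize(doc, dictionary):
    """Term-frequency vector for doc, using only the given dictionary.

    Terms absent from the dictionary are dropped, so every vector has the
    same length and the same column meaning as the training vectors.
    """
    counts = Counter(doc.lower().split())
    vector = [0] * len(dictionary)
    for term, count in counts.items():
        index = dictionary.get(term)
        if index is not None:
            vector[index] = count
    return vector


# Hypothetical toy data: the dictionary comes from the training set only.
training_docs = ["football match score", "stock market score"]
dictionary = build_dictionary(training_docs)

# A new document is encoded with the training dictionary; "transfer" is
# not in it, so that term is ignored.
new_doc = "football score football transfer"
new_vector = vectorize(new_doc, dictionary)
```

If a fresh dictionary were built from the new documents instead, the same
column index could refer to a different term than it did at training time,
which is the mismatch Andrew is asking about.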

Thanks,
-Hersheeta

On Fri, Oct 24, 2014 at 10:58 PM, Andrew Palumbo <[email protected]> wrote:

>
> Hello Hersheeta,
>
> Are you vectorizing the new text using the same dictionary as you used to
> train the models?  If not, this will likely severely impact the performance
> of the classifier.
>
>
>
> > Date: Fri, 24 Oct 2014 21:28:06 +0530
> > Subject: Categorization of documents using clustering and classification
> > From: [email protected]
> > To: [email protected]
> >
> > Hi,
> >
> > I have a collection of crawled text documents on different topics which I
> > want to categorize into pre-decided categories like travel, sports,
> > education, etc.
> > For this, I first clustered these documents using k-means clustering
> > and then built a complementary naive Bayes model from the clustered
> > documents.
> > The accuracy and reliability of the model were 83% and 63% respectively.
> > The problem is that, on deploying the model, the recorded results are
> > absurd
> > (e.g., a sports document is categorized under the business category).
> > On analyzing the problem, I found that the clusters formed were not clean
> > (they contained unrelated documents), which may have led to the creation
> > of an incorrect dictionary file.
> >
> > To avoid this, is there any other way to get the input data
> > preprocessed and clustered?
> > Or
> > is there an alternative approach that could be used for the
> > categorization?
> >
> > Thanks,
> > -Hersheeta
>
>
