Re: Latent Semantic Analysis for Document Categorization

Ted Dunning Mon, 30 Mar 2015 14:03:51 -0700

Hersheeta,

For linking information, you should go for whatever you can find.  For
instance:


1) If the documents are HTML, href attributes (i.e. web links) are an ideal
kind of linkage.  This is what PageRank was based on.

2) If the documents refer to people, places or things, then you have a
second-order linkage: documents that mention the same entity are linked
through that entity.

3) If the documents have academic citations that you can resolve, then you
have something comparable to (1).

4) All of the documents by a single author are linked by common authorship.

5) Documents viewed by the same person are linked, especially if the
documents are viewed consecutively or in a short time span.

I would expect that once you get started, you would be able to come up with
at least this many additional kinds of linkage.
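
Several of the linkages above (shared entities, common authorship, shared
viewers) have the same shape: two documents are linked when they share some
key.  A minimal sketch of that idea in Python follows.  The function name
link_by_shared_key and the sample document ids, authors and viewers are
hypothetical, made up purely for illustration.

```python
from collections import defaultdict
from itertools import combinations


def link_by_shared_key(doc_keys):
    """Link every pair of documents that share at least one key.

    doc_keys maps a document id to an iterable of keys, for example
    author names, named entities, or ids of users who viewed it.
    Returns a set of (doc_a, doc_b) pairs with doc_a < doc_b.
    """
    # Group documents under each key they carry.
    docs_by_key = defaultdict(set)
    for doc_id, keys in doc_keys.items():
        for key in keys:
            docs_by_key[key].add(doc_id)

    # Every pair of documents inside one group is linked.
    links = set()
    for docs in docs_by_key.values():
        for a, b in combinations(sorted(docs), 2):
            links.add((a, b))
    return links


# Hypothetical sample data: authorship and viewing history.
authors = {"d1": ["alice"], "d2": ["alice"], "d3": ["bob"]}
viewers = {"d1": ["u7"], "d3": ["u7", "u9"], "d2": ["u9"]}

author_links = link_by_shared_key(authors)
view_links = link_by_shared_key(viewers)
```

Note that a very common key (a prolific author, a heavy user) links a large
number of document pairs, so in practice you would probably want to cap or
down-weight such keys.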

Some of these kinds of linkage might already be usable by your classifiers,
but they probably aren't highlighted as much as they might be.  For
instance, you might extract the author into a special field.  Likewise with
named entities.  For others, it is almost certain that you are not
including the information.  For example, outgoing references are visible
to your classifier to some degree, but incoming references are almost
certainly not visible.
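
One way to make incoming references visible is to invert the outgoing link
map and add the result to each document as extra tokens.  The sketch below
assumes you already have a map from each document to the documents it
references.  The names incoming_links and link_tokens and the OUT_/IN_ token
prefixes are hypothetical choices for illustration, not part of Mahout.

```python
from collections import defaultdict


def incoming_links(outgoing):
    """Invert a map of doc -> referenced docs into doc -> referencing docs."""
    incoming = defaultdict(set)
    for source, targets in outgoing.items():
        for target in targets:
            incoming[target].add(source)
    return incoming


def link_tokens(doc_id, outgoing, incoming):
    """Build extra tokens describing a document's outgoing and incoming links.

    The tokens can be appended to the document's word tokens so that a
    bag-of-words classifier sees the linkage as ordinary features.
    """
    out_tokens = ["OUT_" + t for t in sorted(outgoing.get(doc_id, ()))]
    in_tokens = ["IN_" + s for s in sorted(incoming.get(doc_id, ()))]
    return out_tokens + in_tokens


# Hypothetical link data: d1 references d2 and d3, d2 references d3.
outgoing = {"d1": {"d2", "d3"}, "d2": {"d3"}, "d3": set()}
incoming = incoming_links(outgoing)

tokens_d3 = link_tokens("d3", outgoing, incoming)
```

Here d3 has no outgoing references at all, so without the inversion step
its linkage to d1 and d2 would never reach the classifier.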



On Mon, Mar 30, 2015 at 12:23 AM, Hersheeta Chandankar <
hersheetachandan...@gmail.com> wrote:

> Hi Ted,
>
> Thank you for a quick reply.
> It would be of great help if you could please explain what kind of 'linking
> information between documents' I should look for.
>
> On Fri, Mar 27, 2015 at 2:45 AM, Ted Dunning <ted.dunn...@gmail.com>
> wrote:
>
> > Also, if you can include linking information between documents, you should
> > be able to substantially improve accuracy.  Same goes for behavioral data
> > like browsing history.
> >
> >
> >
> > On Thu, Mar 26, 2015 at 6:10 AM, Hersheeta Chandankar <
> > hersheetachandan...@gmail.com> wrote:
> >
> > > Thank you so much Chirag and David for your suggestion.
> > > I'll surely try it.
> > >
> > > On Thu, Mar 26, 2015 at 6:31 PM, 3316 Chirag Nagpal <
> > > chiragnagpal_12...@aitpune.edu.in> wrote:
> > >
> > > > > A better approach I can think of for the aforementioned task is to use
> > > > > Latent Dirichlet Allocation.
> > > > >
> > > > > You can force LDA to learn topics with certain specific words by
> > > > > assigning higher probability values to those words in the initial
> > > > > Dirichlet distribution.
> > > >
> > > > That way you will be able to discover topics better
> > > >
> > > > Chirag Nagpal
> > > > Department of Computer Engineering
> > > > Army Institute of Technology, Pune
> > > >
> > > > ________________________________________
> > > > From: Hersheeta Chandankar <hersheetachandan...@gmail.com>
> > > > Sent: Thursday, March 26, 2015 6:25 PM
> > > > To: user@mahout.apache.org
> > > > Subject: Latent Semantic Analysis for Document Categorization
> > > >
> > > > Hi,
> > > >
> > > > > I'm working on a document categorization project wherein I have some
> > > > > crawled text documents on different topics which I want to categorize
> > > > > into pre-decided categories like travel, sports, education etc.
> > > > > Currently the approach I've used is building a NaiveBayes
> > > > > Classification model in Mahout, which has given a good accuracy of
> > > > > 70%-75%. But I would still like to improve the accuracy by retrieving
> > > > > the semantic dependencies between words of the documents.
> > > > > I've read about Latent Semantic Analysis (LSA), which creates a
> > > > > term-document matrix and subjects it to a mathematical transformation
> > > > > called Singular Value Decomposition (SVD).
> > > > > I'd thought of first subjecting the raw documents to LSA, followed by
> > > > > k-means clustering on the LSA output, and then giving the clustered
> > > > > output as input to the NaiveBayes classifier.
> > > > > But on trying out LSA in Mahout, the end result seemed to be in
> > > > > numerical format, which after clustering was not accepted by the
> > > > > NaiveBayes classifier.
> > > >
> > > > > Is my experimental approach wrong? Has anybody worked on a similar
> > > > > issue?
> > > > > Could someone help me with the implementation of LSA or suggest any
> > > > > other approach for semantic analysis of text documents?
> > > >
> > > > Thanks
> > > > -Hersheeta
> > > >
> > >
> >
>
