Re: LSA Implementation

2007-11-28 Thread Eswar K
Lance,

It does cover European languages, but pretty much nothing on Asian languages
(CJK).

- Eswar

On Nov 28, 2007 1:51 AM, Norskog, Lance [EMAIL PROTECTED] wrote:

 WordNet itself is English-only. There are various ontology projects for
 it.

 http://www.globalwordnet.org/ is a separate world language database
 project. I found it at the bottom of the WordNet wikipedia page. Thanks
 for starting me on the search!

 Lance

 -Original Message-
 From: Eswar K [mailto:[EMAIL PROTECTED]
 Sent: Monday, November 26, 2007 6:50 PM
 To: solr-user@lucene.apache.org
 Subject: Re: LSA Implementation

 The languages also include CJK :) among others.

 - Eswar

 On Nov 27, 2007 8:16 AM, Norskog, Lance [EMAIL PROTECTED] wrote:

  The WordNet project at Princeton (USA) is a large database of
 synonyms.
  If you're only working in English this might be useful instead of
  running your own analyses.
 
  http://en.wikipedia.org/wiki/WordNet
  http://wordnet.princeton.edu/
 
  Lance
 
  -Original Message-
  From: Eswar K [mailto:[EMAIL PROTECTED]
  Sent: Monday, November 26, 2007 6:34 PM
  To: solr-user@lucene.apache.org
  Subject: Re: LSA Implementation
 
  In addition to recording which keywords a document contains, the
  method examines the document collection as a whole, to see which other

  documents contain some of those same words. This algo should consider
  documents that have many words in common to be semantically close, and

  ones with few words in common to be semantically distant. This simple
  method correlates surprisingly well with how a human being, looking at

  content, might classify a document collection. Although the algorithm
  doesn't understand anything about what the words *mean*, the patterns
  it notices can make it seem astonishingly intelligent.
 
  When you search such an index, the search engine looks at
  similarity values it has calculated for every content word, and
  returns the documents that it thinks best fit the query. Because two
  documents may be semantically very close even if they do not share a
  particular keyword,
 
  Where a plain keyword search will fail if there is no exact match,
  this algo will often return relevant documents that don't contain the
  keyword at all.
 
  - Eswar
 
  On Nov 27, 2007 7:51 AM, Marvin Humphrey [EMAIL PROTECTED]
 wrote:
 
  
   On Nov 26, 2007, at 6:06 PM, Eswar K wrote:
  
We essentially are looking at having an implementation for doing
search which can return documents having conceptually similar
words without necessarily having the original word searched for.
  
   Very challenging.  Say someone searches for LSA and hits an
   archived
 
   version of the mail you sent to this list.  LSA is a reasonably
   discriminating term.  But so is Eswar.
  
   If you knew that the original term was LSA, then you might look
   for documents near it in term vector space.  But if you don't know
   the original term, only the content of the document, how do you know

   whether you should look for docs near lsa or eswar?
  
   Marvin Humphrey
   Rectangular Research
   http://www.rectangular.com/
  
  
  
 



RE: LSA Implementation

2007-11-27 Thread Norskog, Lance
WordNet itself is English-only. There are various ontology projects for
it.

http://www.globalwordnet.org/ is a separate world language database
project. I found it at the bottom of the WordNet wikipedia page. Thanks
for starting me on the search!

Lance 

-Original Message-
From: Eswar K [mailto:[EMAIL PROTECTED] 
Sent: Monday, November 26, 2007 6:50 PM
To: solr-user@lucene.apache.org
Subject: Re: LSA Implementation

The languages also include CJK :) among others.

- Eswar

On Nov 27, 2007 8:16 AM, Norskog, Lance [EMAIL PROTECTED] wrote:

 The WordNet project at Princeton (USA) is a large database of
synonyms.
 If you're only working in English this might be useful instead of 
 running your own analyses.

 http://en.wikipedia.org/wiki/WordNet
 http://wordnet.princeton.edu/

 Lance

 -Original Message-
 From: Eswar K [mailto:[EMAIL PROTECTED]
 Sent: Monday, November 26, 2007 6:34 PM
 To: solr-user@lucene.apache.org
 Subject: Re: LSA Implementation

 In addition to recording which keywords a document contains, the 
 method examines the document collection as a whole, to see which other

 documents contain some of those same words. This algo should consider
 documents that have many words in common to be semantically close, and

 ones with few words in common to be semantically distant. This simple 
 method correlates surprisingly well with how a human being, looking at

 content, might classify a document collection. Although the algorithm 
 doesn't understand anything about what the words *mean*, the patterns 
 it notices can make it seem astonishingly intelligent.

 When you search such an index, the search engine looks at
 similarity values it has calculated for every content word, and 
 returns the documents that it thinks best fit the query. Because two 
 documents may be semantically very close even if they do not share a 
 particular keyword,

 Where a plain keyword search will fail if there is no exact match, 
 this algo will often return relevant documents that don't contain the 
 keyword at all.

 - Eswar

 On Nov 27, 2007 7:51 AM, Marvin Humphrey [EMAIL PROTECTED]
wrote:

 
  On Nov 26, 2007, at 6:06 PM, Eswar K wrote:
 
   We essentially are looking at having an implementation for doing 
   search which can return documents having conceptually similar 
   words without necessarily having the original word searched for.
 
  Very challenging.  Say someone searches for LSA and hits an 
  archived

  version of the mail you sent to this list.  LSA is a reasonably 
  discriminating term.  But so is Eswar.
 
  If you knew that the original term was LSA, then you might look 
  for documents near it in term vector space.  But if you don't know 
  the original term, only the content of the document, how do you know

  whether you should look for docs near lsa or eswar?
 
  Marvin Humphrey
  Rectangular Research
  http://www.rectangular.com/
 
 
 



Re: LSA Implementation

2007-11-27 Thread Grant Ingersoll
Using WordNet may require having some type of disambiguation approach;
otherwise you can end up w/ a lot of synonyms.  I would also look
into how much coverage there is for non-English languages.


If you have the resources, you may be better off developing/finding  
your own synonym/concept list based on your genres.  You may also look  
into other approaches for assigning concepts off line and adding them  
to the document.
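
If you do end up with your own flat synonym/concept list, the simplest
way to plug it in is Solr's stock SynonymFilterFactory on a query-side
analyzer. A minimal schema.xml sketch follows; the field type name and
the concepts.txt file name are placeholders, not anything shipped with
Solr:

<!-- schema.xml: expand query terms against a custom synonym/concept list -->
<fieldType name="text_concepts" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- concepts.txt holds comma-separated equivalence lines, e.g.: car, automobile, vehicle -->
    <filter class="solr.SynonymFilterFactory" synonyms="concepts.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>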


-Grant

On Nov 27, 2007, at 3:21 PM, Norskog, Lance wrote:

WordNet itself is English-only. There are various ontology projects  
for

it.

http://www.globalwordnet.org/ is a separate world language database
project. I found it at the bottom of the WordNet wikipedia page.  
Thanks

for starting me on the search!

Lance

-Original Message-
From: Eswar K [mailto:[EMAIL PROTECTED]
Sent: Monday, November 26, 2007 6:50 PM
To: solr-user@lucene.apache.org
Subject: Re: LSA Implementation

The languages also include CJK :) among others.

- Eswar

On Nov 27, 2007 8:16 AM, Norskog, Lance [EMAIL PROTECTED] wrote:


The WordNet project at Princeton (USA) is a large database of

synonyms.

If you're only working in English this might be useful instead of
running your own analyses.

http://en.wikipedia.org/wiki/WordNet
http://wordnet.princeton.edu/

Lance

-Original Message-
From: Eswar K [mailto:[EMAIL PROTECTED]
Sent: Monday, November 26, 2007 6:34 PM
To: solr-user@lucene.apache.org
Subject: Re: LSA Implementation

In addition to recording which keywords a document contains, the
method examines the document collection as a whole, to see which  
other



documents contain some of those same words. This algo should consider
documents that have many words in common to be semantically close,  
and



ones with few words in common to be semantically distant. This simple
method correlates surprisingly well with how a human being, looking  
at



content, might classify a document collection. Although the algorithm
doesn't understand anything about what the words *mean*, the patterns
it notices can make it seem astonishingly intelligent.

When you search such an index, the search engine looks at
similarity values it has calculated for every content word, and
returns the documents that it thinks best fit the query. Because two
documents may be semantically very close even if they do not share a
particular keyword,

Where a plain keyword search will fail if there is no exact match,
this algo will often return relevant documents that don't contain the
keyword at all.

- Eswar

On Nov 27, 2007 7:51 AM, Marvin Humphrey [EMAIL PROTECTED]

wrote:




On Nov 26, 2007, at 6:06 PM, Eswar K wrote:


We essentially are looking at having an implementation for doing
search which can return documents having conceptually similar
words without necessarily having the original word searched for.


Very challenging.  Say someone searches for LSA and hits an
archived



version of the mail you sent to this list.  LSA is a reasonably
discriminating term.  But so is Eswar.

If you knew that the original term was LSA, then you might look
for documents near it in term vector space.  But if you don't know
the original term, only the content of the document, how do you know



whether you should look for docs near lsa or eswar?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/







--
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ





Re: LSA Implementation

2007-11-26 Thread Grant Ingersoll
LSA (http://en.wikipedia.org/wiki/Latent_semantic_indexing) is  
patented, so it is not likely to happen unless the authors donate the  
patent to the ASF.


-Grant


On Nov 26, 2007, at 8:23 AM, Eswar K wrote:


All,

Is there any plan to implement Latent Semantic Analysis as part of  
Solr

anytime in the near future?

Regards,
Eswar


--
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ





Re: LSA Implementation

2007-11-26 Thread Jack
Interesting. Patents are valid for 20 years so it expires next year? :)
PLSA does not seem to have been patented, at least not mentioned in
http://en.wikipedia.org/wiki/Probabilistic_latent_semantic_analysis

On Nov 26, 2007 6:58 AM, Grant Ingersoll [EMAIL PROTECTED] wrote:
 LSA (http://en.wikipedia.org/wiki/Latent_semantic_indexing) is
 patented, so it is not likely to happen unless the authors donate the
 patent to the ASF.

 -Grant



 On Nov 26, 2007, at 8:23 AM, Eswar K wrote:

  All,
 
  Is there any plan to implement Latent Semantic Analysis as part of
  Solr
  anytime in the near future?
 
  Regards,
  Eswar

 --
 Grant Ingersoll
 http://lucene.grantingersoll.com

 Lucene Helpful Hints:
 http://wiki.apache.org/lucene-java/BasicsOfPerformance
 http://wiki.apache.org/lucene-java/LuceneFAQ






Re: LSA Implementation

2007-11-26 Thread Eswar K
I was just searching for info on LSA and came across Semantic Indexing
project under GNU license... which of course is still under development in C++
though.

- Eswar

On Nov 26, 2007 9:56 PM, Jack [EMAIL PROTECTED] wrote:

 Interesting. Patents are valid for 20 years so it expires next year? :)
 PLSA does not seem to have been patented, at least not mentioned in
 http://en.wikipedia.org/wiki/Probabilistic_latent_semantic_analysis

 On Nov 26, 2007 6:58 AM, Grant Ingersoll [EMAIL PROTECTED] wrote:
  LSA (http://en.wikipedia.org/wiki/Latent_semantic_indexing) is
  patented, so it is not likely to happen unless the authors donate the
  patent to the ASF.
 
  -Grant
 
 
 
  On Nov 26, 2007, at 8:23 AM, Eswar K wrote:
 
   All,
  
   Is there any plan to implement Latent Semantic Analysis as part of
   Solr
   anytime in the near future?
  
   Regards,
   Eswar
 
  --
  Grant Ingersoll
  http://lucene.grantingersoll.com
 
  Lucene Helpful Hints:
  http://wiki.apache.org/lucene-java/BasicsOfPerformance
  http://wiki.apache.org/lucene-java/LuceneFAQ
 
 
 
 



Re: LSA Implementation

2007-11-26 Thread Brian Whitman


On Nov 26, 2007 6:58 AM, Grant Ingersoll [EMAIL PROTECTED] wrote:

LSA (http://en.wikipedia.org/wiki/Latent_semantic_indexing) is
patented, so it is not likely to happen unless the authors donate the
patent to the ASF.

-Grant




There are many ways to catch a bird... LSA reduces to SVD on the TF  
graph. I have had limited success using JAMA's SVD, which is PD. It's  
pure java; for something serious you'd want to wrap the hard bits in  
MKL/Accelerate.


A more interesting Solr-related question is where a very heavy
process like SVD would operate. You'd want to run the 'training' half
of it separately from indexing or querying. It'd almost be like an
optimize. Is there any hook right now to give Solr a command like
<updateModels/> and map it to the class in the solrconfig? The
'classify' half of the SVD can happen at query or index time, very
quickly; I imagine that could even be a custom field type.
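
One way to keep the 'classify' half cheap is the usual folding-in trick:
rather than recomputing the SVD, project a new query or document vector
into the already-computed k-dimensional space and score it by cosine
similarity against the stored document vectors. A hypothetical sketch
(using the JAMA classes mentioned above; uk and sk stand for the
truncated U and S factors of a previously computed SVD, and nothing here
is a Solr API):

import Jama.Matrix;

public class FoldIn {
    // q:  1 x numTerms row vector of term counts for the new query/document
    // uk: numTerms x k truncated left singular vectors
    // sk: k x k diagonal matrix of the top singular values
    // Returns the 1 x k representation in concept space: q_k = q * U_k * S_k^-1
    static Matrix foldIn(Matrix q, Matrix uk, Matrix sk) {
        return q.times(uk).times(sk.inverse());
    }
}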




Re: LSA Implementation

2007-11-26 Thread Renaud Delbru

LDA (Latent Dirichlet Allocation) is a similar technique that extends pLSI.
You can find some implementation in C++ and Java on the Web.

Grant Ingersoll wrote:
Interesting.  I am not a lawyer, but my understanding has always been 
that this is not something we could do.


The question has come up from time to time on the Lucene mailing list:
http://www.gossamer-threads.com/lists/engine?list=lucene&do=search_results&search_forum=forum_3&search_string=Latent+Semantic&search_type=AND



That being said, there may be other approaches that do similar things 
that aren't covered by a patent, I don't know.


Is there something specific you want to do, or are you just going by 
the promise of better results using LSI?


I suppose if someone said they had a patch for Lucene/Solr that 
implemented it, we could ask on legal-discuss for advice.


-Grant

On Nov 26, 2007, at 1:13 PM, Eswar K wrote:


I was just searching for info on LSA and came across Semantic Indexing
project under GNU license... which of course is still under development
in C++

though.

- Eswar

On Nov 26, 2007 9:56 PM, Jack [EMAIL PROTECTED] wrote:


Interesting. Patents are valid for 20 years so it expires next year? :)
PLSA does not seem to have been patented, at least not mentioned in
http://en.wikipedia.org/wiki/Probabilistic_latent_semantic_analysis

On Nov 26, 2007 6:58 AM, Grant Ingersoll [EMAIL PROTECTED] wrote:

LSA (http://en.wikipedia.org/wiki/Latent_semantic_indexing) is
patented, so it is not likely to happen unless the authors donate the
patent to the ASF.

-Grant



On Nov 26, 2007, at 8:23 AM, Eswar K wrote:


All,

Is there any plan to implement Latent Semantic Analysis as part of
Solr
anytime in the near future?

Regards,
Eswar


--
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ








--
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ





--
Renaud Delbru,
E.C.S., M.Sc. Student,
Semantic Information Systems and
Language Engineering Group (SmILE),
Digital Enterprise Research Institute,
National University of Ireland, Galway.
http://smile.deri.ie/


Re: LSA Implementation

2007-11-26 Thread Chris Hostetter
: A more interesting solr related question is where a very heavy process like
: SVD would operate. You'd want to run the 'training' half of it separate from a
: indexing or querying. It'd almost be like an optimize. Is there any hook right
: now to give Solr a command like updateModels/ and map it to the class in
: the solrconfig? The classify half of the SVD can happen at query or index
: time, very quickly, I imagine that could even be a custom field type.

The EventListener plugin type lets you register arbitrary Java code to be
run after a commit or an optimize (before a new searcher is opened) ...
this is the same hook mechanism that is used to trigger snapshots on
masters and do explicit warming on slaves.

There was talk about creating a request handler that could be used to
trigger arbitrary events and execute all of the EventListeners (so you
could create a new updateModels event type, independent of commit and
optimize), but no one has ever submitted a patch...

http://issues.apache.org/jira/browse/SOLR-371
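
For reference, a rough sketch of what such a postCommit listener could
look like. The exact interface and package names vary across Solr
releases, so treat this as an outline under those assumptions rather
than drop-in code; the class name and config values are placeholders.

// Hypothetical postCommit hook that rebuilds an offline LSA/SVD model.
// Registered in solrconfig.xml (inside <updateHandler>), e.g.:
//   <listener event="postCommit" class="com.example.RebuildSvdModelListener"/>
// Assumes the Solr 1.x SolrEventListener contract (init / postCommit /
// newSearcher); the NamedList package moved between releases, so adjust
// the imports to your version.
import org.apache.solr.core.SolrEventListener;
import org.apache.solr.search.SolrIndexSearcher;
import org.apache.solr.util.NamedList;

public class RebuildSvdModelListener implements SolrEventListener {

    public void init(NamedList args) {
        // read settings from solrconfig.xml, e.g. target rank k, model output path
    }

    public void postCommit() {
        // fires after a commit or optimize, before a new searcher is opened:
        // kick off the expensive SVD "training" pass in the background so it
        // doesn't block the commit itself
        new Thread(new Runnable() {
            public void run() { /* recompute the term-document SVD, write the model out */ }
        }).start();
    }

    public void newSearcher(SolrIndexSearcher newSearcher, SolrIndexSearcher currentSearcher) {
        // a natural place to reload the freshly written model for query-time use
    }
}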




-Hoss



Re: LSA Implementation

2007-11-26 Thread Eswar K
We essentially are looking at having an implementation for doing search
which can return documents having conceptually similar words without
necessarily having the original word searched for.

- Eswar

On Nov 27, 2007 12:06 AM, Grant Ingersoll [EMAIL PROTECTED] wrote:

 Interesting.  I am not a lawyer, but my understanding has always been
 that this is not something we could do.

 The question has come up from time to time on the Lucene mailing list:

 http://www.gossamer-threads.com/lists/engine?list=lucene&do=search_results&search_forum=forum_3&search_string=Latent+Semantic&search_type=AND

 That being said, there may be other approaches that do similar things
 that aren't covered by a patent, I don't know.

 Is there something specific you want to do, or are you just going by
 the promise of better results using LSI?

 I suppose if someone said they had a patch for Lucene/Solr that
 implemented it, we could ask on legal-discuss for advice.

 -Grant

 On Nov 26, 2007, at 1:13 PM, Eswar K wrote:

  I was just searching for info on LSA and came across Semantic Indexing
  project under GNU license... which of course is still under
  development in C++
  though.
 
  - Eswar
 
  On Nov 26, 2007 9:56 PM, Jack [EMAIL PROTECTED] wrote:
 
  Interesting. Patents are valid for 20 years so it expires next
  year? :)
  PLSA does not seem to have been patented, at least not mentioned in
  http://en.wikipedia.org/wiki/Probabilistic_latent_semantic_analysis
 
  On Nov 26, 2007 6:58 AM, Grant Ingersoll [EMAIL PROTECTED] wrote:
  LSA (http://en.wikipedia.org/wiki/Latent_semantic_indexing) is
  patented, so it is not likely to happen unless the authors donate
  the
  patent to the ASF.
 
  -Grant
 
 
 
  On Nov 26, 2007, at 8:23 AM, Eswar K wrote:
 
  All,
 
  Is there any plan to implement Latent Semantic Analysis as part of
  Solr
  anytime in the near future?
 
  Regards,
  Eswar
 
  --
  Grant Ingersoll
  http://lucene.grantingersoll.com
 
  Lucene Helpful Hints:
  http://wiki.apache.org/lucene-java/BasicsOfPerformance
  http://wiki.apache.org/lucene-java/LuceneFAQ
 
 
 
 
 

 --
 Grant Ingersoll
 http://lucene.grantingersoll.com

 Lucene Helpful Hints:
 http://wiki.apache.org/lucene-java/BasicsOfPerformance
 http://wiki.apache.org/lucene-java/LuceneFAQ






Re: LSA Implementation

2007-11-26 Thread Marvin Humphrey


On Nov 26, 2007, at 6:06 PM, Eswar K wrote:

We essentially are looking at having an implementation for doing
search which can return documents having conceptually similar words without
necessarily having the original word searched for.


Very challenging.  Say someone searches for "LSA" and hits an
archived version of the mail you sent to this list.  "LSA" is a
reasonably discriminating term.  But so is "Eswar".


If you knew that the original term was "LSA", then you might look for
documents near it in term vector space.  But if you don't know the
original term, only the content of the document, how do you know
whether you should look for docs near "lsa" or "eswar"?


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/




Re: LSA Implementation

2007-11-26 Thread Eswar K
In addition to recording which keywords a document contains, the method
examines the document collection as a whole, to see which other documents
contain some of those same words. This algorithm should consider documents
that have many words in common to be semantically close, and ones with few
words in common to be semantically distant. This simple method correlates
surprisingly well with how a human being, looking at content, might
classify a document collection. Although the algorithm doesn't understand
anything about what the words *mean*, the patterns it notices can make it
seem astonishingly intelligent.

When you search such an index, the search engine looks at the similarity
values it has calculated for every content word, and returns the documents
that it thinks best fit the query. Because two documents may be
semantically very close even if they do not share a particular keyword,
this approach will often return relevant documents that don't contain the
keyword at all, where a plain keyword search would fail for lack of an
exact match.
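
For concreteness: the reduction step that makes this tractable at scale
is typically a truncated SVD of the term-document matrix, which is what
LSA does. Below is a toy, non-Solr sketch using the JAMA library that
Brian mentions elsewhere in this thread; the counts and the rank k=2 are
made-up values, purely to show the shape of the computation.

import Jama.Matrix;
import Jama.SingularValueDecomposition;

public class LsaSketch {
    public static void main(String[] args) {
        // Toy term-document count matrix: rows are terms, columns are documents.
        // (Made-up numbers, purely for illustration.)
        double[][] counts = {
            {2, 0, 1, 0},   // "index"
            {1, 1, 0, 0},   // "search"
            {0, 2, 0, 1},   // "semantic"
            {0, 0, 1, 2},   // "synonym"
            {1, 0, 2, 1},   // "query"
        };
        Matrix a = new Matrix(counts);

        // Truncated SVD: A ~= U_k * S_k * V_k^T, keeping k latent "concept" dimensions.
        int k = 2;
        SingularValueDecomposition svd = a.svd();
        Matrix sk = svd.getS().getMatrix(0, k - 1, 0, k - 1);
        Matrix vk = svd.getV().getMatrix(0, a.getColumnDimension() - 1, 0, k - 1);

        // Each row of V_k * S_k is one document, re-expressed in concept space.
        Matrix docs = vk.times(sk);

        // Cosine similarity in the reduced space: two documents can score as
        // close here even if they share no literal keyword.
        System.out.println(cosine(docs, 0, 3, k));
    }

    static double cosine(Matrix docs, int d1, int d2, int k) {
        double dot = 0, n1 = 0, n2 = 0;
        for (int j = 0; j < k; j++) {
            dot += docs.get(d1, j) * docs.get(d2, j);
            n1  += docs.get(d1, j) * docs.get(d1, j);
            n2  += docs.get(d2, j) * docs.get(d2, j);
        }
        return dot / Math.sqrt(n1 * n2);
    }
}

In a real system the counts would be weighted (e.g. tf-idf) and the
matrices far too large for a dense in-memory SVD, which is the scaling
problem Marvin raises below.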

- Eswar

On Nov 27, 2007 7:51 AM, Marvin Humphrey [EMAIL PROTECTED] wrote:


 On Nov 26, 2007, at 6:06 PM, Eswar K wrote:

  We essentially are looking at having an implementation for doing
  search
  which can return documents having conceptually similar words without
  necessarily having the original word searched for.

 Very challenging.  Say someone searches for LSA and hits an
 archived version of the mail you sent to this list.  LSA is a
 reasonably discriminating term.  But so is Eswar.

 If you knew that the original term was LSA, then you might look for
 documents near it in term vector space.  But if you don't know the
 original term, only the content of the document, how do you know
 whether you should look for docs near lsa or eswar?

 Marvin Humphrey
 Rectangular Research
 http://www.rectangular.com/





RE: LSA Implementation

2007-11-26 Thread Norskog, Lance
The WordNet project at Princeton (USA) is a large database of synonyms.
If you're only working in English this might be useful instead of
running your own analyses.

http://en.wikipedia.org/wiki/WordNet
http://wordnet.princeton.edu/

Lance

-Original Message-
From: Eswar K [mailto:[EMAIL PROTECTED] 
Sent: Monday, November 26, 2007 6:34 PM
To: solr-user@lucene.apache.org
Subject: Re: LSA Implementation

In addition to recording which keywords a document contains, the method
examines the document collection as a whole, to see which other
documents contain some of those same words. This algo should consider
documents that have many words in common to be semantically close, and
ones with few words in common to be semantically distant. This simple
method correlates surprisingly well with how a human being, looking at
content, might classify a document collection. Although the algorithm
doesn't understand anything about what the words *mean*, the patterns it
notices can make it seem astonishingly intelligent.

When you search such an index, the search engine looks at similarity
values it has calculated for every content word, and returns the
documents that it thinks best fit the query. Because two documents may
be semantically very close even if they do not share a particular
keyword,

Where a plain keyword search will fail if there is no exact match, this
algo will often return relevant documents that don't contain the keyword
at all.

- Eswar

On Nov 27, 2007 7:51 AM, Marvin Humphrey [EMAIL PROTECTED] wrote:


 On Nov 26, 2007, at 6:06 PM, Eswar K wrote:

  We essentially are looking at having an implementation for doing 
  search which can return documents having conceptually similar words 
  without necessarily having the original word searched for.

 Very challenging.  Say someone searches for LSA and hits an archived

 version of the mail you sent to this list.  LSA is a reasonably 
 discriminating term.  But so is Eswar.

 If you knew that the original term was LSA, then you might look for 
 documents near it in term vector space.  But if you don't know the 
 original term, only the content of the document, how do you know 
 whether you should look for docs near lsa or eswar?

 Marvin Humphrey
 Rectangular Research
 http://www.rectangular.com/





Re: LSA Implementation

2007-11-26 Thread Eswar K
The languages also include CJK :) among others.

- Eswar

On Nov 27, 2007 8:16 AM, Norskog, Lance [EMAIL PROTECTED] wrote:

 The WordNet project at Princeton (USA) is a large database of synonyms.
 If you're only working in English this might be useful instead of
 running your own analyses.

 http://en.wikipedia.org/wiki/WordNet
 http://wordnet.princeton.edu/

 Lance

 -Original Message-
 From: Eswar K [mailto:[EMAIL PROTECTED]
 Sent: Monday, November 26, 2007 6:34 PM
 To: solr-user@lucene.apache.org
 Subject: Re: LSA Implementation

 In addition to recording which keywords a document contains, the method
 examines the document collection as a whole, to see which other
 documents contain some of those same words. This algo should consider
 documents that have many words in common to be semantically close, and
 ones with few words in common to be semantically distant. This simple
 method correlates surprisingly well with how a human being, looking at
 content, might classify a document collection. Although the algorithm
 doesn't understand anything about what the words *mean*, the patterns it
 notices can make it seem astonishingly intelligent.

 When you search such an index, the search engine looks at similarity
 values it has calculated for every content word, and returns the
 documents that it thinks best fit the query. Because two documents may
 be semantically very close even if they do not share a particular
 keyword,

 Where a plain keyword search will fail if there is no exact match, this
 algo will often return relevant documents that don't contain the keyword
 at all.

 - Eswar

 On Nov 27, 2007 7:51 AM, Marvin Humphrey [EMAIL PROTECTED] wrote:

 
  On Nov 26, 2007, at 6:06 PM, Eswar K wrote:
 
   We essentially are looking at having an implementation for doing
   search which can return documents having conceptually similar words
   without necessarily having the original word searched for.
 
  Very challenging.  Say someone searches for LSA and hits an archived

  version of the mail you sent to this list.  LSA is a reasonably
  discriminating term.  But so is Eswar.
 
  If you knew that the original term was LSA, then you might look for
  documents near it in term vector space.  But if you don't know the
  original term, only the content of the document, how do you know
  whether you should look for docs near lsa or eswar?
 
  Marvin Humphrey
  Rectangular Research
  http://www.rectangular.com/
 
 
 



Re: LSA Implementation

2007-11-26 Thread Marvin Humphrey


On Nov 26, 2007, at 6:34 PM, Eswar K wrote:


Although the algorithm doesn't understand anything
about what the words *mean*, the patterns it notices can make it seem
astonishingly intelligent.

When you search such an index, the search engine looks at
similarity
values it has calculated for every content word, and returns the  
documents
that it thinks best fit the query. Because two documents may be  
semantically

very close even if they do not share a particular keyword,

Where a plain keyword search will fail if there is no exact match,  
this algo
will often return relevant documents that don't contain the keyword  
at all.


Perhaps I should have been less curt.  I've read a few papers on LSA,  
so I'm familiar at least in passing with everything you describe  
above.  It would be entertaining to write an implementation, and I've  
considered it... but it's a low priority while the patent's in force.


A full term-vector space calculation is... expensive :) ... so LSA
performs reduction.  Tuning the algorithm for a threshold effect not
just against "n words in common" but against a rough approximation of
"n words in common" is presumably non-trivial.


If you can either find or write open source software that pulls off  
such astonishingly intelligent matches despite the many challenges,  
kudos.  I'd love to see it.


Cheers,

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/