Lucene exposes these vectors as 'term vectors' (also called 'term frequency vectors'). The MoreLikeThis feature runs its queries against these (I think).
http://www.lucidimagination.com/search/?q=term+vectors
http://www.lucidimagination.com/search/?q=MoreLikeThis

On Mon, May 14, 2012 at 11:07 AM, Dmitry Kan <dmitry....@gmail.com> wrote:
> Peyman,
>
> Did you have a look at this?
>
> https://issues.apache.org/jira/browse/LUCENE-2959
>
> the pluggable ranking functions. Can be a good starting point for you.
>
> Dmitry
>
> On Mon, Apr 23, 2012 at 7:29 PM, Peyman Faratin <pey...@robustlinks.com> wrote:
>
>> Hi
>>
>> Has there been any work that tries to integrate Kernel methods [1] with
>> SOLR? I am interested in using kernel methods to solve synonym, hyponym and
>> polysemy (disambiguation) problems which SOLR's vector space model ("bag
>> of words") does not capture.
>>
>> For example, imagine we have only 3 words in our corpus, "puma", "cougar"
>> and "feline". The 3 words obviously have interdependencies (puma
>> disambiguates to cougar, cougar and puma are instances of felines -
>> hyponyms). Now, imagine 2 docs, d1 and d2, that have the following TF-IDF
>> vectors.
>>
>>            puma, cougar, feline
>> d1     = [    2,      0,      0]
>> d2     = [    0,      1,      0]
>>
>> i.e. d1 has no mention of the terms cougar or feline and conversely, d2 has
>> no mention of the terms puma or feline. Hence under the vector approach d1
>> and d2 are not related at all (and each interpretation of the terms has a
>> unique vector). Which is not what we want to conclude.
>>
>> What I need is to include a kernel matrix (as data) such as the following
>> that captures these relationships:
>>
>>            puma, cougar, feline
>> puma   = [    1,      1,    0.4]
>> cougar = [    1,      1,    0.4]
>> feline = [  0.4,    0.4,      1]
>>
>> then recompute the TF-IDF vector as a product of (1) the original vector
>> and (2) the kernel matrix, resulting in
>>
>>            puma, cougar, feline
>> d1     = [    2,      2,    0.8]
>> d2     = [    1,      1,    0.4]
>>
>> (note, the new vectors are much less sparse).
>>
>> I can solve this problem (inefficiently) at the application layer but I
>> was wondering if there have been any attempts within the community to solve
>> similar problems efficiently, without paying a hefty response time price?
>>
>> thank you
>>
>> Peyman
>>
>> [1] http://en.wikipedia.org/wiki/Kernel_methods
>
> --
> Regards,
>
> Dmitry Kan

--
Lance Norskog
goks...@gmail.com
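
For what it is worth, the arithmetic in Peyman's example can be checked with a short Python sketch. This is only an application-layer illustration of "TF-IDF vector times kernel matrix", not something Lucene or Solr provides. The helper names apply_kernel and cosine are made up for this example, and cosine similarity is assumed here as the relatedness measure.

```python
import math

# Term order used for every vector and for the kernel's rows and columns.
TERMS = ["puma", "cougar", "feline"]

# Kernel (term-to-term similarity) matrix from the example in the thread.
KERNEL = [
    [1.0, 1.0, 0.4],  # puma
    [1.0, 1.0, 0.4],  # cougar
    [0.4, 0.4, 1.0],  # feline
]


def apply_kernel(vec, kernel):
    """Multiply a row vector by the kernel matrix (vec x K)."""
    return [
        sum(vec[i] * kernel[i][j] for i in range(len(vec)))
        for j in range(len(kernel[0]))
    ]


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either one is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Original TF-IDF vectors from the thread.
d1 = [2.0, 0.0, 0.0]
d2 = [0.0, 1.0, 0.0]

# Kernel-smoothed vectors.
d1_k = apply_kernel(d1, KERNEL)
d2_k = apply_kernel(d2, KERNEL)

print("d1 x K =", d1_k)
print("d2 x K =", d2_k)
print("cosine before:", cosine(d1, d2))
print("cosine after: ", cosine(d1_k, d2_k))
```

Before the kernel is applied the two documents share no terms, so their cosine similarity is zero. Afterwards the vectors point in the same direction, because the puma and cougar rows of this particular kernel are identical.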