Peyman, did you have a look at this?

https://issues.apache.org/jira/browse/LUCENE-2959, the pluggable ranking functions. It could be a good starting point for you.

Dmitry

On Mon, Apr 23, 2012 at 7:29 PM, Peyman Faratin <pey...@robustlinks.com> wrote:
> Hi
>
> Has there been any work that tries to integrate kernel methods [1] with
> SOLR? I am interested in using kernel methods to solve synonym, hyponym,
> and polysemy (disambiguation) problems which SOLR's vector-space model
> ("bag of words") does not capture.
>
> For example, imagine we have only 3 words in our corpus: "puma", "cougar",
> and "feline". The 3 words have obvious interdependencies ("puma"
> disambiguates to "cougar"; cougars and pumas are instances of felines,
> i.e. hyponyms). Now, imagine 2 docs, d1 and d2, that have the following
> TF-IDF vectors:
>
>        puma, cougar, feline
> d1 = [    2,      0,      0]
> d2 = [    0,      1,      0]
>
> i.e. d1 has no mention of the terms "cougar" or "feline" and, conversely,
> d2 has no mention of the terms "puma" or "feline". Hence, under the vector
> approach, d1 and d2 are not related at all (and each interpretation of the
> terms has a unique vector), which is not what we want to conclude.
>
> What I need is to include a kernel matrix (as data), such as the following,
> that captures these relationships:
>
>            puma, cougar, feline
> puma   = [    1,      1,    0.4]
> cougar = [    1,      1,    0.4]
> feline = [  0.4,    0.4,      1]
>
> then recompute the TF-IDF vector as the product of (1) the original vector
> and (2) the kernel matrix, resulting in:
>
>        puma, cougar, feline
> d1 = [    2,      2,    0.8]
> d2 = [    1,      1,    0.4]
>
> (note that the new vectors are much less sparse).
>
> I can solve this problem (inefficiently) at the application layer, but I
> was wondering whether there have been any attempts within the community to
> solve similar problems efficiently, without paying a hefty response-time
> price?
>
> thank you
>
> Peyman
>
> [1] http://en.wikipedia.org/wiki/Kernel_methods

--
Regards,
Dmitry Kan
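The kernel reweighting described in the quoted email is just a vector-matrix product, v' = v . K. A minimal application-layer sketch in plain Python (term order, documents, and kernel values taken from the email; this is illustrative only, not Solr/Lucene code):

```python
# Sketch of the kernel reweighting from the email, at the application
# layer. Term order is [puma, cougar, feline].

def apply_kernel(vec, kernel):
    """Right-multiply a row vector by the kernel matrix: v' = v . K."""
    n = len(kernel)
    return [sum(vec[i] * kernel[i][j] for i in range(n)) for j in range(n)]

# Term-similarity (kernel) matrix: puma and cougar are fully related,
# both are partially related (0.4) to feline.
K = [
    [1.0, 1.0, 0.4],  # puma
    [1.0, 1.0, 0.4],  # cougar
    [0.4, 0.4, 1.0],  # feline
]

d1 = [2.0, 0.0, 0.0]  # mentions only "puma"
d2 = [0.0, 1.0, 0.0]  # mentions only "cougar"

print(apply_kernel(d1, K))  # [2.0, 2.0, 0.8]
print(apply_kernel(d2, K))  # [1.0, 1.0, 0.4]
```

After the transform, d1 and d2 have a non-zero dot product, so the two documents are now related despite sharing no literal terms, which is exactly the effect the email asks for.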