I guess I've been called to the chalkboard...
I haven't looked specifically at putting the taxonomy in Lucene/Solr,
but it is an interesting idea. The paper you mentioned has some good
ideas in it, and I think Solr could be used just as easily as Lucene.
One of the things I am interested in is the marriage of Solr with
Mahout (which has some Genetic Algorithms support) and other ML tools
(Weka, etc.). For instance, the paper uses multiple indexes, one each
for the negative and positive sets. That could be done with Solr cores
or just through intelligent filtering. Then you could have Mahout do
its training/clustering/whatever in the background as needed, just by
sending commands to a RequestHandler, and have it output its model so
that it can be shared on the "output" side. That way you can nicely
serve up your results as part of search results or even standalone,
so either as a SearchComponent or from the RequestHandler. Of course,
the tricky part is in the implementation and managing the memory,
threading, etc.
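To make the "intelligent filtering" option concrete, here is a rough sketch of keeping both training sets in one index and selecting one of them with a filter query. The host, the handler path, and the train_set field are assumptions made up for illustration; only the q, fq, fl and wt parameters are standard Solr.

```python
from urllib.parse import urlencode

# Assumed for illustration: a local Solr and a hypothetical "train_set" field
# whose value is "positive" or "negative" for each indexed training document.
SOLR_SELECT = "http://localhost:8983/solr/select"


def training_set_query(terms, train_set):
    """Build a select URL restricted to one training set via a filter query."""
    params = [
        ("q", " ".join(terms)),
        ("fq", "train_set:" + train_set),  # filter only, does not affect scoring
        ("fl", "id,score"),
        ("wt", "json"),
    ]
    return SOLR_SELECT + "?" + urlencode(params)


pos_url = training_set_query(["lucene", "search"], "positive")
neg_url = training_set_query(["lucene", "search"], "negative")
print(pos_url)
print(neg_url)
```

The same two queries could instead go to two separate cores; the filter version just avoids managing a second index.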
Things that can help with all this: LukeRequestHandler,
TermVectorComponent, TermsComponent, and others.
As for Hannes' question about "Why Solr?", I think you can get just
as close to the metal with Solr as with Lucene, but now you have the
built-in framework that makes experimentation so much easier, IMO,
plus you have all the features that Solr has to offer. For instance,
a reasonable thing to do with the output from the classification is,
of course, to facet on it.
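As a sketch of that faceting idea, assuming the classifier's label has been written into a field on each document (the field name "category" and the host are hypothetical), the request only needs the standard facet parameters:

```python
from urllib.parse import urlencode


def facet_on_labels(query, label_field="category"):
    """Build a select URL that returns facet counts per classifier label."""
    params = [
        ("q", query),
        ("rows", "0"),               # counts only, no documents
        ("facet", "true"),
        ("facet.field", label_field),  # hypothetical field holding the label
        ("facet.mincount", "1"),     # skip labels with no matches
    ]
    # Host and handler path are assumptions for illustration.
    return "http://localhost:8983/solr/select?" + urlencode(params)


facet_url = facet_on_labels("solr")
print(facet_url)
```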
Neal, what did you have in mind for a JIRA issue? I'd love to see a
patch.
On Jan 26, 2009, at 12:29 PM, Neal Richter wrote:
Hey all,
I'm in the process of implementing a system to do 'text
classification' with Solr. The basic idea is to take an
ontology/taxonomy like dmoz of {label: "X", tags: "a,b,c,d,e"}, index
it and then classify documents into the taxonomy by pushing the parsed
document into the Solr search API. Why? Lucene/Solr's ability to do
weighted term boosting at both search and index time has lots of
obvious uses here.
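A minimal sketch of the query side of this, under assumptions that are not in the description above: the taxonomy entries are indexed with their tags in a field called "tags", the document is reduced to its most frequent terms, and raw term counts are used as boosts. A real system would likely drop stopwords and use a better weighting.

```python
import re
from collections import Counter


def boosted_query(text, field="tags", top_n=5):
    """Turn raw document text into a Lucene query with per-term boosts."""
    # Lowercased alphanumeric tokens only, so no query-syntax escaping is
    # needed and reserved words such as AND/OR/NOT cannot appear.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(tokens)
    # Highest count first; ties broken alphabetically for stable output.
    top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
    if not top:
        return ""  # nothing to search for
    clauses = ["%s^%d" % (term, count) for term, count in top]
    return "%s:(%s)" % (field, " ".join(clauses))


q = boosted_query("solr search solr index lucene search solr", top_n=3)
print(q)  # tags:(solr^3 search^2 index^1)
```

The resulting string would be sent as the q parameter, and the top-scoring taxonomy entries are the candidate labels for the document.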
Has anyone worked on this or a similar project yet? I've seen some
talk on the list about this area but it's pretty thin, e.g. the
December thread "Taxonomy Support on Solr". I'm assuming Grant
Ingersoll is looking at similar things with his 'taming text' project.
I store the 'documents' in another repository and they are far too
dynamic (write intensive) for direct indexing in Solr... so the
previously suggested procedure of 1) store document 2) execute
more-like-this and 3) delete document would be too slow.
If people are interested I could start a JIRA issue on this (I do not
see anything there at the moment).
Thanks - Neal Richter
http://aicoder.blogspot.com
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ