I guess I've been called to the chalkboard...

I haven't looked specifically at putting the taxonomy in Lucene/Solr, but it is an interesting idea. The paper you mentioned has some interesting ideas in it, and I think Solr could be used just as easily as Lucene.

One of the things I am interested in is the marriage of Solr and Mahout (which has some Genetic Algorithms support) and other ML tools (Weka, etc.). For instance, the paper uses multiple indexes, one each for the negative and positive sets; that could be done with Solr cores or just through intelligent filtering. Then you could have Mahout do its training/clustering/whatever in the background as needed, just by sending commands to a RequestHandler, and have it output its model so that it can be shared on the "output" side. That way you can serve up your results as part of search results or even standalone, so either as a SearchComponent or from the RequestHandler. Of course, the tricky part is in the implementation and managing the memory, threading, etc.
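
To make the "intelligent filtering" option concrete, here is a minimal sketch of how one index could stand in for the paper's separate positive and negative indexes by using a filter query. The field name example_set, its values, and the function name are assumptions made up for illustration; q, fq, and rows are standard Solr request parameters.

```python
from urllib.parse import urlencode


def build_set_query(terms, example_set, rows=10):
    """Build a Solr /select query string restricted to one training set.

    `example_set` is a hypothetical field that tags each indexed document
    as belonging to the "positive" or "negative" set.
    """
    params = [
        ("q", " ".join(terms)),                 # free-text query terms
        ("fq", f"example_set:{example_set}"),   # filter to one set (assumed field)
        ("rows", str(rows)),
    ]
    return urlencode(params)


# Same terms, two logical "indexes" selected purely by filtering.
positive_qs = build_set_query(["lucene", "search"], "positive")
negative_qs = build_set_query(["lucene", "search"], "negative")
```

The same two requests could instead go to two separate cores; the filter version just keeps everything in one index at the cost of an extra field.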

Things that can help with all this: LukeRequestHandler, TermVectorComponent, TermsComponent, and others.

As for Hannes' question about "Why Solr": I think you can still get as close to the metal with Solr as with Lucene, but now you have the built-in framework that makes experimentation so much easier, IMO, plus you have all the features that Solr has to offer. For instance, a reasonable thing to do with the output from the classification is, of course, to facet on it.
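
As a rough illustration of the faceting point, the sketch below builds the request parameters for faceting on a field that holds the classifier's output. The field name category and the function name are assumptions; facet, facet.field, and facet.mincount are standard Solr faceting parameters.

```python
from urllib.parse import urlencode


def facet_params(user_query, label_field="category"):
    """Build a Solr query string that facets on the classifier's labels.

    `label_field` is a hypothetical field where the assigned taxonomy
    label was stored at index time.
    """
    return urlencode([
        ("q", user_query),
        ("facet", "true"),            # turn faceting on
        ("facet.field", label_field),  # count hits per assigned label
        ("facet.mincount", "1"),      # hide labels with no matching docs
    ])


query_string = facet_params("solr")
```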

Neal, what did you have in mind for a JIRA issue? I'd love to see a patch.


On Jan 26, 2009, at 12:29 PM, Neal Richter wrote:

Hey all,

 I'm in the process of implementing a system to do 'text
classification' with Solr.  The basic idea is to take an
ontology/taxonomy like dmoz of {label: "X", tags: "a,b,c,d,e"}, index
it and then classify documents into the taxonomy by pushing the parsed
document into the Solr search API.  Why?  Lucene/Solr's ability to do
weighted term boosting at both search and index time has lots of
obvious uses here.
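
A minimal sketch of the search-time half of this idea, assuming the parsed document has already been reduced to a term-to-weight map: each term becomes a boosted clause against the taxonomy's tags field, using the Lucene query syntax term^boost. The function name and the example weights are made up for illustration, and terms are assumed to be plain tokens with no query-syntax special characters that would need escaping.

```python
def boosted_query(term_weights, field="tags"):
    """Turn {term: weight} into a Lucene query such as tags:(a^2 b^1.5).

    Terms are emitted in descending weight order; weights use the
    shortest general float formatting (2.0 becomes "2").
    """
    ordered = sorted(term_weights.items(), key=lambda kv: -kv[1])
    clauses = [f"{term}^{weight:g}" for term, weight in ordered]
    return f"{field}:(" + " ".join(clauses) + ")"


# Hypothetical weights extracted from one parsed document.
query = boosted_query({"solr": 2.0, "index": 1.5, "the": 0.1})
```

The top-scoring taxonomy entries returned for such a query would then be the candidate labels for the document.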

Has anyone worked on this or a similar project yet?  I've seen some
talk on the list about this area but it's pretty thin... December
thread "Taxonomy Support on Solr".  I'm assuming Grant Ingersoll is
looking at similar things with his 'taming text' project.

I store the 'documents' in another repository and they are far too
dynamic (write intensive) for direct indexing in Solr... so the
previously suggested procedure of 1) store document 2) execute
more-like-this and 3) delete document would be too slow.

If people are interested I could start a JIRA issue on this (I do not
see anything there at the moment).

Thanks - Neal Richter
http://aicoder.blogspot.com

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ










