Re: Text classification with Solr

Grant Ingersoll Tue, 27 Jan 2009 13:22:38 -0800

I guess I've been called to the chalkboard...

I haven't looked specifically at putting the taxonomy in Lucene/Solr, but it is an interesting idea. In reading the paper you mentioned, there are some interesting ideas there and Solr could obviously just as easily be used as Lucene, I think.

One of the things I am interested in is the marriage of Solr and Mahout (which has some Genetic Algorithms support) and other ML (Weka, etc.) tools. So, for instance, in the paper they have multiple indexes, one each for the negative and positive sets; that could be done with Solr cores or just through intelligent filtering. Then you could have Mahout do its training/clustering/whatever in the background as needed, just by sending commands to a ReqHandler, and output its model so that it can be shared on the "output" side. That way you can nicely serve up your results as part of search results or even standalone, so either as a SearchComponent or from the ReqHandler. Of course, the tricky part is in the implementation and managing the memory, threading, etc.
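
A rough sketch of the two ways to keep the positive and negative sets apart, assuming either a single index with a hypothetical `example_set` field or one core per set named `positive` and `negative`. The base URL, field name and core names are illustrative only and are not taken from the paper:

```python
from urllib.parse import urlencode


def filtered_query(base_url, q, example_set):
    """Single index: restrict the search to one set with a filter query (fq).

    'example_set' is a hypothetical field holding 'positive' or 'negative'.
    """
    params = urlencode({"q": q, "fq": f"example_set:{example_set}"})
    return f"{base_url}/select?{params}"


def core_query(base_url, q, example_set):
    """Multiple cores: each set lives in its own core, named after the set."""
    params = urlencode({"q": q})
    return f"{base_url}/{example_set}/select?{params}"


if __name__ == "__main__":
    base = "http://localhost:8983/solr"  # assumed local Solr instance
    print(filtered_query(base, "tags:java", "positive"))
    print(core_query(base, "tags:java", "negative"))
```

The filter-query variant keeps everything in one index, while the per-core variant trades that simplicity for separately managed indexes.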

Things that can help with all this: LukeReqHandler, TermVectorComponent, TermsComponent, and others.

As for Hannes' question about "Why Solr?", I think you can still get as close to the metal w/ Solr as with Lucene, but now you have the built-in framework that makes experimentation so much easier, IMO, plus you have all the features that Solr has to offer. For instance, a reasonable thing to do with the output from the classification is, of course, to facet on it.
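
To make the faceting idea concrete, here is a minimal sketch of the request parameters, assuming the classifier's labels have been written to a hypothetical `category` field. The `facet`, `facet.field`, `facet.limit` and `facet.mincount` parameters are standard Solr; the field name and the helper are made up for illustration:

```python
from urllib.parse import urlencode


def facet_params(user_query, label_field="category", limit=10):
    """Build a query string that returns only facet counts for the label field.

    'category' is an assumed field name for the stored classification output.
    """
    return urlencode([
        ("q", user_query),
        ("rows", 0),              # skip the documents, return counts only
        ("facet", "true"),
        ("facet.field", label_field),
        ("facet.limit", limit),
        ("facet.mincount", 1),    # hide labels with no matching documents
    ])


if __name__ == "__main__":
    print("/select?" + facet_params("solr"))
```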

Neal, what did you have in mind for a JIRA issue? I'd love to see a patch.



On Jan 26, 2009, at 12:29 PM, Neal Richter wrote:

Hey all,

 I'm in the process of implementing a system to do 'text
classification' with Solr.  The basic idea is to take an
ontology/taxonomy like dmoz of {label: "X", tags: "a,b,c,d,e"}, index
it and then classify documents into the taxonomy by pushing each parsed
document into the Solr search API.  Why?  Lucene/Solr's ability to do
weighted term boosting at both search and index time has lots of
obvious uses here.
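
A minimal sketch of the "push the parsed document into the search API" step, assuming the taxonomy is indexed with the `label` and `tags` fields from the example above. The tokenizer and the choice of raw term frequency as the boost are simplifications for illustration, not a description of the actual system:

```python
import re
from collections import Counter
from urllib.parse import urlencode


def build_boosted_query(text, field="tags", max_terms=10):
    """Turn document text into a Lucene query with one boosted clause per term."""
    terms = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(terms)
    # Most frequent terms first; ties broken alphabetically for stable output.
    top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:max_terms]
    return " OR ".join(f"{field}:{term}^{count}" for term, count in top)


def build_select_params(text, rows=5):
    """Query string asking Solr for the best-matching taxonomy labels."""
    return urlencode({
        "q": build_boosted_query(text),
        "fl": "label,score",  # only the label and its relevance score
        "rows": rows,
    })


if __name__ == "__main__":
    print(build_boosted_query("Solr solr Lucene"))
    print("/select?" + build_select_params("Solr solr Lucene"))
```

The top-scoring `label` values in the response would then be the candidate classes for the document, and nothing has to be written to the index to classify it.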

Has anyone worked on this or a similar project yet?  I've seen some
talk on the list about this area but it's pretty thin... December
thread "Taxonomy Support on Solr".  I'm assuming Grant Ingersoll is
looking at similar things with his 'taming text' project.

I store the 'documents' in another repository and they are far too
dynamic (write intensive) for direct indexing in Solr... so the
previously suggested procedure of 1) store document 2) execute
more-like-this and 3) delete document would be too slow.

If people are interested I could start a JIRA issue on this (I do not
see anything there at the moment).

Thanks - Neal Richter
http://aicoder.blogspot.com


--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
