On 1/11/2013 1:33 PM, Achim Domma wrote:
> "At the base, Solr indexes are Lucene indexes, so one can always
> drop down to that level."
>
> That's what I'm looking for. I understand that, in the end, there has to be an inverted index (or rather
> several of them) holding all the "words" which occur in my documents, each "word" having
> a list of the documents it was part of. I would like to do some statistics based on this
> information, and to analyze how it changes if I change my text processing settings, ...
>
> If you could give me a starting point like "Data is stored in Lucene indexes, which
> are documented at XXX. In a request handler you can access the indexes via YYY.", I
> would be perfectly happy figuring out the rest on my own. Documentation about 4.0 is a
> bit limited, so it's hard to find an entry point.

There is the TermsComponent, which can be utilized in a terms
requestHandler. The example solrconfig.xml found in all downloaded
copies of Solr has a /terms request handler.
http://wiki.apache.org/solr/TermsComponent
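
As a rough illustration of what querying that handler might look like from a script, here is a minimal Python sketch. It assumes the example configuration is running at http://localhost:8983/solr/collection1 and that the field is named "text"; both are guesses about your setup, not something taken from your message. The helper names (build_terms_url, parse_terms, fetch_terms) are made up for this example. It also assumes the default JSON layout, where each field's terms come back as a flat list alternating term and document count.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_terms_url(base_url, field, limit=10):
    """Build a /terms request URL for one field (helper name is hypothetical)."""
    params = {
        "terms": "true",       # enable the TermsComponent
        "terms.fl": field,     # field whose indexed terms should be listed
        "terms.limit": limit,  # maximum number of terms to return
        "wt": "json",          # ask for a JSON response
    }
    return base_url.rstrip("/") + "/terms?" + urlencode(params)


def parse_terms(response, field):
    """Turn the flat [term, count, term, count, ...] list into a dict.

    Assumes the default JSON layout for the terms section of the response.
    """
    flat = response["terms"][field]
    return dict(zip(flat[0::2], flat[1::2]))


def fetch_terms(base_url, field, limit=10):
    """Query a running Solr instance and return {term: document count}."""
    with urlopen(build_terms_url(base_url, field, limit)) as resp:
        return parse_terms(json.load(resp), field)


if __name__ == "__main__":
    # Assumed location of the example core; adjust to your installation.
    counts = fetch_terms("http://localhost:8983/solr/collection1", "text", limit=20)
    for term, doc_count in counts.items():
        print(term, doc_count)
```

Running the same query before and after changing your analysis chain (and reindexing) would give you two term/count tables to compare.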
As you've already been told, there is a tool called Luke, but a version
that works with Solr 4.0.0 is hard to find. The official download
location only has a 4.0.0-ALPHA version, and there have been reported
problems using it with indexes from the final Solr 4.0.0.
Thanks,
Shawn