Accessing raw index data

2013-01-11 Thread Achim Domma
Hi,

I have just set up my first Solr 4.0 instance and have added about one million
documents. I would like to access the raw data stored in the index. Can
somebody give me a starting point for how to do that?

As a first step, a simple dump would be absolutely fine. I just want to play
around and do some static offline analysis. In the long term, I would probably
like to implement custom search components to enrich my search results. So if
there is no export for raw data, I would be happy to learn how to implement
custom handlers and/or search components. Some guidance on where to start would
be much appreciated.

kind regards,
Achim

Re: Accessing raw index data

2013-01-11 Thread Gora Mohanty
It is not clear what you mean by "raw data", or what level of
customisation you are after. Here are two possibilities:
* At the base, Solr indexes are Lucene indexes, so one can always
  drop down to that level.
* Also, Solr allows plugins for various components. This link might
  be of help, depending on the extent of customisation you are after:
  http://wiki.apache.org/solr/SolrPlugins

Maybe you should approach this from the other end: If you could
describe what you are trying to achieve, people might be able to
offer possibilities.

Regards,
Gora


Re: Accessing raw index data

2013-01-11 Thread Achim Domma
> At the base, Solr indexes are Lucene indexes, so one can always
> drop down to that level.

That's what I'm looking for. I understand that, in the end, there has to be an
inverted index (or rather several of them) holding all the words which occur
in my documents, each word having a list of the documents it was part of.
I would like to do some statistics based on this information, and to
analyze how it changes if I change my text processing settings, ...
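
To make the structure described above concrete, here is a toy sketch in
Python. It is not how Lucene stores its data on disk; it only illustrates the
idea of a term-to-documents mapping and one simple statistic (document
frequency) computed from it. The sample documents and the whitespace
"analysis" are assumptions made up for the illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive stand-in for an analyzer: lowercase, split on whitespace.
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def document_frequencies(index):
    """Number of documents each term occurs in."""
    return {term: len(ids) for term, ids in index.items()}


# Hypothetical sample data, only for illustration.
docs = {
    1: "Solr builds on Lucene",
    2: "Lucene stores an inverted index",
    3: "Solr exposes the index",
}
index = build_inverted_index(docs)
freqs = document_frequencies(index)
```

Changing the "analysis" step (for example, dropping the lowercasing) changes
the set of terms and their frequencies, which is the kind of effect I would
like to measure on the real index.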

If you could give me a starting point like "Data is stored in Lucene indexes,
which are documented at XXX. In a request handler you can access the indexes
via YYY.", I would be perfectly happy to figure out the rest on my own.
Documentation about 4.0 is a bit limited, so it's hard to find an entry point.

cheers,
Achim



Re: Accessing raw index data

2013-01-11 Thread Gora Mohanty
Sadly, you have hit the limits of my knowledge: We
have not yet had the need to delve into details of
Lucene indexes, but I am sure that others can fill in.

Regards,
Gora


Re: Accessing raw index data

2013-01-11 Thread Alexandre Rafalovitch
Have you looked at the Solr admin interface in detail? Specifically, the
Analysis section under each core. It provides some of the statistics you seem
to want, and it gives you source code to look at to understand how to create
your own version of that. In particular, the Luke package is what you
might be looking for.
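
If the Luke package here means Solr's built-in LukeRequestHandler (normally
reachable at /admin/luke), it can also be queried over HTTP. A minimal sketch
in Python that only builds the request URL; the host, port, core name
(collection1) and field name (text) are assumptions about a stock example
install and will likely differ on your setup.

```python
from urllib.parse import urlencode


def luke_url(base_url, field, num_terms=10):
    """Build a LukeRequestHandler URL asking for the top terms of one field."""
    params = {
        "fl": field,            # restrict the report to this field
        "numTerms": num_terms,  # how many top terms to list per field
        "wt": "json",           # response format
    }
    return base_url.rstrip("/") + "/admin/luke?" + urlencode(params)


# Assumed core URL of a default example install.
url = luke_url("http://localhost:8983/solr/collection1", "text", 5)
```

Opening the resulting URL in a browser (or fetching it with
urllib.request.urlopen) should return per-field index statistics.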

Regards,
   Alex.




Re: Accessing raw index data

2013-01-11 Thread Shawn Heisey

There is the TermsComponent, which can be utilized in a terms 
requestHandler.  The example solrconfig.xml found in all downloaded 
copies of Solr has a /terms request handler.


http://wiki.apache.org/solr/TermsComponent
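
As a rough sketch of how the /terms handler could be used for offline
statistics, the Python below builds a request URL and turns the JSON response
into a term-to-document-frequency dict. The host, port, core name
(collection1) and field name (text) are assumptions, and the parser assumes
the default flat JSON layout, where terms and counts alternate in one list.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def terms_url(base_url, field, limit=10):
    """Build a TermsComponent request URL for one field."""
    params = {
        "terms": "true",       # enable the component explicitly
        "terms.fl": field,     # field whose terms should be listed
        "terms.limit": limit,  # maximum number of terms to return
        "wt": "json",
    }
    return base_url.rstrip("/") + "/terms?" + urlencode(params)


def parse_terms(response, field):
    """Convert the flat [term, count, term, count, ...] list into a dict."""
    flat = response["terms"][field]
    return dict(zip(flat[0::2], flat[1::2]))


def fetch_terms(base_url, field, limit=10):
    """Query a running Solr instance (needs network access to the server)."""
    with urlopen(terms_url(base_url, field, limit)) as resp:
        return parse_terms(json.load(resp), field)
```

For example, fetch_terms("http://localhost:8983/solr/collection1", "text", 100)
would return the first 100 terms of the assumed text field with their
document counts, provided the handler is enabled in solrconfig.xml.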

As you've already been told, there is a tool called Luke, but a version 
that works with Solr 4.0.0 is hard to find.  The official download 
location only has a 4.0.0-ALPHA version, and there have been reported 
problems using it with indexes from the final Solr 4.0.0.


Thanks,
Shawn