Accessing raw index data
Hi, I have just set up my first Solr 4.0 instance and have added about one million documents. I would like to access the raw data stored in the index. Can somebody give me a starting point for how to do that? As a first step, a simple dump would be absolutely OK. I just want to play around and do some static offline analysis. In the long term, I would probably like to implement custom search components to enrich my search results. So if there's no export for raw data, I would be happy to learn how to implement custom handlers and/or search components. Some guidance on where to start would be much appreciated. Kind regards, Achim
Re: Accessing raw index data
On 12 January 2013 01:06, Achim Domma do...@procoders.net wrote:
> Hi, I have just set up my first Solr 4.0 instance and have added about one million documents. I would like to access the raw data stored in the index. Can somebody give me a starting point for how to do that? As a first step, a simple dump would be absolutely OK. I just want to play around and do some static offline analysis. In the long term, I would probably like to implement custom search components to enrich my search results. So if there's no export for raw data, I would be happy to learn how to implement custom handlers and/or search components. Some guidance on where to start would be much appreciated.

It is not clear what you mean by raw data, and what level of customisation you are after. Here are two possibilities:
* At the base, Solr indexes are Lucene indexes, so one can always drop down to that level.
* Also, Solr allows plugins for various components. This link might be of help, depending on the extent of customisation you are after: http://wiki.apache.org/solr/SolrPlugins

Maybe you should approach this from the other end: if you could describe what you are trying to achieve, people might be able to offer possibilities.

Regards, Gora
Re: Accessing raw index data
On 11.01.2013 at 20:54, Gora Mohanty wrote:
> At the base, Solr indexes are Lucene indexes, so one can always drop down to that level.

That's what I'm looking for. I understand that, in the end, there has to be an inverted index (or rather several of them) holding all the words which occur in my documents, each word having a list of the documents it was part of. I would like to do some statistics based on this information, and to analyze how it changes if I change my text processing settings. If you could give me a starting point like "Data is stored in Lucene indexes, which are documented at XXX. In a request handler you can access the indexes via YYY.", I would be perfectly happy figuring out the rest on my own. Documentation about 4.0 is a bit limited, so it's hard to find an entry point.

Cheers, Achim
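The structure Achim describes (each word mapped to the list of documents that contain it) can be illustrated with a toy example. This is only a minimal sketch in plain Python, not how Lucene stores its index; the `tokenize` helper and the sample documents are made up for illustration, and the `lowercase` switch stands in for a change in text processing settings.

```python
import re
from collections import defaultdict


def tokenize(text, lowercase=True):
    """Split text into word tokens (a hypothetical stand-in for a Solr analysis chain)."""
    tokens = re.findall(r"\w+", text)
    return [t.lower() for t in tokens] if lowercase else tokens


def build_inverted_index(docs, lowercase=True):
    """Map each term to the sorted list of ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text, lowercase):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


# Made-up sample documents.
docs = {
    1: "Solr builds on Lucene",
    2: "Lucene stores an inverted index",
    3: "solr index",
}

index = build_inverted_index(docs)
case_sensitive = build_inverted_index(docs, lowercase=False)

print(index["lucene"])                  # documents containing "lucene"
print(len(index), len(case_sensitive))  # vocabulary size under each setting
```

Comparing the two dictionaries shows the kind of statistic in question: turning lowercasing off splits "Solr" and "solr" into separate terms, so the vocabulary grows and the posting lists shrink.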
Re: Accessing raw index data
On 12 January 2013 02:03, Achim Domma do...@procoders.net wrote:
> > At the base, Solr indexes are Lucene indexes, so one can always drop down to that level.
> That's what I'm looking for. I understand that, in the end, there has to be an inverted index (or rather several of them) holding all the words which occur in my documents, each word having a list of the documents it was part of. I would like to do some statistics based on this information, and to analyze how it changes if I change my text processing settings. If you could give me a starting point like "Data is stored in Lucene indexes, which are documented at XXX. In a request handler you can access the indexes via YYY.", I would be perfectly happy figuring out the rest on my own. Documentation about 4.0 is a bit limited, so it's hard to find an entry point.

Sadly, you have hit the limits of my knowledge: we have not yet had the need to delve into the details of Lucene indexes, but I am sure that others can fill in.

Regards, Gora
Re: Accessing raw index data
Have you looked at the Solr admin interface in detail? Specifically, the analysis section under each core. It provides some of the statistics you seem to want, and its source code shows how to create your own version of that. In particular, the Luke package is what you might be looking for.

Regards, Alex.
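If the "Luke package" here means Solr's own Luke request handler (usually exposed at `/admin/luke`), field statistics can also be requested over HTTP. The following is a rough sketch under that assumption: the core URL and field name are placeholders, and the flat `topTerms` layout in the sample payload is an assumption about the JSON output that should be checked against a real response.

```python
from urllib.parse import urlencode


def luke_url(core_url, field=None, num_terms=10):
    """Build a request URL for the Luke request handler of one core."""
    params = {"numTerms": num_terms, "wt": "json"}
    if field:
        params["fl"] = field  # restrict the report to a single field
    return core_url.rstrip("/") + "/admin/luke?" + urlencode(params)


def top_terms(payload, field):
    """Read a field's top terms, assuming a flat [term, count, term, count] list."""
    flat = payload.get("fields", {}).get(field, {}).get("topTerms", [])
    return dict(zip(flat[0::2], flat[1::2]))


# Placeholder core URL and field name; adjust to the actual installation.
url = luke_url("http://localhost:8983/solr/collection1", field="text", num_terms=5)
print(url)

# Hand-written sample payload, not real output.
sample = {"fields": {"text": {"topTerms": ["solr", 120, "lucene", 87]}}}
print(top_terms(sample, "text"))
```

Fetching the URL with any HTTP client and passing the decoded JSON to `top_terms` would complete the round trip.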
Re: Accessing raw index data
On 1/11/2013 1:33 PM, Achim Domma wrote:
> > At the base, Solr indexes are Lucene indexes, so one can always drop down to that level.
> That's what I'm looking for. I understand that, in the end, there has to be an inverted index (or rather several of them) holding all the words which occur in my documents, each word having a list of the documents it was part of. I would like to do some statistics based on this information, and to analyze how it changes if I change my text processing settings. If you could give me a starting point like "Data is stored in Lucene indexes, which are documented at XXX. In a request handler you can access the indexes via YYY.", I would be perfectly happy figuring out the rest on my own. Documentation about 4.0 is a bit limited, so it's hard to find an entry point.

There is the TermsComponent, which can be used in a terms request handler. The example solrconfig.xml included in every Solr download defines a /terms request handler.

http://wiki.apache.org/solr/TermsComponent

As you've already been told, there is a tool called Luke, but a version that works with Solr 4.0.0 is hard to find. The official download location only has a 4.0.0-ALPHA version, and there have been reported problems using it with indexes from the final Solr 4.0.0.

Thanks, Shawn
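A request against the /terms handler mentioned above might look like the sketch below. The base URL (`http://localhost:8983/solr/collection1`) and the field name `text` are assumptions about a default example setup, not something stated in the thread. The parsing assumes the JSON response lists terms and counts as one flat, alternating list, which is worth verifying against a real response.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def terms_url(base_url, field, limit=10, prefix=None):
    """Build a TermsComponent request URL for one field."""
    params = {"terms": "true", "terms.fl": field, "terms.limit": limit, "wt": "json"}
    if prefix:
        params["terms.prefix"] = prefix  # only return terms starting with this prefix
    return base_url.rstrip("/") + "/terms?" + urlencode(params)


def parse_terms(payload, field):
    """Turn the flat [term, count, term, count] list into (term, count) pairs."""
    flat = payload["terms"][field]
    return list(zip(flat[0::2], flat[1::2]))


def fetch_terms(base_url, field, limit=10):
    """Query a running Solr instance (requires network access to it)."""
    with urlopen(terms_url(base_url, field, limit)) as resp:
        return parse_terms(json.load(resp), field)


# Hand-written sample payload, so the parsing can be tried without a server.
sample = {"terms": {"text": ["solr", 120, "lucene", 87]}}
print(terms_url("http://localhost:8983/solr/collection1", "text"))
print(parse_terms(sample, "text"))
```

Each pair is a term and the number of documents containing it, so dumping the output for two different analyzer configurations gives a simple before-and-after comparison.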