Re: How to retrieve the full corpus
If you want to do a mass scan of an index, the most scalable way is to write a variation of the Lucene CheckIndex program. Unfortunately, CheckIndex does not know any of the Solr types. But first, you should try the above techniques, because they are much, much easier.

On Mon, Sep 6, 2010 at 7:59 AM, Markus Jelsma markus.jel...@buyways.nl wrote:
> You can use Luke to inspect a Lucene index. Check the schema browser in your Solr admin interface for an example.
>
> On Monday 06 September 2010 16:52:03 Roland Villemoes wrote:
>> Hi All, How can I retrieve all words from a Solr core? I need a list of all the words and how often they occur in the index.

--
Lance Norskog
goks...@gmail.com
How to retrieve the full corpus
Hi All,

How can I retrieve all words from a Solr core? I need a list of all the words and how often they occur in the index.

Best regards,
Roland Villemoes
Tel: (+45) 22 69 59 62
E-Mail: r...@alpha-solutions.dk

Alpha Solutions A/S
Borgergade 2, 3.sal, 1300 København K
Tel: (+45) 70 20 65 38
Web: http://www.alpha-solutions.dk/

This message including any attachments may contain confidential and/or privileged information intended only for the person or entity to which it is addressed. If you are not the intended recipient you should delete this message. Any printing, copying, distribution or other use of this message is strictly prohibited. If you have received this message in error, please notify the sender immediately by telephone or e-mail and delete all copies of this message and any attachments from your system. Thank you.
Re: How to retrieve the full corpus
You might check out Luke, the Lucene Index Toolbox: http://www.getopt.org/luke/

I know you can browse the index and get frequency counts, though I'm not sure whether you can export the entire index as a list like the one you're looking for.

Hope this helps,
Mike

On Mon, Sep 6, 2010 at 10:52 AM, Roland Villemoes r...@alpha-solutions.dk wrote:
> Hi All, How can I retrieve all words from a Solr core? I need a list of all the words and how often they occur in the index.
Re: How to retrieve the full corpus
On Mon, Sep 6, 2010 at 10:52 AM, Roland Villemoes r...@alpha-solutions.dk wrote:
> How can I retrieve all words from a Solr core? I need a list of all the words and how often they occur in the index.

http://wiki.apache.org/solr/TermsComponent

It doesn't currently stream though, so requesting *all* at once might take too much memory. One workaround is to page via terms.lower and terms.limit.

Perhaps we should consider adding streaming to the terms component though. Would you mind opening a JIRA issue?

-Yonik
http://lucenerevolution.org Lucene/Solr Conference, Boston Oct 7-8
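For anyone who wants to try the paging workaround via terms.lower and terms.limit, here is a rough Python sketch. It is not an official client, and it makes several assumptions that are not stated in this thread: the handler URL passed in is hypothetical (use whatever path your solrconfig.xml maps to the TermsComponent), terms.sort=index is set so that the last term of each page can serve as the next terms.lower, and json.nl=map is requested so each page comes back as a {"term": count} object. Also note that the counts the TermsComponent reports are document frequencies (how many documents contain the term), which may or may not be the "how often" you need.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def fetch_json(url):
    """Fetch a URL and decode the JSON body (real HTTP; swap out for testing)."""
    with urlopen(url) as response:
        return json.load(response)


def iter_terms(handler_url, field, page_size=1000, fetch=fetch_json):
    """Yield (term, count) pairs for one field, one page at a time.

    handler_url is a hypothetical TermsComponent handler URL; page_size
    is sent as terms.limit and must be at least 1.
    """
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    lower = None  # last term of the previous page, used as the cursor
    while True:
        params = {
            "terms": "true",
            "terms.fl": field,
            "terms.limit": page_size,
            "terms.sort": "index",  # index order, so terms.lower works as a cursor
            "wt": "json",
            "json.nl": "map",       # assumed: each page is a {"term": count} object
        }
        if lower is not None:
            params["terms.lower"] = lower
            params["terms.lower.incl"] = "false"  # do not repeat the cursor term
        page = fetch(handler_url + "?" + urlencode(params))["terms"][field]
        for term, count in page.items():
            yield term, count
        if len(page) < page_size:
            return  # a short (or empty) page means the term list is exhausted
        lower = term  # safe: a full page has at least one term
```

Looping over iter_terms("http://localhost:8983/solr/terms", "text") (both the URL and the field name are placeholders) would then walk the whole term list while holding only one page in memory at a time. The page size is a trade-off between the number of requests and the size of each response.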
Re: How to retrieve the full corpus
On 2010-09-06 17:15, Yonik Seeley wrote:
> On Mon, Sep 6, 2010 at 10:52 AM, Roland Villemoes r...@alpha-solutions.dk wrote:
>> How can I retrieve all words from a Solr core? I need a list of all the words and how often they occur in the index.
>
> http://wiki.apache.org/solr/TermsComponent
>
> It doesn't currently stream though, so requesting *all* at once might take too much memory. One workaround is to page via terms.lower and terms.limit.
>
> Perhaps we should consider adding streaming to the terms component though. Would you mind opening a JIRA issue?

This would also be nice for building a spellchecker in another core (instead of using the current sub-index hack).

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: How to retrieve the full corpus
You can use Luke to inspect a Lucene index. Check the schema browser in your Solr admin interface for an example.

On Monday 06 September 2010 16:52:03 Roland Villemoes wrote:
> Hi All, How can I retrieve all words from a Solr core? I need a list of all the words and how often they occur in the index.

Markus Jelsma - Technical Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350