Re: How to retrieve the full corpus

2010-09-08 Thread Lance Norskog
If you want to do a mass scan of an index, the most scalable way is to
make a variation of the Lucene CheckIndex program. Unfortunately,
CheckIndex does not know any of the Solr types.

But first, try the techniques suggested earlier in this thread, because
they are much, much easier.

On Mon, Sep 6, 2010 at 7:59 AM, Markus Jelsma markus.jel...@buyways.nl wrote:
 You can use Luke to inspect a Lucene index. Check the schema browser in your
 Solr admin interface for an example.

 On Monday 06 September 2010 16:52:03 Roland Villemoes wrote:
 Hi All,

 How can I retrieve all words from a Solr core?
 I need a list of all the words and how often they occur in the index.

 med venlig hilsen/best regards

 Roland Villemoes
 Tel: (+45) 22 69 59 62
 E-Mail: mailto:r...@alpha-solutions.dk

 Alpha Solutions A/S
 Borgergade 2, 3.sal, 1300 København K
 Tel: (+45) 70 20 65 38
 Web: http://www.alpha-solutions.dk/

 ** This message including any attachments may contain confidential and/or
  privileged information intended only for the person or entity to which it
  is addressed. If you are not the intended recipient you should delete this
  message. Any printing, copying, distribution or other use of this message
  is strictly prohibited. If you have received this message in error, please
  notify the sender immediately by telephone, or e-mail and delete all
  copies of this message and any attachments from your system. Thank you.


 Markus Jelsma - Technisch Architect - Buyways BV
 http://www.linkedin.com/in/markus17
 050-8536620 / 06-50258350





-- 
Lance Norskog
goks...@gmail.com


How to retrieve the full corpus

2010-09-06 Thread Roland Villemoes
Hi All,

How can I retrieve all words from a Solr core?
I need a list of all the words and how often they occur in the index.

med venlig hilsen/best regards

Roland Villemoes
Tel: (+45) 22 69 59 62
E-Mail: mailto:r...@alpha-solutions.dk

Alpha Solutions A/S
Borgergade 2, 3.sal, 1300 København K
Tel: (+45) 70 20 65 38
Web: http://www.alpha-solutions.dk/




Re: How to retrieve the full corpus

2010-09-06 Thread mike anderson
You might check out Luke, the Lucene Index Toolbox.

http://www.getopt.org/luke/

I know you can browse the index and get frequency counts, though I'm not
sure if you can export the entire index as a list like what you're looking
for.

Hope this helps,
Mike

On Mon, Sep 6, 2010 at 10:52 AM, Roland Villemoes r...@alpha-solutions.dk wrote:

 Hi All,

 How can I retrieve all words from a Solr core?
 I need a list of all the words and how often they occur in the index.

 med venlig hilsen/best regards

 Roland Villemoes
 Tel: (+45) 22 69 59 62
 E-Mail: mailto:r...@alpha-solutions.dk

 Alpha Solutions A/S
 Borgergade 2, 3.sal, 1300 København K
 Tel: (+45) 70 20 65 38
 Web: http://www.alpha-solutions.dk/





Re: How to retrieve the full corpus

2010-09-06 Thread Yonik Seeley
On Mon, Sep 6, 2010 at 10:52 AM, Roland Villemoes r...@alpha-solutions.dk 
wrote:
 How can I retrieve all words from a Solr core?
 I need a list of all the words and how often they occur in the index.

http://wiki.apache.org/solr/TermsComponent

It doesn't currently stream though, so requesting *all* at once might
take too much memory.  One workaround is to page via terms.lower and
terms.limit.
Perhaps we should consider adding streaming to the terms component
though.  Would you mind opening a JIRA issue?
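
The paging workaround could look roughly like the Python sketch below. It
is only an illustration: the handler URL, the field name, and the helper
names fetch_json and iter_terms are assumptions, not part of Solr. It asks
for index order (terms.sort=index) and passes the last term of each page
back as terms.lower with terms.lower.incl=false, so no term comes back
twice. The parsing assumes the flat [term, count, term, count, ...] JSON
layout.

```python
import json
import urllib.parse
import urllib.request


def fetch_json(url):
    """Fetch a URL and decode the JSON body."""
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))


def iter_terms(base_url, field, page_size=1000, fetch=fetch_json):
    """Yield (term, count) pairs for `field`, one page at a time.

    base_url is assumed to point at a request handler that has the
    TermsComponent enabled, e.g. http://localhost:8983/solr/terms
    (hypothetical; adjust to your own setup).
    """
    lower = None
    while True:
        params = {
            "terms": "true",
            "terms.fl": field,
            "terms.limit": str(page_size),
            "terms.sort": "index",  # index order, so terms.lower can resume
            "wt": "json",
            "json.nl": "flat",      # assumed layout: [term, count, ...]
        }
        if lower is not None:
            # Resume strictly after the last term of the previous page.
            params["terms.lower"] = lower
            params["terms.lower.incl"] = "false"

        data = fetch(base_url + "?" + urllib.parse.urlencode(params))
        flat = data["terms"][field]
        pairs = list(zip(flat[0::2], flat[1::2]))
        if not pairs:
            return
        yield from pairs
        if len(pairs) < page_size:
            return  # short page: nothing left after this one
        lower = pairs[-1][0]
```

Because iter_terms is a generator, only one page of terms is held in memory
at a time on the client side. The fetch argument is there so the HTTP call
can be swapped out, for example when trying the loop without a running Solr.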

-Yonik
http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8


Re: How to retrieve the full corpus

2010-09-06 Thread Andrzej Bialecki

On 2010-09-06 17:15, Yonik Seeley wrote:

On Mon, Sep 6, 2010 at 10:52 AM, Roland Villemoes r...@alpha-solutions.dk wrote:

How can I retrieve all words from a Solr core?
I need a list of all the words and how often they occur in the index.


http://wiki.apache.org/solr/TermsComponent

It doesn't currently stream though, so requesting *all* at once might
take too much memory.  One workaround is to page via terms.lower and
terms.limit.
Perhaps we should consider adding streaming to the terms component
though.  Would you mind opening a JIRA issue?


This would be nice also for building a spellchecker in another core 
(instead of using the current sub-index hack).



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: How to retrieve the full corpus

2010-09-06 Thread Markus Jelsma
You can use Luke to inspect a Lucene index. Check the schema browser in your 
Solr admin interface for an example.
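
As far as I know, the schema browser gets its numbers from Solr's Luke
request handler, which can also be queried directly. Below is a minimal
Python sketch, assuming the handler is reachable at /admin/luke and the
field is called "text" (both are assumptions; luke_url and top_terms are
hypothetical helper names). Note that numTerms only returns the top N terms
per field, so this suits a quick look rather than a complete dump.

```python
import json
import urllib.parse
import urllib.request


def luke_url(base_url, field, num_terms=100):
    """Build a Luke request handler URL asking for the top terms of a field."""
    params = {"fl": field, "numTerms": str(num_terms), "wt": "json"}
    return base_url + "?" + urllib.parse.urlencode(params)


def top_terms(luke_response, field):
    """Extract (term, count) pairs from a decoded Luke JSON response.

    Assumes topTerms comes back as a flat [term, count, ...] list.
    """
    flat = luke_response["fields"][field].get("topTerms", [])
    return list(zip(flat[0::2], flat[1::2]))


def fetch_top_terms(base_url, field, num_terms=100):
    """Query a live Solr instance (base_url is an assumption, see above)."""
    with urllib.request.urlopen(luke_url(base_url, field, num_terms)) as resp:
        return top_terms(json.loads(resp.read().decode("utf-8")), field)
```

Calling fetch_top_terms("http://localhost:8983/solr/admin/luke", "text", 50)
would then return up to 50 (term, count) pairs for that field, provided the
URL matches your installation.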

On Monday 06 September 2010 16:52:03 Roland Villemoes wrote:
 Hi All,
 
 How can I retrieve all words from a Solr core?
 I need a list of all the words and how often they occur in the index.
 
 med venlig hilsen/best regards
 
 Roland Villemoes
 Tel: (+45) 22 69 59 62
 E-Mail: mailto:r...@alpha-solutions.dk
 
 Alpha Solutions A/S
 Borgergade 2, 3.sal, 1300 København K
 Tel: (+45) 70 20 65 38
 Web: http://www.alpha-solutions.dk/
 
 

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350