Re: how to retrieve total token count per collection/index

2012-08-20 Thread tech.vronk
On 09.08.2012 18:02, Robert Muir wrote: On Thu, Aug 9, 2012 at 10:20 AM, tech.vronk t...@vronk.net wrote: Hello, I wonder how to figure out the total token count in a collection (per index), i.e. the size of a corpus/collection measured in tokens. You want to use this statistic, which

how to retrieve total token count per collection/index

2012-08-09 Thread tech.vronk
Hello, I wonder how to figure out the total token count in a collection (per index), i.e. the size of a corpus/collection measured in tokens. The statistics in /admin give the number of distinct terms, and the frequency list per index reveals the number of documents containing a given term. So even

Re: how to retrieve total token count per collection/index

2012-08-09 Thread Walter Underwood
For a rough estimate, square the number of unique terms to get the total number of tokens. Vocabulary size usually grows as the square root of the corpus size in words. wunder On Aug 9, 2012, at 7:20 AM, tech.vronk wrote: Hello, I wonder how to figure out the total token count in a collection (per
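
The rule of thumb in this reply can be written down as a small sketch. It assumes that vocabulary size is roughly the square root of the corpus size in tokens, with no scaling constant, so it only gives an order of magnitude. The function names are made up for illustration and are not part of Solr or Lucene.

```python
import math


def estimate_total_tokens(unique_terms: int) -> int:
    """Rough corpus size in tokens, assuming vocabulary ~ sqrt(total tokens)."""
    if unique_terms < 0:
        raise ValueError("unique_terms must be non-negative")
    # Inverting V ~ sqrt(N) gives N ~ V^2.
    return unique_terms ** 2


def expected_vocabulary(total_tokens: int) -> int:
    """Inverse direction: vocabulary size expected for a corpus of this many tokens."""
    if total_tokens < 0:
        raise ValueError("total_tokens must be non-negative")
    # Integer square root, rounded down.
    return math.isqrt(total_tokens)


if __name__ == "__main__":
    # Hypothetical input: 1,000 distinct terms, as one might read off /admin.
    print(estimate_total_tokens(1_000))
```

Under this assumption, an index with 1,000 distinct terms would be estimated at about one million tokens. Real collections can deviate a lot from this, so an exact count still needs a statistic taken from the index itself.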

Re: how to retrieve total token count per collection/index

2012-08-09 Thread Robert Muir
On Thu, Aug 9, 2012 at 10:20 AM, tech.vronk t...@vronk.net wrote: Hello, I wonder how to figure out the total token count in a collection (per index), i.e. the size of a corpus/collection measured in tokens. You want to use this statistic, which tells you number of tokens for an indexed

Re: how to retrieve total token count per collection/index

2012-08-09 Thread tech.vronk
On 09.08.2012 18:02, Robert Muir wrote: On Thu, Aug 9, 2012 at 10:20 AM, tech.vronk t...@vronk.net wrote: Hello, I wonder how to figure out the total token count in a collection (per index), i.e. the size of a corpus/collection measured in tokens. You want to use this statistic, which

Re: how to retrieve total token count per collection/index

2012-08-09 Thread Robert Muir
On Thu, Aug 9, 2012 at 4:24 PM, tech.vronk t...@vronk.net wrote: Is there any 3.6 equivalent for this, before I install and run 4.0? I can't seem to find a corresponding class (org.apache.lucene.index.Terms) in 3.6. Unfortunately, 3.6 does not carry this statistic; there is really no clear