Re: Total number of terms in an index?

2010-07-28 Thread Jason Rutherglen
Tom,

The total number of terms... Ah well, not a big deal. However, yes, the
flex branch does expose this, so we can show it in Solr at some
point, hopefully outside of Solr's Luke implementation.





Re: Total number of terms in an index?

2010-07-28 Thread Jonathan Rochkind
At first I was thinking the TermsComponent might give you this, but 
oddly it seems not to.


http://wiki.apache.org/solr/TermsComponent




RE: Total number of terms in an index?

2010-07-27 Thread Burton-West, Tom
Hi Jason,

Are you looking for the total number of unique terms or total number of term 
occurrences?

CheckIndex reports both, but does a bunch of other work so is probably not the 
fastest.

If you are looking for total number of term occurrences, you might look at 
contrib/org/apache/lucene/misc/HighFreqTerms.java.
 
If you are just looking for the total number of unique terms, I wonder if there 
is some low level API that would allow you to just access the in-memory 
representation of the tii file and then multiply the number of terms in it by 
your term index interval (default 128), and by your indexDivisor if you have 
set one. I haven't dug into the code so I don't actually know how the tii file 
gets loaded into a data structure in memory.  If there is API access, it seems 
like this might be the quickest way to get an estimate of the number of unique 
terms.  (Of course you would have to do this for each segment).

Tom
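
A rough sketch of the arithmetic suggested above, in Java. It assumes the term index holds about one entry per 128 terms of a segment, and that a divisor applied at load time thins the in-memory copy further. The class, method, and parameter names are hypothetical and are not Lucene API; the entry counts are made-up inputs. The result is only an estimate per segment (off by up to roughly one interval), and summing it over segments still counts a term once for every segment it appears in.

```java
final class TermCountEstimate {
    /**
     * Estimated number of terms in one segment.
     * loadedIndexEntries: entries of the term index held in memory (assumed input).
     * termIndexInterval:  one index entry per this many terms (128 assumed by default).
     * indexDivisor:       only every Nth index entry is loaded (1 means all of them).
     */
    static long estimateSegmentTerms(long loadedIndexEntries, int termIndexInterval, int indexDivisor) {
        return loadedIndexEntries * termIndexInterval * indexDivisor;
    }

    /** Sum of the per-segment estimates; terms shared between segments are counted repeatedly. */
    static long estimateAcrossSegments(long[] loadedEntriesPerSegment, int termIndexInterval, int indexDivisor) {
        long total = 0;
        for (long entries : loadedEntriesPerSegment) {
            total += estimateSegmentTerms(entries, termIndexInterval, indexDivisor);
        }
        return total;
    }

    public static void main(String[] args) {
        long[] perSegment = {10_000, 2_500};   // hypothetical in-memory entry counts
        System.out.println(estimateSegmentTerms(10_000, 128, 1));       // 1280000
        System.out.println(estimateAcrossSegments(perSegment, 128, 1)); // 1600000
    }
}
```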



Re: Total number of terms in an index?

2010-07-27 Thread Michael McCandless
In trunk (flex) you can ask each segment for its unique term count.

But computing the unique term count across all segments is
necessarily costly (it requires merging them, to de-dup), as Hoss
described.

Mike
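
To illustrate why the cross-segment count is costly, here is a minimal Java sketch of the merge-and-de-dup step. It does not use the Lucene API: each segment is modeled as an already sorted list of its unique terms, and the class and method names are hypothetical. The point is that every term of every segment is visited once, so the cost grows with the total number of terms rather than with the number of segments.

```java
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

final class MergedTermCounter {
    /** One cursor per segment: its current term plus the remaining sorted terms. */
    private static final class Cursor {
        String term;
        final Iterator<String> rest;

        Cursor(Iterator<String> it) {
            this.rest = it;
            this.term = it.next();
        }
    }

    /** Counts distinct terms across segments; each inner list must be sorted and duplicate-free. */
    static long countUnique(List<List<String>> sortedSegments) {
        PriorityQueue<Cursor> queue =
            new PriorityQueue<>((a, b) -> a.term.compareTo(b.term));
        for (List<String> segment : sortedSegments) {
            Iterator<String> it = segment.iterator();
            if (it.hasNext()) {
                queue.add(new Cursor(it));
            }
        }
        long unique = 0;
        String last = null;
        while (!queue.isEmpty()) {
            Cursor c = queue.poll();          // smallest current term over all segments
            if (!c.term.equals(last)) {       // first time this term is seen
                unique++;
                last = c.term;
            }
            if (c.rest.hasNext()) {           // advance this segment and re-queue it
                c.term = c.rest.next();
                queue.add(c);
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        List<List<String>> segments = List.of(
            List.of("apple", "banana", "cherry"),
            List.of("banana", "cherry", "date"));
        System.out.println(countUnique(segments)); // 4
    }
}
```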





Re: Total number of terms in an index?

2010-07-26 Thread Jason Rutherglen
Sorry, like the subject, I mean the total number of terms.

On Mon, Jul 26, 2010 at 4:03 PM, Jason Rutherglen
jason.rutherg...@gmail.com wrote:
 What's the fastest way to obtain the total number of docs from the
 index?  (The Luke request handler takes a long time to load so I'm
 looking for something else).



Re: Total number of terms in an index?

2010-07-26 Thread Chris Hostetter

: Sorry, like the subject, I mean the total number of terms.

it's not stored anywhere, so the only way to fetch it is to actually 
iterate all of the terms and count them (that's why LukeRequestHandler is 
so slow to compute this particular value)

If i remember right, someone mentioned at one point that flex would let 
you store data about stuff like this in your index as part of the segment 
writing, but frankly i'm still not sure how that will help -- because 
unless your index is fully optimized, you still have to iterate the terms 
in each segment to 'de-dup' them.


-Hoss
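
A small Java illustration of the de-dup point above, under the assumption that each segment could report its own unique-term count. Segments are modeled as plain sets of strings, and the names are hypothetical rather than Lucene or Solr API. Adding up the per-segment counts only gives an upper bound; the exact figure still needs the terms themselves.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class SegmentTermCounts {
    /** Sum of per-segment unique-term counts: cheap, but only an upper bound. */
    static long sumOfSegmentCounts(List<Set<String>> segments) {
        long sum = 0;
        for (Set<String> segment : segments) {
            sum += segment.size();
        }
        return sum;
    }

    /** Exact index-wide count: has to look at every term of every segment. */
    static long exactUniqueCount(List<Set<String>> segments) {
        Set<String> all = new TreeSet<>();
        for (Set<String> segment : segments) {
            all.addAll(segment);
        }
        return all.size();
    }

    public static void main(String[] args) {
        List<Set<String>> segments = List.of(
            Set.of("apple", "banana", "cherry"),
            Set.of("banana", "cherry", "date"));
        System.out.println(sumOfSegmentCounts(segments)); // 6
        System.out.println(exactUniqueCount(segments));   // 4
    }
}
```

With a single (fully optimized) segment the two numbers coincide, which is the case where a stored per-segment count would be enough.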