Re: Optimal index structure

2005-01-25 Thread Tea Yu
  How many total documents will there be?  I'd opt for a single index if
searching "all categories" meets the performance target; otherwise you may
want to consider distributed searchers.  Arguments for a single index:

  1) all doc scores have to be calculated anyway, whether you use a Searcher
or a (Parallel)MultiSearcher, and the scoring should be the most expensive
part (the latter only adds a slight overhead to aggregate and sort the docs)
(see the sketch after this list)
  2) you'll most likely want to aggregate N categories into one index anyway
to avoid having too many open files
  3) if too many indexes are searched in parallel, most of the time will be
spent context switching
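
For illustration, the two setups look roughly like this (a rough sketch
against the Lucene 1.4-era API; the index paths and the "category" field
name are just placeholders, and a doc-type filter would work the same way):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ParallelMultiSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryFilter;
import org.apache.lucene.search.Searchable;
import org.apache.lucene.search.TermQuery;

public class SearchSketch {

    // Option A: one big index; a filter restricts results to a single
    // category (placeholder path and field/term values).
    static Hits searchSingleIndex(Query userQuery) throws Exception {
        IndexSearcher searcher = new IndexSearcher("/indexes/all");
        QueryFilter filter =
            new QueryFilter(new TermQuery(new Term("category", "cat42")));
        return searcher.search(userQuery, filter);
    }

    // Option B: one index per category/type, searched together; the
    // multi-searcher merges and re-sorts the hits from each shard.
    static Hits searchManyIndexes(Query userQuery) throws Exception {
        Searchable[] shards = {
            new IndexSearcher("/indexes/cat42-typeA"),   // placeholder paths
            new IndexSearcher("/indexes/cat42-typeB")
        };
        ParallelMultiSearcher multi = new ParallelMultiSearcher(shards);
        return multi.search(userQuery);
    }
}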

  an alternative would be to optimize the structure based on usage patterns,
e.g. having one full category index and several sub-category indexes, if
reindexing is not a problem

  Tea

  > I'm currently working on building a search function for my application
  > and am looking for guidance on what the optimal way to store the index
  > would be.
  >
  > The application has several different document types with documents
  > split into different categories.  Each category has differing numbers
  > of documents of each type.  A small category may have as few as 0 to 5
  > documents of each type, a large category might have as many as 50,000+
  > documents of each type.  There are upwards of 100,000 categories.  The
  > search function would never have to search documents from more than one
  > category at a time, but should be able to search either a single
  > document type or multiple document types together.  I need to be able
  > to handle over 1,000,000 searches a day with as many as 50 simultaneous
  > searches at peak times.
  >
  > My current thinking is that each category would get its own index.
  > Each document type would have a keyword which indicates which document
  > type it is.  When doing a search, I can either add a filter for that
  > particular document type, or if the search is over all document types I
  > can leave the filter out.  Alternatively, I could put everything in one
  > very large index and choose category and document type by filters.  Or
  > I can have an index for each document type for each category and use
  > multi-index searchers when necessary.
  >
  > I'm afraid that the description above is quite convoluted, so let me
  > know if further clarification is necessary.
  >
  > Any advice is welcome.
  >
  > Thanks



Re: Optimal index structure

2005-01-25 Thread Chris Conrad
On Jan 25, 2005, at 5:29 PM, Tea Yu wrote:
  How many total documents will there be?  I'd opt for a single index if
  searching "all categories" meets the performance target; otherwise you may
  want to consider distributed searchers.  Arguments for a single index:

Fortunately, there is no need for an all-categories search.  I won't be
searching across categories, just across document types.  In total, there
will be somewhere near 15,000,000 documents across about 100,000 categories.
But, again, the distribution across categories is very uneven: there will be
categories with a total of 5 or so documents, while other categories have
over 100,000.

  1) all doc scores have to be calculated anyway, whether you use a Searcher
  or a (Parallel)MultiSearcher, and the scoring should be the most expensive
  part (the latter only adds a slight overhead to aggregate and sort the docs)
  2) you'll most likely want to aggregate N categories into one index anyway
  to avoid having too many open files
I am concerned about the number of concurrently open files, but I think
that may be mitigated since some categories will receive virtually no
searches (they have very few documents, or those documents are mostly very
old).  I would say that the number of categories searched frequently will
be under 5000.  I was thinking of using an LRU cache of open indexes, which
would keep the number of open files under control and ensure that frequently
used indexes are quickly available.
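
Roughly what I had in mind, built on java.util.LinkedHashMap's access-order
mode (a rough sketch only; the capacity and path-keyed lookup are
placeholders, and a real version would also have to avoid closing a searcher
that another thread is still using):

import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.lucene.search.IndexSearcher;

public class SearcherCache extends LinkedHashMap {

    private final int capacity;

    public SearcherCache(int capacity) {
        super(capacity, 0.75f, true);   // true = access order, i.e. LRU
        this.capacity = capacity;
    }

    // Called by LinkedHashMap after each put(); evicting the eldest entry
    // closes its searcher so the file handles are released.
    protected boolean removeEldestEntry(Map.Entry eldest) {
        if (size() > capacity) {
            try {
                ((IndexSearcher) eldest.getValue()).close();
            } catch (IOException e) {
                // log and carry on
            }
            return true;
        }
        return false;
    }

    // Returns an open searcher for a category's index directory,
    // opening it on first use.
    public synchronized IndexSearcher getSearcher(String indexPath)
            throws IOException {
        IndexSearcher searcher = (IndexSearcher) get(indexPath);
        if (searcher == null) {
            searcher = new IndexSearcher(indexPath);
            put(indexPath, searcher);
        }
        return searcher;
    }
}

The search threads would then just call getSearcher() with the path for the
requested category, and eviction keeps the file-handle count bounded.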


  3) if too many indexes are searched in parallel, most of the time will be
  spent context switching

I will be limiting the number of search threads to 4-12 (this will most
likely be running on a dedicated quad Xeon).
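
Probably just a fixed pool, something like the sketch below (Java 5's
java.util.concurrent; the pool size of 8 is an arbitrary placeholder in the
4-12 range):

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class SearchService {

    // Caps the number of concurrent searches; extra requests queue up
    // behind the pool instead of each spawning its own thread.
    private final ExecutorService pool = Executors.newFixedThreadPool(8);

    public Future<Hits> submit(final IndexSearcher searcher, final Query query) {
        return pool.submit(new Callable<Hits>() {
            public Hits call() throws Exception {
                return searcher.search(query);
            }
        });
    }

    public void shutdown() {
        pool.shutdown();
    }
}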

  an alternative would be to optimize the structure based on usage patterns,
  e.g. having one full category index and several sub-category indexes, if
  reindexing is not a problem

Re-indexing will be an issue since it looks like it will take on the 
order of 3-4 days to index everything.

Thanks for your input.