Index Layout Question

2005-01-27 Thread Jerry Jalenak
I am in the process of indexing about 1.5 million documents, and have
started down the path of indexing these by month.  Each month has between
100,000 and 200,000 documents.  From a performance standpoint, is this the
right approach?  This allows me to use MultiSearcher (or
ParallelMultiSearcher), but I'm not sure if the performance gains are really
there.  Would one monolithic index be better?

Thanks.

Jerry Jalenak
Senior Programmer / Analyst, Web Publishing
LabOne, Inc.
10101 Renner Blvd.
Lenexa, KS  66219
(913) 577-1496

[EMAIL PROTECTED]


This transmission (and any information attached to it) may be confidential and
is intended solely for the use of the individual or entity to which it is
addressed. If you are not the intended recipient or the person responsible for
delivering the transmission to the intended recipient, be advised that you
have received this transmission in error and that any use, dissemination,
forwarding, printing, or copying of this information is strictly prohibited.
If you have received this transmission in error, please immediately notify
LabOne at the following email address: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Index Layout Question

2005-01-27 Thread Ian Soboroff
"Jerry Jalenak" <[EMAIL PROTECTED]> writes:

> I am in the process of indexing about 1.5 million documents, and have
> started down the path of indexing these by month.  Each month has between
> 100,000 and 200,000 documents.  From a performance standpoint, is this the
> right approach?  This allows me to use MultiSearcher (or
> ParallelMultiSearcher), but I'm not sure if the performance gains are really
> there.  Would one monolithic index be better?

It depends on your search infrastructure.  Doug Cutting has posted some
basic optimization guidelines to this list, which should be in the
archives.  In short, you need to think about how many CPUs and spindles
(disks) are involved.

1.5m documents isn't a challenge for Lucene to index or search on a
single machine with a monolithic index.  I indexed about 1.6m web
pages in 22 hours on a single machine with all data local, and search
with a single IndexSearcher was instantaneous.  We've also done some
testing with a larger collection (25m pages) using
ParallelMultiSearchers across several machines.  On a fast network we
likewise haven't noticed a slowdown, but we haven't actually benchmarked
it.

Ian
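
As a rough illustration of the two layouts being compared (one monolithic
index versus one index per month searched in parallel and merged), here is
a minimal sketch.  It deliberately does not use the Lucene API: ToyIndex,
IndexLayoutSketch and their methods are hypothetical names invented for
this example, and the in-memory term map only stands in for a real index.
It shows that both layouts return the same hits; it says nothing about
which is faster on real hardware.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Toy stand-in for an index: maps each term to the IDs of documents containing it.
// This is NOT Lucene; it only illustrates the two layouts discussed in the thread.
final class ToyIndex {
    private final Map<String, List<Integer>> postings = new HashMap<>();

    void add(int docId, String text) {
        for (String term : text.toLowerCase().split("\\s+")) {
            if (term.isEmpty()) {
                continue;
            }
            List<Integer> ids = postings.computeIfAbsent(term, k -> new ArrayList<>());
            // Skip the ID if this document already recorded the term.
            if (ids.isEmpty() || ids.get(ids.size() - 1) != docId) {
                ids.add(docId);
            }
        }
    }

    List<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptyList());
    }
}

final class IndexLayoutSketch {
    // Layout A: one monolithic index, one lookup on the calling thread.
    static List<Integer> searchMonolithic(ToyIndex index, String term) {
        List<Integer> hits = new ArrayList<>(index.search(term));
        Collections.sort(hits);
        return hits;
    }

    // Layout B: one index per month, each searched on its own thread, results merged.
    // Assumption: this mirrors the general idea of a parallel multi-index search,
    // not the behavior of any specific Lucene class.
    static List<Integer> searchShardsInParallel(List<ToyIndex> shards, String term) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, shards.size()));
        try {
            List<Future<List<Integer>>> pending = new ArrayList<>();
            for (ToyIndex shard : shards) {
                pending.add(pool.submit(() -> shard.search(term)));
            }
            List<Integer> merged = new ArrayList<>();
            for (Future<List<Integer>> result : pending) {
                merged.addAll(result.get());
            }
            Collections.sort(merged);
            return merged;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("shard search failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        ToyIndex monolithic = new ToyIndex();
        ToyIndex january = new ToyIndex();
        ToyIndex february = new ToyIndex();

        // Made-up sample documents; each goes into the monolithic index
        // and into the index for its month.
        monolithic.add(1, "lab report january");
        january.add(1, "lab report january");
        monolithic.add(2, "lab results february");
        february.add(2, "lab results february");

        List<Integer> a = searchMonolithic(monolithic, "lab");
        List<Integer> b = searchShardsInParallel(Arrays.asList(january, february), "lab");
        System.out.println("monolithic hits: " + a);
        System.out.println("per-month hits:  " + b);
        System.out.println("same results: " + a.equals(b));
    }
}
```

The per-month layout pays for thread hand-off and a merge step on every
query, so it only helps when the extra CPUs and disks are really there,
which is the point made above about infrastructure.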






RE: Index Layout Question

2005-01-27 Thread Jerry Jalenak
That's good to know.

I'm indexing on 11 fields (9 keyword, 2 text).  The documents themselves are
between 1K and 2K in size.

Is there a point at which IndexSearcher performance begins to fall off (in
terms of the number of indexed records)?
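
For a sense of scale, the figures above (about 1.5 million documents of
1K to 2K each) work out to roughly 1.4 to 2.9 GiB of raw text.  The small
sketch below just does that arithmetic; CorpusSizeEstimate and rawBytes
are hypothetical names, 1K is assumed to mean 1024 bytes, and the result
is the size of the source text, not of the resulting index, which depends
on which fields are stored and tokenized.

```java
final class CorpusSizeEstimate {
    // Raw text volume in bytes for a document count and an average size in KB
    // (assuming 1 KB = 1024 bytes).
    static long rawBytes(long documents, long avgKilobytes) {
        return documents * avgKilobytes * 1024L;
    }

    public static void main(String[] args) {
        long documents = 1_500_000L;            // document count from the thread
        long low = rawBytes(documents, 1);      // 1K per document
        long high = rawBytes(documents, 2);     // 2K per document
        double gib = Math.pow(1024, 3);
        System.out.printf("raw text: %.2f to %.2f GiB%n", low / gib, high / gib);
    }
}
```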

Jerry Jalenak

