Re: [lucy-user] Reindexing and concurrent updates

Marvin Humphrey Mon, 17 Dec 2012 14:22:03 -0800

On Mon, Dec 17, 2012 at 11:52 AM, Nick Wellnhofer <[email protected]> wrote:
> what's the best way to reindex the database completely while still allowing
> concurrent updates? My current plan is to start indexing documents until
> half of write_lock_timeout has passed and then sleep for half of
> write_lock_timeout. Does this make sense?


Let's assume that there are a maximum of two processes contending for write
access to a single index:

*   The "indexer", which accepts new content.
*   The "updater", which reindexes documents which were already in the system.

We'll ignore the potential issue of multiple new documents arriving
simultaneously -- we assume that adds are serialized through the "indexer"
process somehow, perhaps by queueing.

To guarantee that the "indexer" never times out waiting for a write lock, the
"updater" needs good worst-case performance.  The algorithm you describe will
give good average performance, but not good enough worst-case performance.

    
    http://lucy.apache.org/docs/perl/Lucy/Docs/Cookbook/FastUpdates.html#ABSTRACT

    While index updates are fast on average, worst-case update performance may
    be significantly slower. To make index updates consistently quick, we must
    manually intervene to control the process of index segment consolidation.

To guarantee good responsiveness by the "indexer" process, both "indexer" and
"updater" need to limit the amount of existing content that they will recycle
during a write.  You also need an additional BackgroundMerger process, as
described in Lucy::Docs::Cookbook::FastUpdates, to keep the number of segments
from growing out of control.
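
As a rough illustration of "limiting what gets recycled", here is a minimal
Python sketch of the selection logic only.  It is not Lucy code: the Segment
class, the segments_to_recycle() helper and the threshold of 10 documents are
all hypothetical, chosen just to show the idea that foreground writers merge
only small segments and leave the big ones to the background merger.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    # Hypothetical stand-in for an index segment; not a Lucy class.
    name: str
    doc_count: int


def segments_to_recycle(segments, max_docs_per_segment=10):
    """Return only the small segments a foreground writer may merge.

    Larger segments are skipped, so the work done while holding the write
    lock stays bounded. The threshold is an arbitrary assumption.
    """
    return [s for s in segments if s.doc_count < max_docs_per_segment]


segments = [
    Segment("seg_1", 500_000),  # large: left for the background merger
    Segment("seg_2", 8),
    Segment("seg_3", 3),
]
small = segments_to_recycle(segments)
```

In the real system this policy would live in whatever hook the library
provides for choosing segments to merge; the cookbook entry cited above is
the place to look for the actual mechanism.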

> Is there a better way?

One alternative is to reindex into a new directory off to the side, queueing
new adds as they come in and adding them after reindexing finishes.  When the
new index is caught up, swap it into place.

Disk space is cheap, so that's generally not an issue.  However, you may have
to watch out for IO cache memory usage by the side process if a production
searcher depends on having most of the live index cached in RAM to achieve
good search-time performance.
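
The side-directory approach can be sketched roughly as follows.  This is a
toy Python model under stated assumptions, not Lucy code: build_index() is a
hypothetical stand-in that writes one file per document, the queue is a plain
in-process deque, and the "swap" is done by repointing a symlink, which is
one common way to switch directories atomically on POSIX systems.

```python
import os
import tempfile
from collections import deque


def build_index(path, docs):
    # Hypothetical stand-in for an indexing run: one file per document.
    os.makedirs(path, exist_ok=True)
    for doc_id, text in docs.items():
        with open(os.path.join(path, doc_id), "w") as fh:
            fh.write(text)


def swap_into_place(live_link, new_dir):
    # Repoint the "live" symlink at new_dir. Renaming one symlink over
    # another is atomic on POSIX, so readers see the old or the new index.
    tmp_link = live_link + ".tmp"
    os.symlink(new_dir, tmp_link)
    os.replace(tmp_link, live_link)


root = tempfile.mkdtemp()
live = os.path.join(root, "index")
old_dir = os.path.join(root, "index.v1")
build_index(old_dir, {"a": "old a"})
os.symlink(old_dir, live)

pending = deque()                      # adds that arrive during the rebuild
new_dir = os.path.join(root, "index.v2")
build_index(new_dir, {"a": "new a"})   # full reindex off to the side
pending.append(("b", "new b"))         # simulated add arriving meanwhile

while pending:                         # catch up before swapping
    doc_id, text = pending.popleft()
    build_index(new_dir, {doc_id: text})

swap_into_place(live, new_dir)
```

Searchers that already have the old index open keep reading it until they
reopen, so the old directory should only be removed once nothing is using it.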

Marvin Humphrey
