Re: [lucy-user] C library, how to check index is healthy
On 28/02/2017 20:17, Serkan Mulayim wrote:
> So as I see:
> 1- when we do indexing operation in an existing index, a new segment is
> created and it is not put into the index until it is committed. When it
> is committed, its segment is kept separately and the snapshot.json file
> is updated to include the new segment.

That's right, but segments are merged occasionally.

> 2- lock files are being generated and are kept separate based on the pid
> (no shared FS adjustments).
>
> What I would like to do is, to be able to index thousands of documents
> in batches with asynchronous calls to the library. Asynchronous calls
> will try to update the newly created segment to be written by different
> calls. If PIDs are the same, it seems like system will crash due to
> write.lock containing the PIDs.

This has nothing to do with PIDs (they're only used to remove stale lock
files). You'll receive a LockErr exception if an Indexer can't acquire the
write lock after several retries regardless of the process ID.

> Do you think there is a way to make this work with calls from different
> PIDs, with an addition of commit.lock file? I hope this makes sense :( :)

Parallel indexing isn't supported by Lucy. We only support background
merging which is mostly geared towards interactive applications that only
index a few documents at a time. Non-interactive batch jobs that index
thousands of documents in parallel aren't handled well by Lucy, although
this could probably be improved. Your only options right now are:

- If it's OK for your indexing processes to potentially wait for a long
  time, increase the write lock timeout to a huge value or catch LockErrs
  and implement your own retry logic.
- Implement your own document queue where multiple processes can add
  documents and a single indexing process removes them.

> One more question is when I index documents and commit each time (let's
> say 5000 batches of commits in synchronous way), I see that the indexing
> works fine. How are the segments being handled. I do not see that 5000
> different segments created. Is it because after a certain number of
> segments (say 32), the segments are being merged and optimized?

Yes, that's how it works. The FastUpdates cookbook entry contains more
details:

https://lucy.apache.org/docs/c/Lucy/Docs/Cookbook/FastUpdates.html

But I don't think background merging would help much in your case.

Nick
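The first option in this message (catching LockErrs and retrying) comes down to a retry loop with a growing delay. Below is a minimal sketch in plain C that does not call the Lucy API. The names `retry_with_backoff`, `try_fn` and `succeed_after` are hypothetical, and in a real program the callback would try to create an Indexer and report whether the write lock was obtained.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical callback type: returns true once the contended resource
 * (for example the index write lock) has been acquired. */
typedef bool (*try_fn)(void *ctx);

/* Sleep for the given number of milliseconds. */
static void sleep_ms(long ms) {
    struct timespec ts;
    ts.tv_sec  = ms / 1000;
    ts.tv_nsec = (ms % 1000) * 1000000L;
    nanosleep(&ts, NULL);
}

/* Call attempt(ctx) up to max_attempts times. After each failure the delay
 * doubles, capped at max_delay_ms. Returns true on the first success and
 * false if every attempt failed. */
bool retry_with_backoff(try_fn attempt, void *ctx, int max_attempts,
                        long initial_delay_ms, long max_delay_ms) {
    assert(attempt != NULL);
    long delay = initial_delay_ms;
    for (int i = 0; i < max_attempts; i++) {
        if (attempt(ctx)) {
            return true;
        }
        if (i + 1 < max_attempts) {      /* no sleep after the last try */
            sleep_ms(delay);
            delay *= 2;
            if (delay > max_delay_ms) {
                delay = max_delay_ms;
            }
        }
    }
    return false;
}

/* Demonstration callback: fails until the counter in ctx reaches zero.
 * A real callback would attempt to open an Indexer instead. */
bool succeed_after(void *ctx) {
    int *remaining = ctx;
    if (*remaining == 0) {
        return true;
    }
    (*remaining)--;
    return false;
}
```

Capping the delay keeps a long-running batch job from backing off indefinitely while another process holds the lock.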
Re: [lucy-user] C library, how to check index is healthy
Thanks guys very much for your comments. And sorry for my late response.

Nick, I have a few follow up questions regarding your comments.

So as I see:
1- when we do indexing operation in an existing index, a new segment is
created and it is not put into the index until it is committed. When it is
committed, its segment is kept separately and the snapshot.json file is
updated to include the new segment.
2- lock files are being generated and are kept separate based on the pid
(no shared FS adjustments).

From the documentation about Indexer: "In general, only one Indexer at a
time may write to an index safely. If a write lock cannot be secured, new()
will throw an exception."

What I would like to do is, to be able to index thousands of documents in
batches with asynchronous calls to the library. Asynchronous calls will try
to update the newly created segment to be written by different calls. If
PIDs are the same, it seems like system will crash due to write.lock
containing the PIDs.

Do you think there is a way to make this work with calls from different
PIDs, with an addition of commit.lock file? I hope this makes sense :( :)

One more question is when I index documents and commit each time (let's say
5000 batches of commits in synchronous way), I see that the indexing works
fine. How are the segments being handled. I do not see that 5000 different
segments created. Is it because after a certain number of segments (say
32), the segments are being merged and optimized?

Thanks in advance.
Serkan

On Tue, Feb 14, 2017 at 7:03 AM, Nick Wellnhofer wrote:

> On 13/02/2017 20:44, Serkan Mulayim wrote:
>
>> 1- How do we check that the index is healthy for SEARCHING (e.g. creating
>> a searcher) without a crash? As I see there is no problem in creating a
>> Searcher even if there is a lock (write.lock or merge.lock)
>
> First of all, Lucy should never "crash" in the sense of a segfault. If it
> does, this is a bug that should be reported.
>
> Unless your index is on a shared volume like NFS, it can always be
> searched.
>
>> 2- How do we check that the index is healthy for INDEXING (e.g. creating a
>> new indexer). I believe if the index is healthy(answer to the first
>> question) and there is no LOCK file (e.g. write.lock or merge.lock), then
>> we can assume that index is healthy and we can create a new indexer, right.
>> (Assuming that there is no write permission issues or no disk space issues)
>
> You can always create a new Indexer. The worst that can happen is that a
> LockErr exception is thrown after the Indexer failed to acquire a lock.
> Note that by default, Indexer retries to get a lock for 1000 ms (one
> second). This can be configured with IndexManager:
>
> https://lucy.apache.org/docs/c/Lucy/Index/IndexManager.html
>
>> 3- What are the lock types? As far as I see there are only write.lock and
>> merge.lock. Are there any others?
>
> This is explained in the documentation:
>
> https://lucy.apache.org/docs/c/Lucy/Docs/FileLocking.html
>
>> If we close the application calling Lucy before the indexer is destroyed,
>> is there an index recovery strategy.
>
> Lucy uses an atomic rename operation when committing data so a crashing
> Indexer should never corrupt the index.
>
>> What would the implications of simply deleting write.lock and merge.lock
>> be?
>
> In most cases, this shouldn't be necessary. Lucy stores the PID of the
> process that created a lock and tries to clear stale lock files from
> crashed processes. But this won't work if another process reuses the PID.
> If you're absolutely sure that a lock doesn't belong to an active Indexer,
> you can delete the lock directories manually.
>
> Side note: This could be improved by supporting locking mechanisms that
> release locks automatically if a process crashes. But these are
> OS-dependent and aren't guaranteed to work reliably over NFS:
>
> - `fcntl(F_SETLK)` or `lockf` on POSIX (unsuitable for multi-threaded
>   operation).
> - `flock` on BSD, Linux.
> - `CreateFile` with a 0 sharing mode on Windows.
>
> Nick
Re: [lucy-user] C library, how to check index is healthy
On Tue, Feb 14, 2017 at 7:43 AM, Tilghman Lesher wrote:

> As another sidenote, there are techniques for reliable exclusive
> locking when the datastore is NFS. Namely, instead of using the
> default locking mechanisms in Unix, you can use the link(2) system
> interface (which is an atomic operation on NFS) with an agreed-upon
> name for your lock. For example, if your shared volume was "/shared",
> then you could create a temporary file using mkstemp on the volume,
> then attempt to link(2) the temporary file to that known lockfile
> name, "/shared/lock". If the link succeeds, you have the lock, but if
> the operation fails, another process obtained the lock.

That is, in fact, what Lucy does internally.

https://github.com/apache/lucy/blob/rel/v0.6.1/core/Lucy/Store/Lock.c#L188

    // Write to a temporary file, then use the creation of a hard link to
    // ensure atomic but non-destructive creation of the lockfile with its
    // complete contents.

> This method
> does require that your processes clean up (i.e. delete) the file when
> you want to release the lock, however.

Right, and we have some logic to clean up the lockfile automatically. The
lock file contains a host name and a PID; if the host name matches AND the
PID is not active, we assume that the lockfile can be deleted.

This default behavior works pretty well for "typical" use on normal local
volumes -- it deletes many stale lockfiles automatically and generally
spares users from having to evaluate whether they need to do it themselves.

The price is that on NFS and the like you typically need to override the
default: to be safe when you have multiple machines trying to write to an
index on a shared volume, you must ensure that each Indexer is associated
with the proper host name (via IndexManager). (For more info, see
http://lucy.apache.org/docs/c/Lucy/Docs/FileLocking .)

> 1. Build to /shared/index_123/ (number could also be the PID of the
> index-building process).
> 2. Delete /shared/index_old/.
> 3. Use readlink(2) to grab the current (real) pathname of the index
> (/shared/index_122)
> 4. cd /shared ; ln -sf index_123/ /shared/index (production path)
> 5. Rename the previous index (/shared/index_122) to /shared/index_old/.
>
> By building the index under a temporary directory name, then swapping
> out the directory when we want to put the new index into production,
> we avoid the locking problems between readers and writers entirely.

I can see how this works, though it is costly if you're building the
indexes from scratch each time rather than taking advantage of Lucy's
incremental indexing.

To speed things up, Lucy could supply a way to copy an entire index
near-instantaneously using hard links. (This works because index files,
once committed, are never modified -- index content only changes through
the addition of new files and eventual deletion of obsolete files.) The
interface could look something like this:

    lucy_Backup *backup = lucy_Backup_new("/path/to/index");
    cfish_String *snapshot_name = Lucy_Backup_Get_Snapshot_Name(backup);
    cfish_String *backup_path
        = cfish_String_newf("/backupdir/backup_%o", snapshot_name);
    Lucy_Backup_Hard_Link_Dupe(backup, backup_path);

Then the following workflow becomes possible:

1. Use `hard_link_dupe` to create a duplicate index.
2. Add new content to the duped index.
3. ln -sf /shared/index_123 /shared/index (production path)
4. Remove old indexes after some timeout.

(All searchers must be refreshed on a schedule which guarantees they do not
access deleted content, or you'll see `Stale NFS filehandle` exceptions.)

Marvin Humphrey
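The hard-link duplication proposed in this message can be sketched at the POSIX level, outside of Lucy. This is only an illustration of the idea, assuming both directories are on the same filesystem and no Indexer is writing during the copy. The function name `hard_link_dupe` mirrors the proposal but is a hypothetical standalone helper, not a Lucy API.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Recreate the directory tree under `src` at `dst`, hard-linking every
 * regular file instead of copying its contents. `dst` must not exist yet.
 * Returns 0 on success, -1 on the first error (errno set by the failing
 * call). */
int hard_link_dupe(const char *src, const char *dst) {
    assert(src != NULL && dst != NULL);
    if (mkdir(dst, 0755) != 0) {
        return -1;
    }
    DIR *dir = opendir(src);
    if (dir == NULL) {
        return -1;
    }

    int status = 0;
    struct dirent *entry;
    while (status == 0 && (entry = readdir(dir)) != NULL) {
        if (strcmp(entry->d_name, ".") == 0
            || strcmp(entry->d_name, "..") == 0) {
            continue;
        }
        char from[PATH_MAX], to[PATH_MAX];
        int n1 = snprintf(from, sizeof from, "%s/%s", src, entry->d_name);
        int n2 = snprintf(to, sizeof to, "%s/%s", dst, entry->d_name);
        if (n1 >= (int)sizeof from || n2 >= (int)sizeof to) {
            errno = ENAMETOOLONG;
            status = -1;
            break;
        }
        struct stat st;
        if (lstat(from, &st) != 0) {
            status = -1;
        } else if (S_ISDIR(st.st_mode)) {
            status = hard_link_dupe(from, to);  /* descend into subdirs */
        } else if (S_ISREG(st.st_mode)) {
            status = link(from, to);            /* new name, same inode */
        }
        /* Other file types (symlinks, sockets) are skipped in this sketch. */
    }
    int saved_errno = errno;
    closedir(dir);
    errno = saved_errno;
    return status;
}
```

Because the duplicate shares inodes with the original, it costs almost no disk space, and it stays valid after the original files are unlinked. This relies on the property stated above that committed index files are never modified in place.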
Re: [lucy-user] C library, how to check index is healthy
On Tue, Feb 14, 2017 at 9:03 AM, Nick Wellnhofer wrote:
> On 13/02/2017 20:44, Serkan Mulayim wrote:
>> What would the implications of simply deleting write.lock and merge.lock
>> be?
>
> In most cases, this shouldn't be necessary. Lucy stores the PID of the
> process that created a lock and tries to clear stale lock files from crashed
> processes. But this won't work if another process reuses the PID. If
> you're absolutely sure that a lock doesn't belong to an active Indexer, you
> can delete the lock directories manually.
>
> Side note: This could be improved by supporting locking mechanisms that
> release locks automatically if a process crashes. But these are OS-dependent
> and aren't guaranteed to work reliably over NFS:
>
> - `fcntl(F_SETLK)` or `lockf` on POSIX (unsuitable for multi-threaded
>   operation).
> - `flock` on BSD, Linux.
> - `CreateFile` with a 0 sharing mode on Windows.

As another sidenote, there are techniques for reliable exclusive
locking when the datastore is NFS. Namely, instead of using the
default locking mechanisms in Unix, you can use the link(2) system
interface (which is an atomic operation on NFS) with an agreed-upon
name for your lock. For example, if your shared volume was "/shared",
then you could create a temporary file using mkstemp on the volume,
then attempt to link(2) the temporary file to that known lockfile
name, "/shared/lock". If the link succeeds, you have the lock, but if
the operation fails, another process obtained the lock. This method
does require that your processes clean up (i.e. delete) the file when
you want to release the lock, however.

When it comes to rebuilding the index, we typically build the index
under a temporary directory name, then swap out the directories to the
production path using a forced-symlink (ln -sf). As long as the old
index is kept for the maximum length of time of a searcher process,
there's no danger. In other words:

1. Build to /shared/index_123/ (number could also be the PID of the
   index-building process).
2. Delete /shared/index_old/.
3. Use readlink(2) to grab the current (real) pathname of the index
   (/shared/index_122)
4. cd /shared ; ln -sf index_123/ /shared/index (production path)
5. Rename the previous index (/shared/index_122) to /shared/index_old/.

By building the index under a temporary directory name, then swapping
out the directory when we want to put the new index into production,
we avoid the locking problems between readers and writers entirely.

-- 
Tilghman
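The link(2) locking technique described earlier in this message can be sketched in C roughly as follows. This is an illustration under stated assumptions and not Lucy's actual implementation. The names `acquire_link_lock` and `release_link_lock` and the temp-file name pattern are hypothetical. The link-count fallback follows the advice in the Linux open(2) manual page for lockfiles on NFS.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Try to take an exclusive lock by hard-linking a private temp file to the
 * agreed-upon name `lock_path`. The temp file is created in `dir`, which
 * must be on the same volume as `lock_path`. Returns true if this process
 * now holds the lock, false if it is held elsewhere or an error occurred. */
bool acquire_link_lock(const char *dir, const char *lock_path) {
    assert(dir != NULL && lock_path != NULL);

    char tmp_path[PATH_MAX];
    int n = snprintf(tmp_path, sizeof tmp_path, "%s/lock.tmp.XXXXXX", dir);
    if (n >= (int)sizeof tmp_path) {
        return false;
    }
    int fd = mkstemp(tmp_path);
    if (fd < 0) {
        return false;
    }
    /* Record the owner's PID so a stale lock can be diagnosed later. */
    dprintf(fd, "%ld\n", (long)getpid());
    close(fd);

    /* link(2) fails with EEXIST if the lock name is already taken. */
    bool locked = (link(tmp_path, lock_path) == 0);
    if (!locked) {
        /* Over NFS the call may report failure although the link was
         * created; a link count of 2 on the temp file means it worked. */
        struct stat st;
        locked = (stat(tmp_path, &st) == 0 && st.st_nlink == 2);
    }
    unlink(tmp_path);  /* the lock name, if created, keeps the inode alive */
    return locked;
}

/* Release the lock by deleting the agreed-upon name. */
bool release_link_lock(const char *lock_path) {
    assert(lock_path != NULL);
    return unlink(lock_path) == 0;
}
```

A caller would invoke `acquire_link_lock("/shared", "/shared/lock")` before writing and `release_link_lock("/shared/lock")` afterwards. As noted above, a process that dies without releasing leaves the lock file behind.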
Re: [lucy-user] C library, how to check index is healthy
Nick Wellnhofer wrote on 2/14/17 9:03 AM:
> On 13/02/2017 20:44, Serkan Mulayim wrote:
>> 1- How do we check that the index is healthy for SEARCHING (e.g. creating
>> a searcher) without a crash? As I see there is no problem in creating a
>> Searcher even if there is a lock (write.lock or merge.lock)
>
> First of all, Lucy should never "crash" in the sense of a segfault. If it
> does, this is a bug that should be reported.
>
> Unless your index is on a shared volume like NFS, it can always be
> searched.

One trick to keep in mind is that if the index underlying a Searcher
changes (as through indexing or document deletion), you must detect that
change and open a new Searcher. Because of mmap it's very fast to spawn a
new Searcher, but sometimes you'll see stale results if you persist one
too long.

An example of how Dezi does that is here:

https://metacpan.org/source/KARMAN/Dezi-App-0.014/lib/Dezi/Lucy/Searcher.pm#L406

tl;dr is that Dezi writes its own index metadata header that includes a
UUID and timestamp for the last time the index was updated, and checks
that UUID against the current Searcher to know if it is stale and needs to
be re-created.

-- 
Peter Karman . https://peknet.com/ . https://keybase.io/peterkarman
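The staleness check described in this message could look roughly like this in C. It is a loose sketch and not Dezi's code. It assumes the indexing side rewrites a small metadata file whose first line is a fresh token (for example a UUID) after every commit. The file layout and the names `searcher_is_stale`, `read_token` and `TOKEN_MAX` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define TOKEN_MAX 64

/* Read the first line of the (hypothetical) metadata file into `out`,
 * which must have room for TOKEN_MAX bytes. Returns false if the file
 * cannot be read. */
static bool read_token(const char *path, char *out) {
    FILE *fp = fopen(path, "r");
    if (fp == NULL) {
        return false;
    }
    bool ok = (fgets(out, TOKEN_MAX, fp) != NULL);
    fclose(fp);
    if (ok) {
        out[strcspn(out, "\r\n")] = '\0';  /* strip the line ending */
    }
    return ok;
}

/* Compare the token on disk with the one cached when the Searcher was
 * opened. If they differ, update the cache and return true, telling the
 * caller to discard its Searcher and open a new one. If the file cannot
 * be read, return false so the existing Searcher stays in use. */
bool searcher_is_stale(const char *meta_path, char *cached) {
    assert(meta_path != NULL && cached != NULL);
    char current[TOKEN_MAX];
    if (!read_token(meta_path, current)) {
        return false;
    }
    if (strcmp(current, cached) == 0) {
        return false;
    }
    strcpy(cached, current);
    return true;
}
```

A search loop would call `searcher_is_stale` before each query, with `cached` initialized to an empty string of size `TOKEN_MAX`, and reopen the Searcher whenever it returns true.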
Re: [lucy-user] C library, how to check index is healthy
On 13/02/2017 20:44, Serkan Mulayim wrote:
> 1- How do we check that the index is healthy for SEARCHING (e.g. creating
> a searcher) without a crash? As I see there is no problem in creating a
> Searcher even if there is a lock (write.lock or merge.lock)

First of all, Lucy should never "crash" in the sense of a segfault. If it
does, this is a bug that should be reported.

Unless your index is on a shared volume like NFS, it can always be
searched.

> 2- How do we check that the index is healthy for INDEXING (e.g. creating a
> new indexer). I believe if the index is healthy(answer to the first
> question) and there is no LOCK file (e.g. write.lock or merge.lock), then
> we can assume that index is healthy and we can create a new indexer, right.
> (Assuming that there is no write permission issues or no disk space issues)

You can always create a new Indexer. The worst that can happen is that a
LockErr exception is thrown after the Indexer failed to acquire a lock.
Note that by default, Indexer retries to get a lock for 1000 ms (one
second). This can be configured with IndexManager:

https://lucy.apache.org/docs/c/Lucy/Index/IndexManager.html

> 3- What are the lock types? As far as I see there are only write.lock and
> merge.lock. Are there any others?

This is explained in the documentation:

https://lucy.apache.org/docs/c/Lucy/Docs/FileLocking.html

> If we close the application calling Lucy before the indexer is destroyed,
> is there an index recovery strategy.

Lucy uses an atomic rename operation when committing data so a crashing
Indexer should never corrupt the index.

> What would the implications of simply deleting write.lock and merge.lock
> be?

In most cases, this shouldn't be necessary. Lucy stores the PID of the
process that created a lock and tries to clear stale lock files from
crashed processes. But this won't work if another process reuses the PID.
If you're absolutely sure that a lock doesn't belong to an active Indexer,
you can delete the lock directories manually.

Side note: This could be improved by supporting locking mechanisms that
release locks automatically if a process crashes. But these are
OS-dependent and aren't guaranteed to work reliably over NFS:

- `fcntl(F_SETLK)` or `lockf` on POSIX (unsuitable for multi-threaded
  operation).
- `flock` on BSD, Linux.
- `CreateFile` with a 0 sharing mode on Windows.

Nick
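The stale-lock cleanup described in this message depends on knowing whether the PID recorded in a lock still belongs to a running process. On POSIX systems a common way to test that is `kill` with signal 0. The sketch below shows that general technique and is not taken from Lucy's source. The name `pid_is_active` is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>

/* Return true if a process with this PID currently exists on this host.
 * Signal 0 makes kill() perform its existence and permission checks
 * without delivering any signal. */
bool pid_is_active(pid_t pid) {
    assert(pid > 0);  /* 0 and negative values address process groups */
    if (kill(pid, 0) == 0) {
        return true;
    }
    /* EPERM: the process exists but belongs to another user.
     * ESRCH: no such process, so a lock naming this PID is stale. */
    return errno == EPERM;
}
```

The check cannot tell the original lock owner apart from an unrelated process that was later assigned the same PID, which is the limitation mentioned above. It also only applies to PIDs from the local host.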