Best Practices for Distributing Lucene Indexing and Searching

2005-03-01 Thread Luke Francl
Lucene Users,

We have a requirement for a new version of our software that it run in a
clustered environment. Any node should be able to go down but the
application must keep functioning.

Currently, we use Lucene on a single node, but this won't meet our
failover requirements. If we can't find a solution, we'll have to stop
using Lucene and switch to something else, like full-text indexing inside
the database.

So I'm looking for best practices on distributing Lucene indexing and
searching. I'd like to hear from those of you using Lucene in a
multi-process environment about what is working for you. I've done some
research, and based on what I've seen so far, here's a bit of
brainstorming on what seems to be possible:

1. Don't. Have a single indexing and searching node. [Note: this is the
last resort.]

2. Don't distribute indexing. Searching is distributed by storing the
index on NFS; a single indexing node would process all requests.
However, using Lucene on NFS is *not* recommended. See:
http://lucenebook.com/search?query=nfs ...it can result in the stale NFS
file handle problem:
http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg12481.html
So we'd have to investigate this option. Indexing could use a JMS queue,
so if the box goes down, indexing could resume where it left off when it
comes back up.

3. Distribute indexing and searching into separate indexes for each
node. Combine results using ParallelMultiSearcher (see the sketch after
this list). If a box went down, a
piece of the index would be unavailable. Also, there would be serious
issues making sure assets are indexed in the right place to prevent
duplicates, stale results, or deleted assets from showing up in the
index. Another possibility would be a hashing scheme for
indexing...assets could be put into buckets based on their
IDs to prevent duplication. Keeping results consistent as the number of
buckets changes when nodes come up and down would be a challenge, though.

4. Distribute indexing and searching, but index everything at each node.
Each node would have a complete copy of the index. Indexing would be
slower. We could move to a 5- or 15-minute batch approach.

5. Index centrally and push updated indexes to search nodes on a
periodic basis. This would be easy and might avoid the problems with
using NFS.

6. Index locally and synchronize changes periodically. This is an
interesting idea and bears looking into. Lucene can combine multiple
indexes into a single one, which can be written out somewhere else, and
then distributed back to the search nodes to replace their existing
index.

7. Create a JDBCDirectory implementation and let the database handle the
clustering. A JDBCDirectory exists
(http://ppinew.mnis.com/jdbcdirectory/), but has only been tested with
MySQL. It would probably require modification (the code is under the
LGPL). At one time, an OracleDirectory implementation existed, but that
was back in 2000, so it is surely badly outdated. In principle, though,
the concept is workable. However, these database-backed directories are
slower at indexing and searching than the standard file-based directory,
probably mostly due to BLOB handling.

8. Can the Berkeley DB-based DBDirectory help us? I am not sure what
advantages it would bring over the traditional FSDirectory, but maybe
someone else has some ideas.
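
For option 3, the search-time merge might look something like this (a
sketch against the Lucene 1.4-era ParallelMultiSearcher API; the index
paths are hypothetical):

    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ParallelMultiSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.Searchable;

    // Each node searches its own partial index; results are merged here.
    Searchable[] nodes = {
        new IndexSearcher("/indexes/node1"),
        new IndexSearcher("/indexes/node2"),
        new IndexSearcher("/indexes/node3"),
    };
    ParallelMultiSearcher searcher = new ParallelMultiSearcher(nodes);
    Hits hits = searcher.search(query); // query built elsewhere

If a node is down, constructing its IndexSearcher fails, which is exactly
the partial-results problem described in option 3.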

Please let me know if you've got any other ideas or a best practice to
follow.

Thanks,
Luke Francl





Re: RangeQuery With Date

2005-02-07 Thread Luke Francl
Your dates need to be stored in lexicographical order for the RangeQuery
to work.

Index them using this date format: YYYYMMDD.

Also, I'm not sure if the QueryParser can handle range queries with only
one end point. You may need to create this query programmatically.
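
For example (a sketch against the Lucene 1.x API; the field name and
dates are made up):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.RangeQuery;

    // Index time: store the date as YYYYMMDD so lexicographic order
    // matches chronological order.
    Document doc = new Document();
    doc.add(Field.Keyword("date", "20050207"));

    // Search time: a null endpoint leaves that side of the range open.
    RangeQuery since2005 =
        new RangeQuery(new Term("date", "20050101"), null, true);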

Regards,
Luke Francl


Re: Why IndexReader.lastModified(index) is depricated?

2005-01-20 Thread Luke Francl
On Wed, 2005-01-19 at 23:24, Otis Gospodnetic wrote:

 To answer the original question, yes, I think it would be handy to have
 this method back.  Perhaps we should revive it/them, ha?

LIMO and Luke use this method (even though it is deprecated) to show the
user when the index was last updated.

I think it would be nice to have it back, but it should be clearly noted
that it is for informational purposes _only_. If you want to see if the
index has changed, use the version number.
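
Something like this (a sketch; the index path is a placeholder):

    import org.apache.lucene.index.IndexReader;

    // lastModified is fine for display...
    long modified = IndexReader.lastModified("/path/to/index");
    // ...but to detect change, compare version numbers, which
    // increment on every index modification.
    long version = IndexReader.getCurrentVersion("/path/to/index");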

Luke Francl
LIMO co-developer
http://limo.sourceforge.net





Re: Multi-threading problem: couldn't delete segments

2005-01-13 Thread Luke Francl
I didn't get any response to this post so I wanted to follow up (you can
read the full description of my problem in the archives:
http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]msgNo=11986).

Here's an additional piece of information: 

I wrote a small program to confirm that on Windows, you can't rename a
file while another thread has it open.

If I am performing a search, is it possible that the IndexReader is
holding open the segments file when there is an attempt by my indexing
code to overwrite it with File.renameTo()?
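
The test program was essentially this (a reconstruction, not the exact
code):

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;

    public class RenameTest {
        public static void main(String[] args) throws Exception {
            File f = new File("segments");
            new FileOutputStream(f).close(); // create the file

            FileInputStream in = new FileInputStream(f); // hold it open
            boolean renamed = f.renameTo(new File("segments.new"));
            // Prints false on Windows while the stream is open;
            // on most Unixes it prints true.
            System.out.println("rename succeeded: " + renamed);
            in.close();
        }
    }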

Thanks,
Luke Francl

On Thu, 2005-01-06 at 17:43, Luke Francl wrote:
 We are having a problem with Lucene in a high concurrency
 create/delete/search situation. I thought I fixed all these problems,
 but I guess not.






RE: Multi-threading problem: couldn't delete segments

2005-01-13 Thread Luke Francl
On Thu, 2005-01-13 at 12:33, David Townsend wrote:
 Just read your old post. I'm not quite sure whether I've read this
 correctly. Is the search worker thread also doing deletes from the index?

   a test script is going that is hitting the search
   part of our application (I think the script also updates and deletes
   Documents, but I am not sure.

 Deleting also locks the index, so maybe the IndexWriter is waiting for the
 search thread to release the lock.

I checked with my co-worker, and his script is doing a search, modifying
assets (which deletes and re-inserts) and then deleting them. This is
going on while new Documents are being added to the index from another
thread. (Due to some weirdness in our application, it is also trying to
delete Documents that don't exist before inserting them -- should be
harmless, though.)

I control access to the index with a lock object during all write
accesses to the index, including deletes.

You can see the code here:
http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]msgId=2068605attachId=1

Luke





Re: How do you handle dynamic html pages?

2005-01-10 Thread Luke Francl
On Mon, 2005-01-10 at 10:03, Jim Lynch wrote:
 How is anyone managing reindexing of pages that change? Just
 periodically reindex everything, or do you try to determine the
 frequency of changes to each page and/or site?

If you are using a CMS, your best bet is to integrate Lucene with the
CMS's content update mechanism. That way, your index will always be
up-to-date.

Otherwise, I would say reindexing everything is easiest, provided it
doesn't take too long. If it's ~15 minutes or less, you could schedule a
process to do it at a low-activity period (2 AM or whenever) every day
and that would probably handle your needs.

Regards,
Luke Francl





Re: Use search engine technology for object persistence

2005-01-07 Thread Luke Francl
On Fri, 2005-01-07 at 08:05, Erik Hatcher wrote:
 Interesting article:
 
   http://www.javaworld.com/javaworld/jw-01-2005/jw-0103-search_p.html

Sort of off-topic, but does this mean JavaWorld is publishing again? I
had read Bill Venners's post from back in January '04 saying that they
had shut down.

Luke





Re: Check to see if index is optimized

2005-01-07 Thread Luke Francl
On Fri, 2005-01-07 at 13:24, Crump, Michael wrote:

 Is there a simple way to check and see if an index is already optimized?
 What happens if optimize is called on an already optimized index - does
 the call basically do a noop?  Or is it still an expensive call?

If an index has no deletions, it does not need to be optimized. You can
find out if it has deletions with IndexReader.hasDeletions.

I am not sure what the cost of optimization is if the index doesn't need
it. Perhaps someone else on this list knows.
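
The check itself is simple, though (a sketch against the Lucene 1.3-era
API; the path and analyzer are placeholders):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;

    IndexReader reader = IndexReader.open("/path/to/index");
    boolean hasDeletions = reader.hasDeletions();
    reader.close();

    if (hasDeletions) {
        // false = open the existing index rather than create a new one
        IndexWriter writer =
            new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
        writer.optimize();
        writer.close();
    }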

Regards,
Luke Francl





Multi-threading problem: couldn't delete segments

2005-01-06 Thread Luke Francl
We are having a problem with Lucene in a high concurrency
create/delete/search situation. I thought I fixed all these problems,
but I guess not.

Here's what's happening.

We are conducting load testing on our application.

On a Windows 2000 server using lucene-1.3-final with compound file
enabled, a worker thread is creating new Documents as it ingests
content. Meanwhile, a test script is going that is hitting the search
part of our application (I think the script also updates and deletes
Documents, but I am not sure. My colleague who wrote it has left for the
day so I can't ask him.).

The scripted test passes with 1, 5, and 10 users hitting the
application. At 20 users, we get this exception:

[Task Worker1] ERROR com.ancept.ams.search.lucene.LuceneIndexer -
Caught exception closing IndexReader in finally block
java.io.IOException: couldn't delete segments
        at org.apache.lucene.store.FSDirectory.renameFile(FSDirectory.java:236)
        at org.apache.lucene.index.SegmentInfos.write(SegmentInfos.java(Compiled Code))
        at org.apache.lucene.index.SegmentReader$1.doBody(SegmentReader.java:179)
        at org.apache.lucene.store.Lock$With.run(Lock.java:148)
        at org.apache.lucene.index.SegmentReader.doClose(SegmentReader.java(Compiled Code))
        at org.apache.lucene.index.IndexReader.close(IndexReader.java(Inlined Compiled Code))
        at org.apache.lucene.index.SegmentsReader.doClose(SegmentsReader.java(Compiled Code))
        at org.apache.lucene.index.IndexReader.close(IndexReader.java(Compiled Code))
        at com.ancept.ams.search.lucene.LuceneIndexer.delete(LuceneIndexer.java:266)

All write access to the index is controlled in that LuceneIndexer class
by synchronizing on a static lock object. 

Searching is handled in another part of the code, which creates new
IndexSearchers as necessary when the index changes. I do not rely on
finalization to clean up these searchers because we found it to be
unreliable. I keep track of threads using each searcher and then close
it when that number drops to 0 if the searcher is outdated. 
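
The tracking wrapper looks roughly like this (a simplified sketch with
hypothetical names, not the actual class):

    import java.io.IOException;
    import org.apache.lucene.search.IndexSearcher;

    class TrackedSearcher {
        private final IndexSearcher searcher;
        private int inUse = 0;
        private boolean outdated = false;

        TrackedSearcher(IndexSearcher searcher) { this.searcher = searcher; }

        synchronized IndexSearcher acquire() {
            inUse++;
            return searcher;
        }

        synchronized void release() throws IOException {
            if (--inUse == 0 && outdated) searcher.close();
        }

        // Called when the index changes and a newer searcher replaces this one.
        synchronized void markOutdated() throws IOException {
            outdated = true;
            if (inUse == 0) searcher.close();
        }
    }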

My problem seems similar to what Robert Leftwich asked about on this
mailing list in January 2001.  

Google Cache:
http://64.233.179.104/search?q=cache:1D4h1vSh5AQJ:www.geocrawler.com/mail/msg.php3%3Fmsg_id%3D5020057++lucene+multithreading+problems+site:geocrawler.com&hl=en

Doug Cutting replied to him saying that he should synchronize calls to
IndexReader.open() and IndexReader.close():

Google Cache:
http://64.233.179.104/search?q=cache:arztiytQ42QJ:www.geocrawler.com/archives/3/2624/2001/1/0/5020870/++lucene+multithreading+problems+site:geocrawler.com&hl=en

Robert Leftwich then found a problem with his code and eliminated a
second IndexReader that was messing stuff up:

Google Cache:
http://64.233.179.104/search?q=cache:jSIsi6t9KH8J:www.geocrawler.com/mail/msg.php3%3Fmsg_id%3D5037517++lucene+multithreading+problems+site:geocrawler.com&hl=en

However, there are differences between Leftwich's design and mine, and
besides, that thread is four years old. (Are there even existing
archives for lucene-user throughout 2001 anywhere?)

So any advice would be appreciated.

Do I need to synchronize _all_ IndexReader.open() and
IndexReader.close() calls? Or is it more likely that I'm missing
something in my class that modifies the index? The code is attached.

Thank you,

Luke Francl
// $Id: LuceneIndexer.java 20473 2004-10-19 17:20:10Z lfrancl $
package com.ancept.ams.search.lucene;

import com.ancept.ams.asset.AssetUtils;
import com.ancept.ams.asset.AttributeValue;
import com.ancept.ams.asset.IAsset;
import com.ancept.ams.asset.IAssetIdentifier;
import com.ancept.ams.asset.IAssetList;
import com.ancept.ams.asset.ITimeMetadataAsset;
import com.ancept.ams.asset.IVideoAssetView;
import com.ancept.ams.controller.RelayFactory;
import com.ancept.ams.enums.AttributeNamespace;
import com.ancept.ams.enums.AttributeType;
import com.ancept.ams.enums.TimeMetadataType;
import com.ancept.ams.relay.IAssetRelay;
import com.ancept.ams.search.Indexer;
import com.ancept.ams.search.Fields;
import com.ancept.ams.util.SystemConfig;
import com.ancept.ams.util.PerformanceMonitor;

import org.apache.log4j.Logger;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

import java.io.File;
import java.io.IOException;
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.Iterator;
import java.util.List;

/**
 * Controls access to the Lucene index.
 *
 * @author Luke Francl
 **/
public final class LuceneIndexer implements Indexer {

private static final Logger l4j = Logger.getLogger

Re: LIMO problems

2004-12-09 Thread Luke Francl
On Thu, 2004-12-09 at 07:32, Daniel Cortes wrote:
 Hi, I'm trying LIMO (the Lucene Index Monitor) and I have a problem.
 Obviously it will be a silly problem, but right now I don't have a
 solution.
 Can someone tell me what structure the limo.properties file has? I don't
 have an example. Thanks.
 If you know another web application for administering Lucene indexes,
 please tell me.
 Thanks for all, and excuse me for my silly questions.

Daniel,

Julien or I will be happy to help you, but I need more information. What
version of LIMO are you using?

In LIMO 0.5.2, Julien added a new feature which allows you to configure
the LIMO web application while it is running through the limo.properties
file.

This file is in the standard Java properties file format:

<index name>=<filesystem location>
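
For example (hypothetical index names and locations):

    # limo.properties
    mainIndex=/var/lucene/main
    archiveIndex=/var/lucene/archive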

However, you shouldn't need to care about this detail, as there is a
method to add indexes from the web application.

If you have any other questions, please don't hesitate to ask.

Regards,
Luke Francl
LIMO developer

P.S.: LIMO 0.5.2 adds a new index file browser that shows you some
interesting details about your index files. Check it out!





Re: LIMO problems

2004-12-09 Thread Luke Francl
On Thu, 2004-12-09 at 10:07, Daniel Cortes wrote:
 I have the latest version of LIMO.
 It is running in Tomcat, and I can't add any index; it also doesn't load
 the index that I created earlier from the console (java
 org.apache.lucene.demo.IndexFiles ...).
 That is the reason I asked about the structure of limo.properties:
 the file doesn't exist, and with it I could force the location of the
 index files.
 Thanks for your time.

Ah, this probably means that LIMO cannot write to this location. If you
give the user you are running Tomcat as permission to write files to
your webapps/limo.war directory (or whatever it's called, I don't
actually use Tomcat), it should work.

If you don't want to do that for security reasons, simply create the
file and put it there yourself. It should be at the same level as the
index.jsp file.

Regards,
Luke Francl
LIMO developer





Re: partial updating of lucene

2004-12-09 Thread Luke Francl
On Thu, 2004-12-09 at 09:00, Erik Hatcher wrote:

 Have a look at the tool Luke (Google for "luke lucene" :) and see how 
 it does its Reconstruct and Edit facility.  It is possible, though 
 potentially lossy, to reconstruct a document and add it again.

Or look at LIMO's implementation of that feature, which to my eyes is a
little easier to read (of course that's probably because I wrote it...
;):

http://cvs.sourceforge.net/viewcvs.py/limo/limo/src/net/sourceforge/limo/LimoUtils.java?rev=1.6&view=markup

(check out LimoUtils.reconstructDocument())

However, if you're doing analysis on your text to remove stopwords and
stuff like that, this WILL be lossy. I consider it more of an aid for
debugging than a way to re-index documents, though I suppose it would
work for that as well. However, I believe the process would be highly
resource intensive so I wouldn't recommend it.

The better solution is to add a stored keyword field that stores the
location of your document, and then re-index it from the source.

Regards,
Luke Francl





Re: Document-Map, Hits-List

2004-12-01 Thread Luke Francl
On Wed, 2004-12-01 at 10:27, Otis Gospodnetic wrote:

 This is very similar to what I do - I create a List of Maps from Hits
 and its Documents.  So I think this change may be handy, if doable (I
 didn't look into changing the two Lucene classes, actually).


How do you avoid the problem Eric just mentioned, iterating through all
the Hits at once to populate this data structure?

I do a similar thing, creating a List of asset references from a field
in each Lucene Document in my Hits list (actual data for display
retrieved from a separate datastore). I was not aware of any performance
problems from doing this, but now I am wondering about the implications.

Thanks,
Luke


Re: modifying existing index

2004-11-23 Thread Luke Francl
On Tue, 2004-11-23 at 13:59, Santosh wrote:
 I am using Lucene for indexing. When I create the index, the documents
 are added, but when I modify a single existing document and re-index it,
 it is treated as a new document and added one more time, so I am
 getting the same document twice in the results.
 To overcome this I am deleting the existing index and recreating the
 whole index, but is it possible to index the modified document again and
 overwrite the existing document without deleting and recreating? Can I
 do this? If so, how?

You do not need to recreate the whole index. Just mark the document as
deleted using the IndexReader and then add it again with the
IndexWriter. Remember to close your IndexReader and IndexWriter after
doing this.

The deleted document will be removed the next time you optimize your
index.
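
A sketch of the sequence (Lucene 1.x API; "id" is assumed to be a
keyword field that uniquely identifies the document):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    // 1. Mark the old version as deleted.
    IndexReader reader = IndexReader.open("/path/to/index");
    reader.delete(new Term("id", docId));
    reader.close();

    // 2. Add the new version (false = don't create a fresh index).
    IndexWriter writer =
        new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
    writer.addDocument(updatedDoc);
    writer.close();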

Luke Francl





Re: Limo 0.5

2004-11-22 Thread Luke Francl
On Mon, 2004-11-22 at 02:27, Chandrashekhar wrote:
 Hi,
 
 With LIMO 0.5, can I find out whether a certain word from some document
 is indexed or not?

This feature doesn't exist as such.

You could search for it and if results come up, then the word is in the
documents it returns.

I'll put enumerating the terms in an index on my list of features to add.

Regards,
Luke Francl





LIMO 0.5 released

2004-11-21 Thread Luke Francl
I am pleased to announce that version 0.5 of LIMO, the Lucene Index Monitor, 
has been released.

LIMO is a web application that allows you to browse your Lucene indexes 
remotely. It is an ideal companion for Lucene applications that run in a 
servlet container.

The 0.5 release adds some cool new features such as:

* More index summary statistics, including index version number, deletion 
status, number of documents, number of fields, number of indexed fields, and 
number of unindexed fields.
* Querying the index.
* Display of expanded wildcard and range queries (using Query.rewrite) with a 
term count, so you can see how many terms a complex query expands to. This is 
particularly helpful if you are trying to track down an annoying TooManyClauses 
exception.
* Query timing to show how expensive queries are.
* Estimated query memory consumption (as given by the formula in this message: 
http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]msgId=1757461).
* Query result count.
* Query result explanation.
* Stored field reconstruction as in Luke.
* Highlighting of matching terms in search results and reconstructed documents 
using Mark Harwood's library.

LIMO requires Java 1.4 or later and a servlet container.

Download it from SourceForge: http://sourceforge.net/projects/limo/ 

LIMO is still ready to go out of the box (er, war file). Just edit the web.xml 
to point LIMO to your indexes.

Thanks to Julien Nioche for starting a great and very useful project and 
letting me join it; and to Andrzej Bialecki for Luke from which I appropriated 
several ideas and his GrowableStringArray class. If you are interested in 
getting involved, LIMO is now available in SourceForge CVS.

Regards,
Luke Francl


RE: disadvantages

2004-11-21 Thread Luke Francl
Well, that really depends on how big your index is and what they search for, 
now doesn't it? ;)


-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Sun 11/21/2004 2:52 PM
To: Lucene Users List
Subject: Re: disadvantages
 
On Nov 21, 2004, at 12:00 PM, Miguel Angel wrote:
 What are disadvantages the Lucene??

The users of your system won't have time to get coffee when running 
searches.

Erik



Re: Too many files exception

2004-11-18 Thread Luke Francl
On Thu, 2004-11-18 at 07:09, Neelam Bhatnagar wrote:
 Hello all, 
  
 We have been using Lucene 3.1 version with Tomcat 4.0 and jdk1.4. 
 It seems that sometimes we see a Too many files open exception which
 completely garbles the whole index and whole search functionality
 crashes on the web site. It has also been known to crash the complete
 JSP container of tomcat. 

(I'm assuming you meant Lucene 1.3)

This exception happens when your process has too many file handles open;
the limit differs by operating system.

With Lucene, this is caused by having a number of IndexReaders open.
Each IndexReader will open each file in your index. If you do not close
your IndexReaders, this exception can happen, especially if you have a
lot of heap and the IndexReaders are not getting garbage collected.

My guess is that you are creating a new IndexSearcher for each search
request and then not closing it after the search is complete. 

Lucene 1.3 added a feature called the compound file format that greatly
alleviates this problem by reducing the number of files required in an
index. You can enable it with IndexWriter.setUseCompoundFile(true):

http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html#getUseCompoundFile()

Combined with closing your IndexReaders, this should fix the problem.
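
In code, both fixes together (a sketch; the names are placeholders):

    // When writing: one compound (.cfs) file per segment instead of
    // many separate files.
    IndexWriter writer = new IndexWriter(indexPath, analyzer, false);
    writer.setUseCompoundFile(true);
    // ... add documents ...
    writer.close();

    // When searching: always close the searcher (and thus its reader).
    IndexSearcher searcher = new IndexSearcher(indexPath);
    try {
        Hits hits = searcher.search(query);
        // ... render results ...
    } finally {
        searcher.close();
    }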

Regards,
Luke Francl





Re: _4c.fnm missing

2004-11-16 Thread Luke Francl
On Tue, 2004-11-16 at 14:57, Luke Shannon wrote:
 This is the latest error I have received:
 
 IndexReader out of date and no longer valid for delete, undelete, or setNorm
 operations

What you need to do is check the version number of the index to
determine if you need to open a new IndexReader for deletes. Remember,
this must be synchronized with the same lock you are using to control
write access to the index.

Luke





Re: BooleanQuery - TooManyClauses Issue

2004-11-16 Thread Luke Francl
On Tue, 2004-11-16 at 16:32, Paul Elschot wrote:

 Once you approach 1000 days, you'll get the same problem again,
 so you might want to use a filter for the dates.
  See DateFilter and the archives on YYYYMMDD.

Can anyone point to a good example of how to use the DateFilter?

Thanks,
Luke





RE: Index File

2004-11-15 Thread Luke Francl
On Fri, 2004-11-12 at 19:07, Richard Greenane wrote:
 You might want to look at LUKE @ http://www.getopt.org/luke/
 A great tool for checking the index to make sure that everything is
 there

There is also a web-based tool that you can run in your servlet
container called LIMO. I've added some query features to it in CVS,
which you can check out from Sourceforge:
http://sourceforge.net/projects/limo

But I will second what Otis said: you must (or rather your colleague
must) check to see if the index has been updated before a search (use
IndexReader.getCurrentVersion), and if it is, close the IndexSearcher
and create a new one.
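
The pattern, inside the class that caches the searcher, is roughly this
(a sketch that ignores searchers still in use by other threads):

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.IndexSearcher;

    private long indexVersion = -1;
    private IndexSearcher searcher;

    synchronized IndexSearcher getSearcher(String indexPath) throws IOException {
        long current = IndexReader.getCurrentVersion(indexPath);
        if (searcher == null || current != indexVersion) {
            if (searcher != null) searcher.close();
            searcher = new IndexSearcher(indexPath);
            indexVersion = current;
        }
        return searcher;
    }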

Luke





Re: Index File

2004-11-15 Thread Luke Francl
On Mon, 2004-11-15 at 09:52, Luke Shannon wrote:
 Once this was modified to create a new IndexSearcher for every search
 request, all my problems went away.

Be careful with this. You could conceivably run out of file handles.
This problem got a lot better in Lucene 1.3 with the compound file
format, but it can still happen if you have a lot of heap and aren't
garbage collecting very often. So close the old searcher when you're
done with it.

Also, creating a new IndexSearcher only when the index has been modified
will give you a performance boost because you do not have to open the
index with every search. 

Luke Francl





Re: Lucene : avoiding locking (incremental indexing)

2004-11-15 Thread Luke Francl
This is how I implemented incremental indexing. If anyone sees anything
wrong, please let me know.

Our motivation is similar to John Eichel's. We have a digital asset
management system and when users update, delete or create a new asset,
they need to see their results immediately.

The most important thing to know about incremental indexing is that
multiple threads cannot share the same IndexWriter, and that only one
IndexWriter can be open on an index at a time.

Therefore, what I did was control access to the IndexWriter through a
singleton wrapper class that synchronizes access to the IndexWriter and
IndexReader (for deletes). After finishing writing to the index, you
must close the IndexWriter to flush the changes to the index.

If you do this you will be fine.

However, opening and closing the index takes time so we had to look for
some ways to speed up the indexing.

The most obvious thing is that you should do as much work as possible
outside of the synchronized block. For example, in my application, the
creation of Lucene Document objects is not synchronized. Only the part
of the code between opening the IndexWriter and closing it needs to be
synchronized.

The other easy thing I did to improve performance was batch changes in a
transaction together for indexing. If a user changes 50 assets, that
will all be indexed using one Lucene IndexWriter.

So far, we haven't had to explore further performance enhancements, but
if we do the next thing I will do is create a thread that gathers assets
that need to be indexed and performs a batch job every five minutes or
so.
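
Stripped down, the wrapper looks like this (a hypothetical sketch, not
our actual class):

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.List;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;

    public final class IndexManager {
        private static final IndexManager INSTANCE = new IndexManager();
        private static final Object WRITE_LOCK = new Object();

        private IndexManager() {}

        public static IndexManager getInstance() { return INSTANCE; }

        // docs were built outside the lock; only the write is serialized.
        public void index(List docs) throws IOException {
            synchronized (WRITE_LOCK) {
                IndexWriter writer = new IndexWriter(
                    "/path/to/index", new StandardAnalyzer(), false);
                for (Iterator i = docs.iterator(); i.hasNext();) {
                    writer.addDocument((Document) i.next());
                }
                writer.close(); // flushes the changes
            }
        }
    }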

Hope this is helpful,
Luke





Re: Lucene : avoiding locking

2004-11-12 Thread Luke Francl
Luke,

I also integrated Lucene into a content management application with
incremental updates and ran into the same problem you did.

You need to make sure only one process (which means, no multiple copies
of the application writing to the index simultaneously) or thread ever
writes to the index. That includes deletes as in your code below, so
make sure that is synchronized, too.

Also, you will find that opening and closing the index for writing is
very costly, especially on a large index, so it pays to batch up all
changes in a transaction (inserts and deletes) together in one go at the
Lucene index. If this still isn't enough, you can batch up 5 minutes
worth of changes and apply them at once. We haven't got to that point
yet.

I am curious, though, how many people on this list are using Lucene in
the incremental update case. Most examples I've seen assume batch
indexing.

Regards,

Luke Francl



On Thu, 2004-11-11 at 18:33, Luke Shannon wrote:
 Synchronizing the method didn't seem to help. The lock is being detected
 right here in the code:
 
  while (uidIter.term() != null
         && uidIter.term().field() == "uid"
         && uidIter.term().text().compareTo(uid) < 0) {
      // delete stale docs
      if (deleting) {
          reader.delete(uidIter.term());
      }
      uidIter.next();
  }
 
 This runs fine on my own site so I am confused. For now I think I am going
 to remove the deleting of stale files etc and just rebuild the index each
 time to see what happens.
 
 - Original Message - 
 From: [EMAIL PROTECTED]
 To: Lucene Users List [EMAIL PROTECTED]
 Sent: Thursday, November 11, 2004 6:56 PM
 Subject: Re: Lucene : avoiding locking
 
 
  I'm working on a similar project...
  Make sure that only one call to the index method is occurring at
  a time.  Synchronizing that method should do it.
 
  --- Luke Shannon [EMAIL PROTECTED] wrote:
 
   Hi All;

   I have hit a snag in my Lucene integration and don't know what to do.

   My company has a content management product. Each time someone
   changes the directory structure or a file within it, that portion of
   the site needs to be re-indexed so the changes are reflected in
   future searches (indexing must happen during run time).

   I have written an Indexer class with a static Index() method. The
   idea is to call the method every time something changes and the
   index needs to be re-examined. I am hoping the logic put in by Doug
   Cutting surrounding the UID will make indexing efficient enough to
   be called so frequently.

   This class works great when I tested it on my own little site (I
   have about 2000 files). But when I drop the functionality into the
   QA environment I get a locking error.

   I can't access the stack trace; all I can get at is a log file the
   application writes to. Here is the section my class wrote. It was
   right in the middle of indexing and bang, lock issue.

   I don't know if the problem is in my code or something in the
   existing application.

   Error Message:
   ENTER|SearchEventProcessor.visit(ContentNodeDeleteEvent)
   |INFO|INDEXING INFO: Start Indexing new content.
   |INFO|INDEXING INFO: Index Folder Did Not Exist. Start Creation Of New Index
   |INFO|INDEXING INFO: Beginnging Incremental update comparisions
   |INFO|INDEXING INFO: Beginnging Incremental update comparisions
   |INFO|INDEXING INFO: Beginnging Incremental update comparisions
   |INFO|INDEXING INFO: Beginnging Incremental update comparisions
   |INFO|INDEXING INFO: Beginnging Incremental update comparisions
   |INFO|INDEXING INFO: Beginnging Incremental update comparisions
   |INFO|INDEXING INFO: Beginnging Incremental update comparisions
   |INFO|INDEXING INFO: Beginnging Incremental update comparisions
   |INFO|INDEXING INFO: Beginnging Incremental update comparisions
   |INFO|INDEXING INFO: Beginnging Incremental update comparisions
   |INFO|INDEXING INFO: Beginnging Incremental update comparisions
   |INFO|INDEXING INFO: Beginnging Incremental update comparisions
   |INFO|INDEXING INFO: Beginnging Incremental update comparisions
   |INFO|INDEXING INFO: Beginnging Incremental update comparisions
   |INFO|INDEXING INFO: Beginnging Incremental update comparisions
   |INFO|INDEXING INFO: Beginnging Incremental update comparisions
   |INFO|INDEXING INFO: Beginnging Incremental update comparisions
   |INFO|INDEXING ERROR: Unable to index new content Lock obtain timed out: Lock@/usr/tomcat/jakarta-tomcat-5.0.19/temp/lucene-398fbd170a5457d05e2f4d43210f7fe8-write.lock
   |ENTER|UpdateCacheEventProcessor.visit(ContentNodeDeleteEvent)

   Here is my code. You will recognize it pretty much as the IndexHTML
   class from the Lucene demo written by Doug Cutting. I have put a ton
   of comments in an attempt to understand what is going on.

   Any help would

Re: Lucene : avoiding locking

2004-11-12 Thread Luke Francl
On Fri, 2004-11-12 at 09:51, Luke Shannon wrote:
 Hi Luke;
 
 Currently I am experimenting with checking if the index is lock using
 IndexReader.locked before creating a writer. If this turns out to be the
 case I was thinking of just unlocking the file.
 
 Do you think this is a good strategy?

No, because if the index is locked, that means another thread or process
is writing to it.

If you're getting spurious locks, stop your application and clean out
the /tmp directory (you should see files named *lucene* -- these are
the lock files).

Luke





Re: Bug in the BooleanQuery optimizer? ..TooManyClauses

2004-11-12 Thread Luke Francl
On Thu, 2004-11-11 at 14:48, Daniel Naber wrote:
 On Thursday 11 November 2004 20:57, Sanyi wrote:
 
  What I'm saying is that there is no reason for the optimizer to expand
  wild* to more than 1024 variations
 
 That's the point: there is no query optimizer in Lucene.

Would it be possible to write one? I would be very interested in this
feature. 

I poked around in the index and search packages today to see if it could
be done. I think it would take a big change in the Query.rewrite and
related code in the IndexReaders to make the results of the required and
prohibited parts of the query available. 

Again, I don't know if that's even possible. But it would be a great
feature.

Luke





Re: Bug in the BooleanQuery optimizer? ..TooManyClauses

2004-11-12 Thread Luke Francl
On Fri, 2004-11-12 at 14:52, Daniel Naber wrote:

 There are two different issues: first, reorder the query so that those 
 terms with fewer matches appear first, because as soon as the first term 
 with 0 matches occurs, the search stops. There is probably a 
 not-so-difficult implementation for that, but I guess it will add more 
 overhead than it saves.

It could be done only with searches that have expansions (RangeQuery,
WildcardQuery, etc) to prevent unnecessary work...

 The other thing is that prefix queries get expanded first, then the search 
 happens. And that TooManyQueries exception happens when expanding the 
 query, not during search. I'm not sure, but I think that's difficult to 
 change, at least in a clean way.

Expansion (and thus TooManyClauses) happens during Query.weight(), which
is called right before the search. Maybe after the query is rewritten,
it could be tested with instanceof BooleanQuery to see whether
BooleanQuery.getClauses().length > BooleanQuery.maxClauseCount?
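
In code, that check might look like this (a sketch; it assumes a
getMaxClauseCount() accessor and an open IndexReader named reader):

    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;

    Query rewritten = query.rewrite(reader);
    if (rewritten instanceof BooleanQuery) {
        BooleanClause[] clauses = ((BooleanQuery) rewritten).getClauses();
        if (clauses.length > BooleanQuery.getMaxClauseCount()) {
            // bail out, or simplify the query, before searching
        }
    }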






What is the difference between these searches?

2004-11-09 Thread Luke Francl
Hi,

I've implemented a converter to translate our system's internal Query
objects to Lucene's Query model.

I recently realized that my implementation of OR NOT was not working
as I would expect and I was wondering if anyone on this list could give
me some advice.

I am converting a query that means "foo or not bar" into the following:

+item_type:xyz +(field_name:foo -field_name:bar)

This returns only Documents where field_name contains foo. I would
expect it to return all the Documents where field_name contains foo or
field_name doesn't contain bar.

Fiddling around with the Lucene Index Toolbox, I think that this query
does what I want:

+item_type:xyz field_name:foo -field_name:bar

Can someone explain to me why these queries return different results?

Thanks,
Luke Francl





Re: What is the difference between these searches?

2004-11-09 Thread Luke Francl
On Tue, 2004-11-09 at 15:48, Erik Hatcher wrote:

 This last query has a required clause, which is what BooleanQuery 
 requires when there is a NOT clause.  You're getting what you want here 
 because you've got an item_type:xyz clause as required.  In your first 
 example, you're requiring field_name:foo, whereas in this one it is not 
 mandatory.

So, essentially, my query:

+item_type:xyz +(field_name:foo -field_name:bar)

Gets translated to:

+item_type:xyz +field_name:foo -field_name:bar

Whereas the more lenient one does not require field_name:foo and returns
what I expect.

Is that right?

Now, to decide whether to try to make this work the way I thought it
would, or just document that it doesn't. ;)





Re: What is the difference between these searches?

2004-11-09 Thread Luke Francl
On Tue, 2004-11-09 at 16:00, Paul Elschot wrote:

 Lucene has no provision for matching by being prohibited only. This can
 be achieved by indexing something for each document that can be
 used in queries to match always, combined with something prohibited
 in a query.
 But doing this is bad for performance when querying larger numbers of docs.

I'm familiar with Lucene's restrictions on prohibited queries, and I
have a required clause for a field that will always be part of the query
(it's not a nonsense value, it's the item type of the object in a CMS). 

My problem is that I have been considering the whole query object that
I've generated. Every BooleanQuery that's a part of my finished query
must also have a required clause if it has a prohibited clause.

I'm thinking of refactoring my code so that instead of joining together
Query objects into a large BooleanQuery, it passes around BooleanClauses
and assembles them into a single BooleanQuery.
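
That is, instead of nesting one BooleanQuery inside another, collect the
clauses and add them at a single level (a sketch of the intended result,
using the add(query, required, prohibited) form):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    BooleanQuery query = new BooleanQuery();
    query.add(new TermQuery(new Term("item_type", "xyz")), true, false);   // required
    query.add(new TermQuery(new Term("field_name", "foo")), false, false); // optional
    query.add(new TermQuery(new Term("field_name", "bar")), false, true);  // prohibited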

Thanks for your help,
Luke





Re: Thread safety of QueryParser

2003-08-26 Thread Luke Francl
Thank you for the update, Doug.

On Tue, 2003-08-26 at 11:57, Doug Cutting wrote:

 This method constructs a new query parser each time it is called, so it 
 is thread safe.

Perhaps the JGuru FAQ should be updated...

Luke





Thread safety of QueryParser

2003-08-25 Thread Luke Francl
[Note: I sent this email before I received my subscription confirmation 
message and I have not seen it in the archives yet. If you received this 
message twice, my apologies. -- Luke]

According to the jGuru FAQ, QueryParser is not thread safe:

http://www.jguru.com/faq/view.jsp?EID=492389

However, this information is several years old. Is this still true?

The answer to the question suggests using a new parser for every thread,
but the QueryParser.parse(String query,String field,Analyzer analyzer) 
method is static, and I don't see any way to set the default field on an
instance of the QueryParser.

Is that what the f parameter of the QueryParser(String f, Analyzer a)
constructor is for?
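
Presumably the per-instance usage would then be something like this (a
guess at the API, pending an answer):

    QueryParser parser = new QueryParser("contents", analyzer);
    Query query = parser.parse("some user query"); // throws ParseException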

Thanks for your advice,

Luke Francl

