Re: Fast access to a random page of the search results.
Stanislav Jordanov wrote: startTs = System.currentTimeMillis(); dummyMethod(hits.doc(nHits - nHits)); stopTs = System.currentTimeMillis(); System.out.println("Last doc accessed in " + (stopTs - startTs) + " ms"); 'nHits - nHits' always equals zero, so you're actually timing the first document, not the last. The last document would be accessed with 'hits.doc(nHits - 1)'. Accessing the last document should not be much slower (or faster) than accessing the first. 200+ milliseconds to access a document does seem slow. Where is your index stored? On a local hard drive? Doug
Re: Fast access to a random page of the search results.
Daniel Naber wrote: After fixing this I can reproduce the problem with a local index that contains about 220,000 documents (700MB). Fetching the first document takes for example 30ms, fetching the last one takes 100ms. Of course I tested this with a query that returns many results (about 50,000). Actually it happens even with the default sorting, no need to sort by some specific field. In part this is due to the fact that Hits first searches for the top-scoring 100 documents. Then, if you ask for a hit after that, it must re-query. In part this is also due to the fact that maintaining a queue of the top 50k hits is more expensive than maintaining a queue of the top 100 hits, so the second query is slower. And in part this could be caused by other things, such as that the highest-ranking documents might tend to be cached and not require disk I/O. One could profile to determine which is the largest factor. Of these, only the first is really fixable: if you know you'll need hit 50k then you could tell this to Hits and have it perform only a single query. But the algorithmic cost of keeping the queue of the top 50k is the same as collecting all the hits and sorting them. So, in part, getting hits 49,990 through 50,000 is inherently slower than getting hits 0-10. We can minimize that, but not eliminate it. Doug
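For reference, the lower-level search API already allows the single-query approach: you can ask for the top N in one pass instead of paging through Hits. A minimal sketch, assuming an open IndexSearcher and a parsed Query are in scope (the depth of 50,000 mirrors the example above):

    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;

    // Collect the top 50,000 hits in a single query, rather than letting
    // Hits re-query as you page past its initial top-100 window.
    TopDocs top = searcher.search(query, null, 50000);
    ScoreDoc[] scoreDocs = top.scoreDocs;
    // The last "page" of ten results:
    for (int i = Math.max(0, scoreDocs.length - 10); i < scoreDocs.length; i++) {
      System.out.println(scoreDocs[i].doc + " scored " + scoreDocs[i].score);
    }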
Re: Best Practices for Distributing Lucene Indexing and Searching
Yonik Seeley wrote: 6. Index locally and synchronize changes periodically. This is an interesting idea and bears looking into. Lucene can combine multiple indexes into a single one, which can be written out somewhere else, and then distributed back to the search nodes to replace their existing index. This is a promising idea for handling a high update volume because it avoids all of the search nodes having to do the analysis phase. A clever way to do this is to take advantage of Lucene's index file structure. Indexes are directories of files. As the index changes through additions and deletions, most files in the index stay the same. So you can efficiently synchronize multiple copies of an index by only copying the files that change. The way I did this for Technorati was to: 1. On the index master, periodically checkpoint the index. Every minute or so the IndexWriter is closed and a 'cp -lr index index.DATE' command is executed from Java, where DATE is the current date and time. This efficiently makes a copy of the index when it's in a consistent state by constructing a tree of hard links. If Lucene re-writes any files (e.g., the segments file) a new inode is created and the copy is unchanged. 2. From a crontab on each search slave, periodically poll for new checkpoints. When a new index.DATE is found, use 'cp -lr index index.DATE' to prepare a copy, then use 'rsync -W --delete master:index.DATE index.DATE' to get the incremental index changes. Then atomically install the updated index with a symbolic link ('ln -fsn index.DATE index'). 3. In Java on the slave, re-open 'index' when its version changes. This is best done in a separate thread that periodically checks the index version. When it changes, the new version is opened, and a few typical queries are performed on it to pre-load Lucene's caches. Then, in a synchronized block, the Searcher variable used in production is updated. 4. In a crontab on the master, periodically remove the oldest checkpoint indexes. Technorati's Lucene index is updated this way every minute. A mergeFactor of 2 is used on the master in order to minimize the number of segments in production. The master has a hot spare. Doug
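Step 3 might look something like the following sketch, assuming Lucene 1.4's IndexReader.getCurrentVersion() (the index path, polling interval, and warm-up queries are illustrative assumptions, not the code Technorati ran):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    // Re-open the index when its version changes, warm the new searcher,
    // then swap it in under a lock.
    class SearcherManager extends Thread {
      private final String indexPath;     // the 'index' symlink
      private final Query[] warmQueries;  // a few typical queries
      private IndexSearcher current;
      private long version;

      SearcherManager(String indexPath, Query[] warmQueries) throws Exception {
        this.indexPath = indexPath;
        this.warmQueries = warmQueries;
        this.current = new IndexSearcher(indexPath);
        this.version = IndexReader.getCurrentVersion(indexPath);
      }

      public synchronized IndexSearcher getSearcher() { return current; }

      public void run() {
        while (true) {
          try {
            Thread.sleep(10000);  // poll every ten seconds
            long latest = IndexReader.getCurrentVersion(indexPath);
            if (latest != version) {
              IndexSearcher fresh = new IndexSearcher(indexPath);
              for (int i = 0; i < warmQueries.length; i++) {
                fresh.search(warmQueries[i]);  // pre-load caches
              }
              synchronized (this) {
                current = fresh;
                version = latest;
              }
              // The old searcher is closed by the garbage collector once
              // the last in-flight query on it completes.
            }
          } catch (Exception e) {
            // log and keep polling
          }
        }
      }
    }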
Re: 1.4.x TermInfosWriter.indexInterval not public static ?
Chris Hostetter wrote: 1) If making it mutable requires changes to other classes to propagate it, then why is it now an instance variable instead of a static? (Presumably making it an instance variable allows subclasses to override the value, but if other classes have internal expectations of the value, that doesn't seem safe.) It's an instance variable because it can vary from instance to instance. This value is specified when an index segment is written, and subsequently read from disk and used when reading that segment. It's an instance variable in both the writing and reading code. The thing that's lacking is a way to pass alternate values to the writing code. The reason that other classes are involved is that the reading and writing code are in non-public classes. We don't want to expose the implementation too much by making these public, but would rather expose these as getter/setter methods on the relevant public API. 2) Should it be configurable through a get/set method, or through a system property? (which rehashes the instance/global question) That's indeed the question. My guess is that a system property would probably be sufficient for most, but perhaps not for all. Similarly with a static setter/getter. But a getter/setter on IndexWriter would make everyone happy. 3) Is it important that a writer updating an existing index use the same value as the writer that initially created the index? If so, should there really be a preferredIndexInterval variable which is mutable, and a currentIndexInterval which is set to the value of the index currently being updated, such that preferredIndexInterval is used when making an index from scratch and currentIndexInterval is used when adding segments to an existing index? It's used whenever an index segment is created. Index segments are created when documents are added and when index segments are merged to form larger index segments. Merging happens frequently while indexing. Optimization merges all segments. The value can vary in each segment. The default value is probably good for all but folks with very large indexes, who may wish to increase the default somewhat. Also folks with smaller indexes and very high query volumes may wish to decrease the default. It's a classic time/memory tradeoff: higher values use less memory and make searches a bit slower; smaller values use more memory and make searches a bit faster. Unless there are objections I will add this as IndexWriter.setTermIndexInterval() and IndexWriter.getTermIndexInterval(). Both will be marked Expert. Further discussion should move to the lucene-dev list. Doug
Re: 1.4.x TermInfosWriter.indexInterval not public static ?
Kevin A. Burton wrote: What's the desired pattern of use for TermInfosWriter.indexInterval? There isn't one. It is not a part of the public API. It is an unsupported internal feature. Do I have to compile my own version of Lucene to change this? Yes. The old API was public static final, but this is neither public nor static. It was never public. It used to be static and final, but is now an instance variable. I'm wondering if we should just make this a value that can be set at runtime. Considering the memory savings for larger installs this can/will be important. The place to put getters/setters would be IndexWriter, since that's the public home of all other index parameters. Some changes to DocumentWriter and SegmentMerger would be required to pass this value through to TermInfosWriter from IndexWriter. Doug
Re: Possible to mix/match indexes with diff TermInfosWriter.INDEX_INTERVAL ??
Kevin A. Burton wrote: I finally had some time to take Doug's advice and reburn our indexes with a larger TermInfosWriter.INDEX_INTERVAL value. It looks like you're using a pre-1.4 version of Lucene. Since 1.4 this is no longer called TermInfosWriter.INDEX_INTERVAL, but rather TermInfosWriter.indexInterval. Is this setting incompatible with older indexes burned with the lower value? Prior to 1.4, yes. After 1.4, no. Doug
Re: Possible to mix/match indexes with diff TermInfosWriter.INDEX_INTERVAL ??
Kevin A. Burton wrote: Is this setting incompatible with older indexes burned with the lower value? Prior to 1.4, yes. After 1.4, no. What happens after 1.4? Can I take indexes burned with 256 (a greater value) in 1.3 and open them up correctly with 1.4? Not without hacking things. If your 1.3 indexes were generated with 256 then you can modify your version of Lucene 1.4+ to use 256 instead of 128 when reading a Lucene 1.3 format index (SegmentTermEnum.java:54 today). Prior to 1.4 this was a constant, hardwired into the index format. In 1.4 and later each index segment stores this value as a parameter. So once 1.4 has re-written your index you'll no longer need a modified version. Doug
Re: Javadoc error?
Mark Woon wrote: The javadoc for Field.setBoost() claims: The boost is multiplied by Document.getBoost() (http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/document/Document.html#getBoost%28%29) of the document containing this field. If a document has multiple fields with the same name, all such values are multiplied together. However, from what I can tell from IndexSearcher.explain(), multiple fields with the same name have their boost values added together. It might very well be that I'm misinterpreting what I'm seeing from explain(), but if I'm not, then either the javadoc is wrong or there's a bug somewhere... Does anyone know which way it's actually supposed to work? Boosts for multiple fields with the same name in a document are multiplied together at index time to form the boost for that field of that document. At search time, if multiple query terms from the same field match the same document, then that document's field boost is multiplied into the score for both terms, and these scores are then added. If boost(field,doc) is the boost, and raw(term,doc) is the raw, unboosted score (I'm simplifying things), then the score for a two-term query is something like: boosted(t1,t2,d) = boost(t1.field,d)*raw(t1,d) + boost(t2.field,d)*raw(t2,d) which, when t1 and t2 are in the same field, is equivalent to: boosted(t1,t2,d) = boost(field,d)*(raw(t1,d) + raw(t2,d)) The explain() feature prints things in the first form, where the boosts appear in separate components of a sum. Does that help? Doug
Re: Iterate through all the document ids in the index?
William Lee wrote: Is there a simple and fast way to get a list of document IDs through the Lucene index? I can use a loop to iterate from 0 to IndexReader.maxDoc and check whether the document id is valid through IndexReader.document(i), but this would imply that I have to retrieve the document's fields. Use IndexReader.isDeleted() to check if each id is valid. This is quite fast. Doug
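A minimal sketch of that loop (the index path is a placeholder):

    import org.apache.lucene.index.IndexReader;

    // Iterate over all document ids without loading any stored fields.
    IndexReader reader = IndexReader.open("/path/to/index");
    int maxDoc = reader.maxDoc();
    for (int id = 0; id < maxDoc; id++) {
      if (!reader.isDeleted(id)) {
        // 'id' is a live document id
      }
    }
    reader.close();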
Re: Opening up one large index takes 940M or memory?
Kevin A. Burton wrote: 1. Do I have to do this with a NEW directory? Our nightly index merger uses an existing target index, which I assume will re-use the same settings as before? I did this last night and it still seems to use the same amount of memory. Above you assert that I should use a new empty directory and I'll try that tonight. You need to re-write the entire index using a modified TermInfosWriter.java. Optimize rewrites the entire index but is destructive. Merging into a new empty directory is a non-destructive way to do this. 2. This isn't destructive, is it? I mean, I'll be able to move BACK to a TermInfosWriter.indexInterval of 128, right? Yes, you can go back if you re-optimize or re-merge again. Also, there's no need to CC my personal email address. Doug
Re: new segment for each document
Daniel Naber wrote: On Thursday 10 February 2005 22:27, Ravi wrote: I tried setting the minMergeFactor on the writer to one. But it did not work. I think there's an off-by-one bug, so two is the smallest value that works as expected. You can simply create a new IndexWriter for each add and then close it, as sketched below. IndexWriter is pretty lightweight, so this shouldn't have too much overhead. Doug
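A sketch of that one-writer-per-add pattern (directory, analyzer and doc are assumed to be in scope):

    import org.apache.lucene.index.IndexWriter;

    // Open a writer against the existing index, add one document, and
    // close it, so each add is flushed as its own segment.
    IndexWriter writer = new IndexWriter(directory, analyzer, false);
    writer.addDocument(doc);
    writer.close();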
Re: Reconstruct segments file?
Ian Soboroff wrote: Speaking of Counter, I have a dumb question. If the segments are named using an integer counter which is incremented, what is the point in converting that counter into a string for the segment filename? Why not just name the segments e.g. 1.frq, etc.? The names are prefixed with an underscore, since it turns out that some filesystems (DOS?) have trouble with certain all-digit names. Other than that, they are integers, just with a large radix. Doug
Re: Reconstruct segments file?
Ian Soboroff wrote: I've looked over the file formats web page, and poked at a known-good segments file from a separate, similar index using od(1) and such. I guess what I'm not sure how to do is to recover the SegSize from the segment I have. The SegSize should be the same as the length in bytes of any of the .f[0-9]+ files in the segment. If your segment is in compound format then you can use IndexReader.main() in the current SVN version to list the files and sizes in the .cfs file, including its contained .f[0-9]+ files. Doug
Re: Disk space used by optimize
Yura Smolsky wrote: There is a big difference when you use the compound index format versus multiple files. I have tested it on a big index (45 GB). When I used the compound format, optimize took three times more space, because the *.cfs needs to be unpacked. Now I use the non-compound file format. It needs about twice as much disk space. Perhaps we should add something to the javadocs noting this? Doug
Re: Opening up one large index takes 940M or memory?
Kevin A. Burton wrote: Is there any way to reduce this footprint? The index is fully optimized... I'm willing to take a performance hit if necessary. Is this documented anywhere? You can increase TermInfosWriter.indexInterval. You'll need to re-write the .tii file for this to take effect. The simplest way to do this is to use IndexWriter.addIndexes(), adding your index to a new, empty directory. This will of course take a while for a 60GB index... Doubling TermInfosWriter.indexInterval should halve the Term memory usage and double the time required to look up terms in the dictionary. With an index this large the latter is probably not an issue, since processing term frequency and proximity data probably overwhelmingly dominates search performance. Perhaps we should make this public by adding an IndexWriter method? Also, you can list the size of your .tii file by using the main() from CompoundFileReader. Doug
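A sketch of that re-write, assuming a Lucene build whose TermInfosWriter.indexInterval has already been changed (paths are placeholders):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    // Merge the existing index into a new, empty directory so that the
    // .tii file is regenerated with the modified interval.
    Directory oldDir = FSDirectory.getDirectory("/path/to/index", false);
    Directory newDir = FSDirectory.getDirectory("/path/to/new-index", true);
    IndexWriter writer = new IndexWriter(newDir, new StandardAnalyzer(), true);
    writer.addIndexes(new Directory[] { oldDir });
    writer.close();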
Re: Sort Performance Problems across large dataset
Peter Hollas wrote: Currently we can issue a simple search query and expect a response back in about 0.2 seconds (~3,000 results) with the Lucene index that we have built. Lucene gives a much more predictable and faster average query time than using standard fulltext indexing with MySQL. This however returns results in score order, and not alphabetically. To sort the result set into alphabetical order, we added the species names as a separate keyword field, and sorted using it whilst querying. This solution works fine, but is unacceptable since a query that returns thousands of results can take upwards of 30 seconds to sort them. Are you using a Lucene Sort? If you reuse the same IndexReader (or IndexSearcher) then perhaps the first query specifying a Sort will take 30 seconds (although that's much slower than I'd expect), but subsequent searches that sort on the same field should be nearly as fast as results sorted by score. Doug
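A sketch of that pattern (the field name "species" is taken from the question; the searcher must be kept open between queries so the sort cache is reused):

    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Sort;

    // Open once and reuse: the field cache backing the sort is built on
    // the first sorted search and shared by all later ones.
    IndexSearcher searcher = new IndexSearcher("/path/to/index");
    Sort byName = new Sort("species");
    Hits hits = searcher.search(query, byName);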
Re: ParallellMultiSearcher Vs. One big Index
Ryan Aslett wrote: What I found was that for queries with one term (First Name), the large index beat the multiple indexes hands down (280 queries per second vs. 170 Q/s). But for queries with multiple terms (Address), the multiple indexes beat out the large index (26 Q/s vs. 16 Q/s). Btw, I'm running these on a 2-processor box with 16GB of RAM. So what I'm trying to determine is whether there are some equations out there that can help me find the sweet spot for splitting my indexes. What appears to be the bottleneck, CPU or I/O? Is your test system multi-threaded? I.e., is it attempting to execute many queries in parallel? If you're CPU-bound then a single index should be fastest. Are you using compound format? If you're I/O-bound, the non-compound format may be somewhat faster, as it permits more parallel I/O. Is the index data on multiple drives? If you're I/O-bound then it should be faster to use multiple drives. To permit even more parallel I/O over multiple drives you might consider using a pool of IndexReaders. That way, with, e.g., striped data, each could be simultaneously reading different portions of the same file. Doug
Re: How to add a Lucene index to a jar file?
David Spencer wrote: Isn't ZipDirectory the thing to search for? I think it's actually URLDirectory: http://www.mail-archive.com/lucene-dev@jakarta.apache.org/msg02453.html Doug
Re: multi-threaded thru-put in lucene
John Wang wrote: 1 thread: 445 ms. 2 threads: 870 ms. 5 threads: 2200 ms. Pretty much the same numbers you'd get if you were running them sequentially. Any ideas? Am I doing something wrong? If you're performing compute-bound work on a single-processor machine then threading should give you no better performance than sequential execution, perhaps a bit worse. If you're performing I/O-bound work on a single-disk machine then threading should again provide no improvement. If the task is evenly compute- and I/O-bound then you could achieve at best a 2x speedup on a single-CPU system with a single disk. If you're compute-bound on an N-CPU system then threading should optimally be able to provide a factor-of-N speedup. Java's scheduling of compute-bound threads when no threads call Thread.sleep() can also be very unfair. Doug
Re: multi-threaded thru-put in lucene
John Wang wrote: Is the operation IndexSearcher.search I/O or CPU bound if I am doing 100's of searches on the same query? CPU bound. Doug
Re: 1.4.3 breaks 1.4.1 QueryParser functionality
Bill Janssen wrote: Sure, if I wanted to ship different code for each micro-release of Lucene (which, you might guess, I don't). That signature doesn't compile with 1.4.1. Bill, most folks bundle appropriate versions of required jars with their applications to avoid this sort of problem. How are you deploying things? Are you not bundling a compatible version of the Lucene jar with each release of your application? If not, why not? I'm not trying to be difficult, just trying to understand. Thanks, Doug
Re: CFS file and file formats
Steve Rajavuori wrote: 1) First of all, there are both CFS files and standard (non-compound) files in this directory, and all of them have recent update dates, so I assume they are all being used. My code never explicitly sets the compound file flag, so I don't know how this happened. This can happen if your application crashes while the index is being updated. In that case these files were never entered into the segments file and may be partially written. 2) Is there a way to force all files into compound mode? For example, if I set the compound setting, then call optimize, will that recreate everything in the CFS format? It should. Except, on Windows not all old CFS files will be deleted immediately; they may instead be listed in the 'deletable' file for a while. 3) There are several other large .CFS files in this directory that I think have somehow become detached from the index. They have recent update dates -- however, the last time I ran optimize these were not touched, and they are not being updated now. I know these segments have valid data, because now when I search I am missing large chunks of data -- which I assume is in these detached segments. So my thought is to edit the 'segments' file to make Lucene recognize these again -- but I need to know the correct segment size in order to do this. So how do I determine what the correct segment size should be? These could also be the result of crashes. In this case they may be partially written. The safest approach is to remove files not mentioned in the segments file and update the index with the missing documents. How does your application recover if it crashes during an update? Doug
Re: CFS file and file formats
Steve Rajavuori wrote: There are around 20 million documents in the orphaned segments, so it would take a very long time to update the index. Is there an unsafe way to edit the segments file to add these back? It seems like the missing piece of information I need to do this is the correct segment size -- where can I find that? Do the CFS and non-CFS segment names correspond? If so, then it probably crashed after the segment was complete, but perhaps before it was packed into a CFS file. So I'd trust the non-CFS stuff first. And it's easy to see the size of a non-CFS segment: it's just the number of bytes in each of the .f* files. Doug
Re: Word co-occurrences counts
Andrew Cunningham wrote: "computer dog"~50 looks like what I'm after -- now is there some way I can call this and pull out the number of total occurrences, not just the number of document hits? (Say, if computer and dog occur near each other several times in the same document.) You could use a custom Similarity implementation for this query, where tf() is the identity function, idf() returns 1.0, etc., so that the final score is the occurrence count. You'll need to divide by Similarity.decodeNorm(indexReader.norms(field)[doc]) at the end to get rid of the lengthNorm() and field boost (if any). Doug
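One plausible shape for that Similarity, as an untested sketch (the overrides flatten everything except raw term frequency):

    import org.apache.lucene.search.DefaultSimilarity;

    // Make the score a sum of raw term frequencies: tf is the identity
    // function, and idf, coord and queryNorm are all flattened to 1.
    class CountingSimilarity extends DefaultSimilarity {
      public float tf(float freq) { return freq; }
      public float idf(int docFreq, int numDocs) { return 1.0f; }
      public float coord(int overlap, int maxOverlap) { return 1.0f; }
      public float queryNorm(float sumOfSquaredWeights) { return 1.0f; }
    }

Install it with searcher.setSimilarity(new CountingSimilarity()) before running the query, then apply the decodeNorm() division described above.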
Re: Word co-occurrences counts
Doug Cutting wrote: You could use a custom Similarity implementation for this query, where tf() is the identity function, idf() returns 1.0, etc., so that the final score is the occurrence count. You'll need to divide by Similarity.decodeNorm(indexReader.norms(field)[doc]) at the end to get rid of the lengthNorm() and field boost (if any). Much simpler would be to build a SpanNearQuery, call getSpans(), then loop, counting how many times Spans.next() returns true. Doug
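A sketch of the span-counting approach (the field and terms follow the earlier example; the slop of 50 mirrors "computer dog"~50; reader is an open IndexReader):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.spans.SpanNearQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;
    import org.apache.lucene.search.spans.Spans;

    // Count every near-match, not just the number of matching documents.
    SpanQuery query = new SpanNearQuery(new SpanQuery[] {
        new SpanTermQuery(new Term("contents", "computer")),
        new SpanTermQuery(new Term("contents", "dog")) },
      50, false);  // slop of 50, order not required
    Spans spans = query.getSpans(reader);
    int occurrences = 0;
    while (spans.next()) {
      occurrences++;
    }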
Re: To Sort or not to Sort
Scott Smith wrote: 1. Simply use the built-in Lucene sort functionality, cache the hit list, and then page through the list. Advantage: looks pretty straightforward; I write less code. Disadvantage: for searches that return a large number of hits (several hundred to a few thousand hits is not uncommon), Lucene is sorting a lot of entries that don't really need to be sorted (because the user will never look at them), and sorting tends to be expensive. 2. The other solution uses a priority heap to collect the top N (or next N) entries. I still have to walk the entire hit list, but keeping entries in a priority heap means I can determine the N entries I need with a few comparisons and minimal sorting. I don't have to sort a bunch of entries whose order I don't care about. Additionally, I don't have to have all of the entries in memory at one time. The big disadvantage is that I have to write more code. However, it may be worth it if the performance difference is large enough. Lucene's built-in sorting code already performs the optimization you describe as (2). So don't bother re-inventing it! Doug
Re: A question about scoring function in Lucene
Chuck Williams wrote: I believe the biggest problem with Lucene's approach relative to the pure vector space model is that Lucene does not properly normalize. The pure vector space model implements a cosine in the strictly positive sector of the coordinate space. This is guaranteed intrinsically to be between 0 and 1, and produces scores that can be compared across distinct queries (i.e., 0.8 means something about the result quality independent of the query). I question whether such scores are more meaningful. Yes, such scores would be guaranteed to be between zero and one, but would 0.8 really be meaningful? I don't think so. Do you have pointers to research which demonstrates this? E.g., that when such a scoring method is used, thresholding by score is useful across queries? Doug
Re: A question about scoring function in Lucene
Otis Gospodnetic wrote: There is one case that I can think of where this 'constant' scoring would be useful, and I think Chuck already mentioned this 1-2 months ago. For instance, having such scores would allow one to create alert applications where queries run by some scheduler would trigger an alert whenever the score is above some threshold X. So that is where the absolute value of the score would be useful. Right, but the question is, would a single score threshold be effective for all queries, or would one need a separate score threshold for each query? My hunch is that the latter is better, regardless of the scoring algorithm. Also, just because Lucene's default scoring does not guarantee scores between zero and one does not necessarily mean that these scores are less meaningful. Doug
Re: A question about scoring function in Lucene
Chris Hostetter wrote: For example, using the current scoring equation, if I do a search for "Doug Cutting" and the results/scores I get back are... 1: 0.9 2: 0.3 3: 0.21 4: 0.21 5: 0.1 ...then there are at least two meaningful pieces of data I can glean: a) document #1 is significantly better than the other results b) documents #3 and #4 are equally relevant to "Doug Cutting" If I then do a search for "Chris Hostetter" and get back the following results/scores... 9: 0.9 8: 0.3 7: 0.21 6: 0.21 5: 0.1 ...then I can assume the same corresponding information is true about my new search term (#9 is significantly better, and #7/#8 are equally good). However, I *cannot* say either of the following: x) document #9 is as relevant for "Chris Hostetter" as document #1 is for "Doug Cutting" y) document #5 is equally relevant to both "Chris Hostetter" and "Doug Cutting" That's right. Thanks for the nice description of the issue. I think the OP is arguing that if the scoring algorithm were modified in the way they suggested, then you would be able to make statements x and y. And I am not convinced that, with the changes Chuck describes, one can be any more confident of x and y. Doug
Re: java.io.FileNotFoundException: ... (No such file or directory)
Justin Swanhart wrote: The indexes are located on an NFS mountpoint. Could this be the problem? Yes. Lucene's lock mechanism is designed to keep this from happening, but the sort of lock files that FSDirectory uses are known to be broken with NFS. Doug
Re: reoot site query results
In web search, link information helps greatly. (This was Google's big discovery.) There are lots more links that point to http://www.slashdot.org/ than to http://www.slashdot.org/xxx/yyy, and many (if not most) of these links contain the term slashdot, while links to http://www.slashdot.org/xxx/yyy are somewhat less likely to contain the term slashdot. As Erik hinted, Nutch uses this information. It keeps a database of links that point to each page, indexes their anchor text along with the page, and boosts highly linked pages more than lesser-linked pages. Doug. Chris Fraschetti wrote: My Lucene implementation works great; it's basically an index of many web crawls. The main thing my users complain about is that, say, a search for slashdot will return http://www.slashdot.org/some_dir/somepage.asp as the top result, because the factors I have scoring it rank it so... but obviously, in true search engine fashion, I would like http://www.slashdot.org/ to be the very top result... I've added a boost to queries that match the hostname field, which helped a little, but obviously that is not a proper solution. Does anyone out there in the search engine world have a good schema for determining root websites and applying a huge boost to them in one fashion or another? Mainly so it appears before any sub-pages? (Assuming the query is in reference to that site.) ...
Re: Recommended values for mergeFactor, minMergeDocs, maxMergeDocs
Chuck Williams wrote: I've got about 30k documents and have 3 indexing scenarios: 1. Full indexing and optimize 2. Incremental indexing and optimize 3. Parallel incremental indexing without optimize Search performance is critical. For both cases 1 and 2, I'd like the fastest possible indexing time. For case 3, I'd like minimal pauses and no noticeable degradation in search performance. Based on reading the code (including the javadoc comments), I'm thinking of values along these lines: mergeFactor: 1000 during full indexing and during optimize (for both cases 1 and 2); 10 during incremental indexing (cases 2 and 3). 1000 is too big a mergeFactor for any practical purpose. I don't see a point in using different mergeFactors in cases 1 and 2. If you're going to optimize before you search, then you want the fastest batch indexing mode. I would use something like 50 for both cases 1 and 2. For case 3, where unoptimized search performance is very important, I would use something smaller than 10. For Technorati's blog search, which incrementally maintains a Lucene index with millions of documents, I used a mergeFactor of 2 in order to maximize search performance. Indexing performance on a single CPU is still adequate to keep up with the rate of change of today's blogosphere. minMergeDocs: 1000 during full indexing, 10 during incremental indexing. I see no reason to lower this when indexing incrementally. 1000 is a good value for high-performance indexing when RAM is plentiful and documents are not too large. maxMergeDocs: Integer.MAX_VALUE during full indexing, 1000 during incremental indexing. 1000 seems low to me, as it will result in too many segments, slowing search. Here one should select the largest value that can be merged in the maximum time delay permitted in your application between a new document arriving and it appearing in search results. So how up-to-date must your index be? If it's okay for it to occasionally be a few minutes out of date, then you can probably safely increase this to at least tens or hundreds of thousands, perhaps even millions. When incrementally indexing, the most recently added segments stay cached in RAM by the filesystem. So, on a system with a gigabyte of RAM that's dedicated to incremental indexing, you might safely set maxMergeDocs to account for a few hundred megabytes of index without encountering slow, I/O-bound merges. Since mergeFactor is used in both addDocument() and optimize(), I'm thinking of using two different values in case 2: 10 during the incremental indexing, and then 1000 during the optimize. Is changing the value like this going to cause a problem? It should not cause problems to use different mergeFactors at different times. Doug
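To make the recommendations concrete, here is a sketch using the public fields Lucene 1.4 exposes on IndexWriter (the values echo the discussion above, not a universal recipe; directory and analyzer are assumed to be in scope):

    import org.apache.lucene.index.IndexWriter;

    // Cases 1 and 2: fast batch indexing, optimized before searching.
    IndexWriter batchWriter = new IndexWriter(directory, analyzer, true);
    batchWriter.mergeFactor = 50;     // fewer, larger merges
    batchWriter.minMergeDocs = 1000;  // buffer ~1000 docs in RAM
    // ... add documents ...
    batchWriter.optimize();
    batchWriter.close();

    // Case 3: incremental indexing, searched without optimizing.
    IndexWriter liveWriter = new IndexWriter(directory, analyzer, false);
    liveWriter.mergeFactor = 2;       // few segments => faster searches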
Re: Too many open files issue
John Wang wrote: In the Lucene code, I don't see where the reader specified when creating a field is closed. That holds on to the file. I am looking at DocumentWriter.invertDocument(). It is closed in a finally clause on line 170, when the TokenStream is closed. Doug
Re: Numeric Range Restrictions: Queries vs Filters
Hoss wrote: The attachment contains my RangeFilter, a unit test that demonstrates it, and a benchmarking unit test that does a side-by-side comparison with RangeQuery [6]. If developers feel that this class is useful, then by all means roll it into the code base. (90% of it is cut/pasted from DateFilter/RangeQuery anyway.) +1 DateFilter could be deprecated and replaced with the more generally and appropriately named RangeFilter. Should we also deprecate DateField, in preference for DateTools? Doug
Re: Backup strategies
Christoph Kiehl wrote: I'm curious about your strategy for backing up indexes based on FSDirectory. If I do a file-based copy I suspect I will get corrupted data because of concurrent write access. My current favorite is to create an empty index and use IndexWriter.addIndexes() to copy the current index state. But I'm not sure about the performance of this solution. How do you make your backups? A safe way to back up is to have your indexing process, when it knows the index is stable (e.g., just after calling IndexWriter.close()), make a checkpoint copy of the index by running a shell command like 'cp -lpr index index.YYYYMMDDHHmmSS'. This is very fast and requires little disk space, since it creates only a new directory of hard links. Then you can separately back this up and subsequently remove it. This is also a useful way to replicate indexes. On the master indexing server, periodically perform 'cp -lpr' as above. Then search slaves can use rsync to pull down the latest version of the index. If a very small mergeFactor is used (e.g., 2) then the index will have only a few segments, so that searches are fast. On the slave, periodically find the latest index.YYYYMMDDHHmmSS, use 'cp -lpr index index.YYYYMMDDHHmmSS' and 'rsync --delete master:index.YYYYMMDDHHmmSS index.YYYYMMDDHHmmSS' to efficiently get a local copy, and finally 'ln -fsn index.YYYYMMDDHHmmSS index' to publish the new version of the index. Doug
Re: document ID and performance
Yan Pujante wrote: I want to run a very fast search that simply returns the matching document id. Is there any way to associate the document id returned in the hit collector with the internal document ID stored in the index? Does anybody have an idea how to do that? Ideally you would want to be able to write something like this: document.add(Field.ID(documentID)); and then in the HitCollector API: collect(String documentID, float score), with the documentID being the one you stored (but which would be returned very efficiently). Have a look at: http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/FieldCache.html In your HitCollector, access an array, from the field cache, that maps Lucene ids to your ids. Doug
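A sketch of that collector (the stored field name "id" is a placeholder; reader, searcher and query are assumed to be in scope):

    import org.apache.lucene.search.FieldCache;
    import org.apache.lucene.search.HitCollector;

    // Build (once per reader) an array mapping Lucene doc ids to your
    // external ids, then index into it inside the collector.
    final String[] ids = FieldCache.DEFAULT.getStrings(reader, "id");
    searcher.search(query, new HitCollector() {
      public void collect(int doc, float score) {
        String documentID = ids[doc];  // your external id
        // ... process documentID and score ...
      }
    });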
Re: Search speed
Jeff Munson wrote: Single-word searches return pretty fast, but when I try phrases, searching seems to slow considerably. [ ... ] However, if I use this query, contents:"all parts including picture tube guaranteed", it returns hits in 2890 milliseconds. Other phrases take longer as well. You could use an analyzer that inserts bigrams for common terms. Nutch does this. So, if you declare that "all" and "including" are common terms, then this could be tokenized as the following tokens (position: tokens):
0: all, all.parts
1: parts, parts.including
2: including, including.picture
3: picture
4: tube
5: guaranteed
Two tokens at a position indicate that the second has a position increment of zero. Then your phrase search could be converted to: "all.parts parts.including including.picture picture tube guaranteed" which should be much faster, since it has replaced common terms with rare terms. This approach does make the index larger, and hence makes indexing somewhat slower. So you don't want to declare too many words as common, but a handful can make a big difference if they're used frequently in queries. Doug
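One way to build such an analyzer is a TokenFilter with one token of lookahead. This is a from-scratch sketch against the Lucene 1.4 TokenStream API, not the filter Nutch actually ships:

    import java.io.IOException;
    import java.util.Set;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    // Emit a "current.next" bigram token at the same position as the
    // current token whenever either word is in the common-words set.
    class CommonBigramFilter extends TokenFilter {
      private final Set common;  // the declared common words
      private Token buffered;    // lookahead token not yet emitted
      private Token pending;     // bigram waiting to be emitted

      CommonBigramFilter(TokenStream in, Set common) {
        super(in);
        this.common = common;
      }

      public Token next() throws IOException {
        if (pending != null) {  // emit a queued bigram
          Token bigram = pending;
          pending = null;
          return bigram;
        }
        Token current = (buffered != null) ? buffered : input.next();
        buffered = null;
        if (current == null) return null;
        Token lookahead = input.next();
        if (lookahead != null) {
          buffered = lookahead;  // emit it on a later call
          if (common.contains(current.termText())
              || common.contains(lookahead.termText())) {
            Token bigram = new Token(
                current.termText() + "." + lookahead.termText(),
                current.startOffset(), lookahead.endOffset());
            bigram.setPositionIncrement(0);  // overlap with 'current'
            pending = bigram;
          }
        }
        return current;
      }
    }

A query-time variant would additionally drop the common unigrams, yielding the rare-bigram phrase shown above.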
Re: sorting and score ordering
Paul Elschot wrote: Along with that, is there a simple way to assign a new scorer to the searcher, so I can use the same Lucene algorithm for my hits but tweak it a little to fit my needs? There is no one-to-one relationship between a searcher and a scorer. But you can use a different Similarity implementation with each Searcher, via Searcher.setSimilarity(). Doug
Re: Sort regeneration in multithreaded server
Stephen Halsey wrote: I was wondering if anyone could help with a problem (or should that be challenge?) I'm having using Sort in Lucene over a large number of records in a multi-threaded server program on a continually updated index. I am using lucene-1.4-rc3. A number of bugs in the sorting code have been fixed since that release. Can you please try with 1.4.2 and see if you still have the problem? Thanks. Doug
Re: locking problems
Aad Nales wrote: 1. Can I have one or multiple searchers open when I open a writer? 2. Can I have one or multiple readers open when I open a writer? Yes, with one caveat: if you've called the IndexReader methods delete(), undelete() or setNorm(), then you may not open an IndexWriter until you've closed that IndexReader instance. In general, only a single object may modify an index at once, but many may access it simultaneously in a read-only manner, including while it is modified. Indexes are modified by either an IndexWriter or by the IndexReader methods delete(), undelete() and setNorm(). Typically an application which modifies and searches simultaneously should keep the following open: 1. A single IndexReader instance used for all searches, perhaps opened via an IndexSearcher. Periodically, as the index changes, this is discarded and replaced with a new instance. 2. Either: (a) an IndexReader to delete documents, or (b) an IndexWriter to add documents. So an updating thread might open (2a), delete old documents, and close it; then open (2b), add new documents, perhaps optimize, and close. At this point, when the index has been updated, (1) can be discarded and replaced with a new instance. Typically the old instance of (1) is not explicitly closed; rather, the garbage collector closes it when the last thread searching it completes. Doug
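A sketch of one such update cycle (the id field and documents are placeholders; directory and analyzer are assumed to be in scope):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    // (2a) Delete the old version of a document ...
    IndexReader deleter = IndexReader.open(directory);
    deleter.delete(new Term("id", "42"));
    deleter.close();

    // (2b) ... then add the new version.
    IndexWriter writer = new IndexWriter(directory, analyzer, false);
    writer.addDocument(updatedDoc);
    writer.close();

    // Finally, replace the searching IndexReader (1) with a fresh one.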
Re: multifield-boolean vs singlefield-enum query performance
Tea Yu wrote: For the following implementations: 1) storing boolean strings in fields X and Y separately 2) storing the same info in a field XY as 4 enumerated values: X, Y, B, N, meaning only X is true, only Y is true, both are true, or both are false. Is there a significant performance gain when we substitute X:T OR Y:T with XY:B, while a significant loss in X:T versus XY:X OR XY:B? Or are they negligible? As with most performance questions, it's best to try both and measure! It depends on the size of your index, the relative frequencies of X and Y, etc. Doug
Re: removing duplicate Documents from Hits
Timm, Andy (ETW) wrote: Hello, I've searched previous posts on this topic but couldn't find an answer. I want to query my index (which is a number of 'flattened' Oracle tables) for some criteria, then return Hits such that there are no Documents that duplicate a particular field. In the case where table A has a one-to-many relationship to table B, I get one Document for each pair (A1-B1, A1-B2, A1-B3, ...). My index needs to have each of these records, as 'B' is a searchable field in the index. However, after the query is executed, I want my resulting Hits to be unique on 'A'. I'm only returning the Oracle object ID, so once I've seen it once I don't need it again. It looks like some sort of custom Filter is in order. I'd suggest a HitCollector that uses a FieldCache of the A values to check for duplicates, and collects only the best document id for each value of A. This would use a bit of RAM, but be very fast. http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/HitCollector.html http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/FieldCache.html Doug
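A sketch of such a collector (the field name "A" and the map-based bookkeeping are illustrative; reader, searcher and query are assumed to be in scope):

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.search.FieldCache;
    import org.apache.lucene.search.HitCollector;

    // Keep only the best-scoring document id per value of field "A".
    final String[] aValues = FieldCache.DEFAULT.getStrings(reader, "A");
    final Map bestScores = new HashMap();  // A value -> Float score
    final Map bestDocs = new HashMap();    // A value -> Integer doc id
    searcher.search(query, new HitCollector() {
      public void collect(int doc, float score) {
        String key = aValues[doc];
        Float best = (Float) bestScores.get(key);
        if (best == null || score > best.floatValue()) {
          bestScores.put(key, new Float(score));
          bestDocs.put(key, new Integer(doc));
        }
      }
    });
    // bestDocs now holds one document id per unique value of A.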
new release: 1.4.2
There's a new release of Lucene, 1.4.2, which mostly fixes bugs in 1.4.1. Details are at http://jakarta.apache.org/lucene/. Doug
Re: problem with get/setBoost of document fields
Bastian Grimm [Eastbeam GmbH] wrote: That works... but I have to do this setNorm() for each document which has been indexed up to now, right? There are roughly 1 million docs in the index... I don't think it's a good idea to perform a search and do it for every doc (and every field of the doc...). Is there any possibility to do something like setNorm(alldocs, fieldX, 2.0f) -- a global boost for a named field for every doc? setNorm() is quite fast. Calling it 1M times will not take long. A last question: Lucene creates some .f[1-9] files after setNorm() has finished. Do these files remain in the folder all the time? I tried to optimize and so on but nothing happened. If you add or remove documents and optimize then these will go away. Doug
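The loop is short enough to sketch (the field name is a placeholder; note that setNorm() overwrites the stored norm, which combines the original boost with the length normalization factor):

    import org.apache.lucene.index.IndexReader;

    // Apply a uniform boost to one field of every live document.
    IndexReader reader = IndexReader.open(directory);
    int maxDoc = reader.maxDoc();
    for (int doc = 0; doc < maxDoc; doc++) {
      if (!reader.isDeleted(doc)) {
        reader.setNorm(doc, "fieldX", 2.0f);
      }
    }
    reader.close();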
Re: Shouldnt IndexWriter.flushRamSegments() be public? or at least protected?
Christian Rodriguez wrote: Now the problem I have is that I don't have a way to force a flush of the IndexWriter without closing it, and I need to do that before committing a transaction or I get random errors. Shouldn't that function be public, in case the user wants to force a flush at some point other than when the IndexWriter is closed? If not, I am forced to create a new IndexWriter and close it EVERY TIME I commit a transaction (which in my application is very often). Opening and closing IndexWriters should be a lightweight operation. Have you tried this and found it to be too slow? A flush() would have to do just about the same work. Doug
Re: problem with get/setBoost of document fields
You can change field boosts without re-indexing: http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexReader.html#setNorm(int,%20java.lang.String,%20byte) Doug. Bastian Grimm [Eastbeam GmbH] wrote: Thanks for your reply, Erik. So am I right that it's not possible to change the boost without reindexing all files? That's not good... Or is it OK to only change the boosts and optimize the index for the changes to take effect? If not, will I be able to boost those fields in the searcher? Thanks, Bastian. - The boost is not thrown away, but rather combined with the length normalization factor during indexing. So while your actual boost value is not stored directly in the index, it is taken into consideration for scoring appropriately.
Re: demo HTML parser question
[EMAIL PROTECTED] wrote: We were originally attempting to use the demo HTML parser (Lucene 1.2), but as you know, it's for a demo. I think it's threaded to optimize on time, to allow the calling thread to grab the title or top message even though it's not done parsing the entire HTML document. That's almost right. I originally wrote it that way to avoid having to ever buffer the entire text of the document. The document is indexed while it is parsed. But, as observed, this has lots of problems and was probably a bad idea. Could someone provide a patch that removes the multi-threading? We'd simply use a StringBuffer in HTMLParser.jj to collect the text. Calls to pipeOut.write() would be replaced with text.append(). Then have the HTMLParser's constructor parse the page before returning, rather than spawn a thread, and getReader() would return a StringReader. The public API of HTMLParser need not change at all and lots of complex threading code would be thrown away. Anyone interested in coding this? Doug
Re: Document contents split among different Fields
Greg Langmead wrote: Am I right in saying that the design of Token's support for highlighting really only supports having the entire document stored as one monolithic contents Field? No, I don't think so. Has anyone tackled indexing multiple content Fields before who could shed some light? Do you need highlights from all fields? If so, then you can use: TextFragment[] getBestTextFragments(TokenStream, ...); with a TokenStream for each field, then select the highest-scoring fragments across all fields. Would that work for you? Doug
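A sketch of that per-field loop with the sandbox Highlighter (the field names and fragment counts are illustrative; doc, query and analyzer are assumed to be in scope):

    import java.io.StringReader;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.search.highlight.Highlighter;
    import org.apache.lucene.search.highlight.QueryScorer;
    import org.apache.lucene.search.highlight.TextFragment;

    // Gather candidate fragments from each field, then keep the
    // highest-scoring ones across all fields.
    Highlighter highlighter = new Highlighter(new QueryScorer(query));
    String[] fields = { "title", "body" };
    for (int i = 0; i < fields.length; i++) {
      String text = doc.get(fields[i]);
      TokenStream tokens = analyzer.tokenStream(fields[i], new StringReader(text));
      TextFragment[] fragments = highlighter.getBestTextFragments(tokens, text, false, 3);
      // compare fragments[j].getScore() across fields and keep the best
    }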
Re: Running OutOfMemory while optimizing and searching
John Z wrote: We have indexes of around 1 million docs and around 25 searchable fields. We noticed that without any searches performed on the indexes, on startup, the memory taken up by the searcher is roughly 7 times the .tii file size. The .tii file is read into memory as per the code. Our .tii files are around 8-10 MB in size and our startup memory footprint is around 60-70 MB. Then when we start doing our searches, the memory goes up, depending on the fields we search on. We are noticing that if we start searching on new fields, the memory kind of goes up. Doug, your calculation below on what is taken up by the searcher, does it take into account the .tii file being read into memory, or am I not making any sense?
1 byte * (number of searchable fields in your index) * (number of docs in your index)
plus 1 KB * (number of terms in the query)
plus 1 KB * (number of phrase terms in the query)
You make perfect sense. The formula above does not include the .tii file; my mistake, I forgot that. By default, every 128th Term in the index is read into memory, to permit random access to terms. These are stored in the .tii file, compressed. So it is not surprising that they require 7x the size of the .tii file in memory. Doug
Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene
Andrzej Bialecki wrote: I was wondering about the way you build the n-gram queries. You basically don't care about their position in the input term. Originally I thought about using PhraseQuery with a slop -- however, after checking the source of PhraseQuery I realized that this probably wouldn't be that fast... You use BooleanQuery and start/end boosts instead, which may give similar results in the end but much cheaper. Sloppy PhraseQueries are slower than BooleanQueries, but not horribly slower. The problem is that they don't handle the case where phrase elements are missing altogether, while a BooleanQuery does. So what you really need is maybe a variation of a sloppy PhraseQuery that scores matches that do not contain all of the terms... Doug
Re: frequent terms - Re: combining open office spellchecker with Lucene
David Spencer wrote: [1] The user enters a query like: "recursize descent parser" [2] The search code parses this and sees that the 1st word is not a term in the index, but the next 2 are. So it leaves the last 2 terms (descent and parser) alone and suggests alternatives to recursize... thus if any term is in the index, regardless of frequency, it is left as-is. I guess you're saying that, if the user enters a term that appears in the index and thus is sort of spelled correctly (as it exists in some doc), then we use the heuristic that any sufficiently large doc collection will have tons of misspellings, so we assume that rare terms in the query might be misspelled (i.e. not what the user intended) and we suggest alternatives to these words too (in addition to the words in the query that are not in the index at all). Almost. If the user enters "a recursize purser", then: "a", which is in, say, 50% of the documents, is probably spelled correctly, and "recursize", which is in zero documents, is probably misspelled. But what about "purser"? If we run the spell check algorithm on "purser" and generate "parser", should we show it to the user? If "purser" occurs in 1% of documents and "parser" occurs in 5%, then we probably should, since "parser" is a more common word than "purser". But if "parser" only occurs in 1% of the documents and "purser" occurs in 5%, then we probably shouldn't bother suggesting "parser". If you wanted to get really fancy then you could check how frequently combinations of query terms occur, i.e., does "purser" or "parser" occur more frequently near "descent"? But that gets expensive. Doug
Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents
It sounds like the ThreadLocal in TermInfosReader is not getting correctly garbage collected when the TermInfosReader is collected. Researching a bit, this was a bug in JVMs prior to 1.4.2, so my guess is that you're running an older JVM. Is that right? I've attached a patch which should fix this. Please tell me if it works for you. Doug. Daniel Taurat wrote: Okay, that (1.4rc3) worked fine, too! Got only 257 SegmentTermEnums for 1900 objects. Now I will go for the final test on the production server with the 1.4rc3 version and about 40,000 objects. Daniel. Daniel Taurat wrote: Hi all, here is some update for you: I switched back to Lucene 1.3-final and now the number of SegmentTermEnum objects is controlled by gc again: it goes up to about 1000 and then it is down again to 254 after indexing my 1900 test objects. Stay tuned, I will try 1.4RC3 now, the last version before FieldCache was introduced... Daniel. Rupinder Singh Mazara wrote: Hi all, I had a similar problem. I have a database of documents with 24 fields and an average content of 7K; with 16M+ records I had to split the job into slabs of 1M each, merging the resulting indexes. Submissions to our job queue looked like: java -Xms100M -Xcompactexplicitgc -cp $CLASSPATH lucene.Indexer 22 and I still had OutOfMemory exceptions. The solution that I created was to create a temp directory after every 200K documents and merge them together; this was done for the first production run. Updates are now being handled incrementally. Exception in thread "main" java.lang.OutOfMemoryError at org.apache.lucene.store.RAMOutputStream.flushBuffer(RAMOutputStream.java(Compiled Code)) at org.apache.lucene.store.OutputStream.flush(OutputStream.java(Inlined Compiled Code)) at org.apache.lucene.store.OutputStream.writeByte(OutputStream.java(Inlined Compiled Code)) at org.apache.lucene.store.OutputStream.writeBytes(OutputStream.java(Compiled Code)) at org.apache.lucene.index.CompoundFileWriter.copyFile(CompoundFileWriter.java(Compiled Code)) at org.apache.lucene.index.CompoundFileWriter.close(CompoundFileWriter.java(Compiled Code)) at org.apache.lucene.index.SegmentMerger.createCompoundFile(SegmentMerger.java(Compiled Code)) at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java(Compiled Code)) at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java(Compiled Code)) at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:366) at lucene.Indexer.doIndex(CDBIndexer.java(Compiled Code)) at lucene.Indexer.main(CDBIndexer.java:168) -Original Message- From: Daniel Taurat [mailto:[EMAIL PROTECTED] Sent: 10 September 2004 14:42 To: Lucene Users List Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents Hi Pete, good hint, but we actually do have physical memory of 4GB on the system. But then: we have also experienced that the gc of the IBM JDK 1.3.1 that we use sometimes behaves strangely with too large a heap space anyway (the limit seems to be 1.2 GB). I can say that gc is not collecting these objects, since I forced gc runs when indexing every now and then (when parsing pdf-type objects, that is): no effect. Regards, Daniel. Pete Lewis wrote: Hi all, reading the thread with interest. There is another way I've come across out-of-memory errors when indexing large batches of documents: if you have your heap space settings too high, then you get swapping (which impacts performance), plus you never reach the trigger for garbage collection, hence you don't garbage collect and hence you run out of memory.
Can you check whether or not your garbage collection is being triggered? Paradoxically, if this is the case, reducing the heap space can both improve performance and get rid of the out-of-memory errors. Cheers, Pete Lewis. - Original Message - From: Daniel Taurat [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Friday, September 10, 2004 1:10 PM Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents Daniel Aber wrote: On Thursday 09 September 2004 19:47, Daniel Taurat wrote: I am facing an out-of-memory problem using Lucene 1.4.1. Could you try with a recent CVS version? There has been a fix about files not being deleted after 1.4.1. Not sure if that could cause the problems you're experiencing. Regards, Daniel. Well, it seems not to be files; it looks more like those SegmentTermEnum objects accumulating in memory. I've seen some discussion on these objects in the developer newsgroup that took place some time ago. I am afraid this is some kind of runaway caching I have to deal with. Maybe it is not correctly addressed in this newsgroup, after all... Anyway: any idea if there is an API command to re-init caches? Thanks, Daniel
Re: MultiFieldQueryParser seems broken... Fix attached.
Daniel Naber wrote: On Thursday 09 September 2004 18:52, Doug Cutting wrote: I have not been able to construct a two-word query that returns a page without both words in either the content, the title, the url or in a single anchor. Can you? Like this one? "konvens leitseite" Leitseite is only in the title of the first match (www.gldv.org), konvens is only in the body. Good job finding that! I guess I should fix Nutch's BasicQueryFilter. Thanks, Doug
Re: frequent terms - Re: combining open office spellchecker with Lucene
David Spencer wrote: Doug Cutting wrote: And one should not try correction at all for terms which occur in a large proportion of the collection. I keep thinking over this one and I don't understand it. If a user misspells a word and the "did you mean" spelling correction algorithm determines that a frequent term is a good suggestion, why not suggest it? The very fact that it's common could mean that it's more likely that the user wanted this word (well, the heuristic here is that users frequently search for frequent terms, which is probably wrong, but anyway..). I think you misunderstood me. What I meant to say was that if the term the user enters is very common then spell correction may be skipped. Very common words which are similar to the term the user entered should of course be shown. But if the user's term is very common one need not even attempt to find similarly-spelled words. Is that any better? Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: MultiFieldQueryParser seems broken... Fix attached.
Bill Janssen wrote: I'd think that if a user specified a query "cutting lucene", with an implicit AND and the default fields title and author, they'd expect to see a match in which both "cutting" and "lucene" appear. That is, (title:cutting OR author:cutting) AND (title:lucene OR author:lucene) Your proposal is certainly an improvement. It's interesting to note that in Nutch I implemented something different. There, a search for "cutting lucene" expands to something like: (+url:cutting^4.0 +url:lucene^4.0 +url:"cutting lucene"~2147483647^4.0) (+anchor:cutting^2.0 +anchor:lucene^2.0 +anchor:"cutting lucene"~4^2.0) (+content:cutting +content:lucene +content:"cutting lucene"~2147483647) So a page with "cutting" in the body and "lucene" in anchor text won't match: the body, anchor or url must contain all query terms. A single authority (content, url or anchor) must vouch for all terms. Note that Nutch also boosts matches where the terms are close together. Using ~2147483647 permits them to be anywhere in the document, but boosts more when they're closer and in-order. (The ~4 in anchor matches is to prohibit matches across different anchors. Each anchor is separated by a Token.positionIncrement() of 4.) But perhaps this is not a feature. Perhaps Nutch should instead expand this to: +(url:cutting^4.0 anchor:cutting^2.0 content:cutting) +(url:lucene^4.0 anchor:lucene^2.0 content:lucene) url:"cutting lucene"~2147483647^4.0 anchor:"cutting lucene"~4^2.0 content:"cutting lucene"~2147483647 That would, e.g., permit a match with only "lucene" in an anchor and "cutting" in the content, which the earlier formulation would not. Can anyone tell whether Google has this requirement? I have not been able to construct a two-word query that returns a page without both words in either the content, the title, the url or in a single anchor. Can you? If you're interested, the Nutch query expansion code in question is: http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/src/plugin/query-basic/src/java/net/nutch/searcher/basic/BasicQueryFilter.java?view=markup To play with it you can download Nutch and use the command: bin/nutch net.nutch.searcher.Query http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]msgId=1798116 Yes, the approach there is similar. I attempted to complete the solution and provide a working replacement for MultiFieldQueryParser. But, inspired by that message, couldn't MultiFieldQueryParser just be a subclass of QueryParser that overrides getFieldQuery()? Cheers, Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
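A getFieldQuery()-based parser along the lines Doug suggests might look like the following untested sketch. The getFieldQuery() signature shown is the Lucene 1.4 one (it has changed between releases), and the class name is made up:

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.queryParser.ParseException;
  import org.apache.lucene.queryParser.QueryParser;
  import org.apache.lucene.search.BooleanQuery;
  import org.apache.lucene.search.Query;

  public class MultiFieldParser extends QueryParser {
    private final String[] fields;

    public MultiFieldParser(String[] fields, Analyzer analyzer) {
      super(fields[0], analyzer);  // fields[0] doubles as the default field
      this.fields = fields;
    }

    protected Query getFieldQuery(String field, Analyzer analyzer,
                                  String queryText) throws ParseException {
      if (!field.equals(fields[0])) {   // explicitly qualified: pass through
        return super.getFieldQuery(field, analyzer, queryText);
      }
      BooleanQuery query = new BooleanQuery();
      for (int i = 0; i < fields.length; i++) {
        Query part = super.getFieldQuery(fields[i], analyzer, queryText);
        if (part != null) {
          query.add(part, false, false);  // SHOULD: any field may match
        }
      }
      return query;
    }
  }

Each bare term then expands to a disjunction over all default fields, and QueryParser's normal boolean logic applies across terms, giving Bill's (title:cutting OR author:cutting) AND (title:lucene OR author:lucene) form when the default operator is AND.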
Re: combining open office spellchecker with Lucene
Aad Nales wrote: Before I start reinventing wheels I would like to do a short check to see if anybody else has already tried this. A customer has requested us to look into the possibility to perform a spell check on queries. So far the most promising way of doing this seems to be to create an Analyzer based on the spellchecker of OpenOffice. My question is: has anybody tried this before? Note that a spell checker used with a search engine should use collection frequency information. That's to say, only corrections which are more frequent in the collection than what the user entered should be displayed. Frequency information can also be used when constructing the checker. For example, one need never consider proposing terms that occur in very few documents. And one should not try correction at all for terms which occur in a large proportion of the collection. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: combining open office spellchecker with Lucene
David Spencer wrote: Good heuristics, but are there any more precise, standard guidelines as to how to balance or combine what I think are the following possible criteria in suggesting a better choice: Not that I know of. - ignore (penalize?) terms that are rare I think this one is easy to threshold: ignore matching terms that are rarer than the term entered. - ignore (penalize?) terms that are common This, in effect, falls out of the previous criterion. A term that is very common will not have any matching terms that are more common. As an optimization, you could avoid even looking for matching terms when a term is very common. - terms that are closer (string distance) to the term entered are better This is the meaty one. - terms that start w/ the same 'n' chars as the user's term are better Perhaps. Are folks really better at spelling the beginning of words? Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
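As a concrete reading of these heuristics, here is an untested sketch. The field name, the 1%-of-collection cutoff, and the candidates list (which would come from something like an n-gram index, not shown) are all assumptions:

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.Iterator;
  import java.util.List;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.Term;

  public class SuggestionFilter {
    /** Keep only candidates more frequent than the entered term; skip
     *  correction entirely for very common terms. */
    public static List filter(IndexReader reader, String field,
                              String entered, List candidates)
        throws IOException {
      int enteredFreq = reader.docFreq(new Term(field, entered));
      // entered term occurs in a large proportion of the collection:
      // don't even attempt correction (the 1% cutoff is arbitrary)
      if (enteredFreq > reader.numDocs() / 100) {
        return new ArrayList();
      }
      List kept = new ArrayList();
      for (Iterator i = candidates.iterator(); i.hasNext();) {
        String candidate = (String) i.next();
        // ignore matching terms that are rarer than the term entered
        if (reader.docFreq(new Term(field, candidate)) > enteredFreq) {
          kept.add(candidate);
        }
      }
      return kept;  // still to be ranked by string distance
    }
  }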
Re: maximum index size
Chris Fraschetti wrote: I've seen throughout the list mentions of millions of documents.. 8 million, 20 million, etc etc.. but can lucene potentially handle billions of documents and still efficiently search through them? Lucene can currently handle up to 2^31 documents in a single index. To a large degree this is limited by Java ints and arrays (which are accessed by ints). There are also a few places where the file format limits things to 2^32. On typical PC hardware, 2-3 word searches of an index with 10M documents, each with around 10k of text, require around 1 second, including index i/o time. Performance is more-or-less linear, so that a 100M document index might require nearly 10 seconds per search. Thus, as indexes grow folks tend to distribute searches in parallel to many smaller indexes. That's what Nutch and Google (http://www.computer.org/micro/mi2003/m2022.pdf) do. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
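For the distributed-search approach Doug mentions, Lucene's MultiSearcher can present several smaller indexes as one (Lucene 1.4 also added ParallelMultiSearcher, which queries the shards in separate threads). A minimal sketch with hypothetical shard paths:

  import java.io.IOException;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.MultiSearcher;
  import org.apache.lucene.search.Searchable;
  import org.apache.lucene.search.Searcher;

  public class ShardedSearch {
    public static Searcher open() throws IOException {
      Searchable[] shards = {
        new IndexSearcher("/index/part0"),  // hypothetical shard paths
        new IndexSearcher("/index/part1"),
        new IndexSearcher("/index/part2"),
      };
      return new MultiSearcher(shards);  // searches all shards, merges hits
    }
  }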
Re: telling one version of the index from another?
Bill Janssen wrote: Hi. Hey, Bill. It's been a long time! I've got a Lucene application that's been in use for about two years. Some users are using Lucene 1.2, some 1.3, and some are moving to 1.4. The indices seem to behave differently under each version. I'd like to add code to my application that checks the current user's index version against the version of Lucene that they are using, and automatically re-indexes their files if necessary. However, I can't figure out how to tell the version from the index files. Prior to 1.4, there were no format numbers in the index. These are being added, file-by-file, as we change file formats. As you've discovered, there is currently no public API to obtain the format number of an index. Also, the formats of different files are revved at different times, so there may not be a single format number for the entire index. (Perhaps we should remedy this, by, e.g., always revving the segments version whenever any file changes format.) The documentation on the file formats, at http://jakarta.apache.org/lucene/docs/fileformats.html, directs me to the segments file. However, when I look at a version 1.3 segments file, it seems to bear little relationship to the format described in fileformats.html. Have a look at the version of fileformats.html that shipped with 1.3. You can find this by browsing CVS, looking for the 1.3-final tag. But let me do it for you: http://cvs.apache.org/viewcvs.cgi/*checkout*/jakarta-lucene/docs/fileformats.html?rev=1.15 According to CVS tags, that describes both the 1.3 and 1.2 index file formats. But the part of fileformats.html dealing with the segments file contains no compatibility notes, so I assume it hasn't changed since 1.3. I wrote the bit about compatibility notes when I first documented file formats, and then promptly forgot about it. So, until someone contributes them, there are no compatibility notes. Sorry. Even if it had, what's the idea of using -1 as the format number for 1.4? The idea is to promptly break 1.3 and 1.2 code which tries to read the index. Those versions of Lucene don't check format numbers (because there were none). Positive values would give unpredictable errors. A negative value causes an immediate failure. So, anyone know a way to tell the difference between the various versions of the index files? Crufty hacks welcome :-). The first four bytes of the segments file will mostly do the trick. If it is zero or positive, then the index is a 1.2 or 1.3 index. If it is -1, then it's a 1.4-final or later index. There was a change in formats between 1.2 and 1.3, with no format number change. This was in 1.3 RC1 (note #12 in CHANGES.txt). The semantics of each byte in norm files (.f[0-9]) changed. In 1.2 each byte represented 0.0-255.0 on a linear scale. In 1.3 and later they're eight-bit floats (three-bit mantissa, five-bit exponent, no sign bit). The net result is that if you use a 1.2 index with 1.3 or later then the correct documents will be returned, but scores and rankings will be wacky. With the exception of this last bit, 1.4 should be able to correctly handle indexes from earlier releases. Please report if this is not the case. Cheers, Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
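Doug's crufty hack translates to a few lines of Java. A sketch, assuming the file really is named "segments" and, per the discussion above, that Lucene writes the leading int big-endian:

  import java.io.DataInputStream;
  import java.io.File;
  import java.io.FileInputStream;
  import java.io.IOException;

  public class IndexFormatSniffer {
    public static void main(String[] args) throws IOException {
      File segments = new File(args[0], "segments");
      DataInputStream in = new DataInputStream(new FileInputStream(segments));
      try {
        int first = in.readInt();  // big-endian, like Lucene's OutputStream
        if (first >= 0) {
          System.out.println("1.2/1.3-style index (no format number)");
        } else {
          System.out.println("1.4-final or later index (format " + first + ")");
        }
      } finally {
        in.close();
      }
    }
  }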
Re: Possible to remove duplicate documents in sort API?
Kevin A. Burton wrote: My problem is that I have two machines... one for searching, one for indexing. The searcher has an existing index. The indexer found an UPDATED document and then adds it to a new index and pushes that new index over to the searcher. The searcher then reloads and when someone performs a search BOTH documents could show up (including the stale document). I can't do a delete() on the searcher because the indexer doesn't have the entire index that the searcher has. I can think of a couple ways to fix this. If the indexer box kept copies of the indexes that it has already sent to the searcher, then it can mark updated documents as deleted in these old indexes. Then you can, with the new index, also distribute new .del files for the old indexes. Alternately, you could, on the searcher box, before you open the new index, open an IndexReader on all of the existing indexes and mark the old copies of all updated documents as deleted in the old indexes. This shouldn't take more than a few seconds. IndexReader.delete() just sets a bit in a bit vector that is written to file by IndexReader.close(). So it's quite fast. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
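The second option — marking the stale versions deleted on the searcher box before opening the new index — might look like this untested sketch, where "id" stands in for whatever unique-key field the documents carry:

  import java.io.IOException;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.Term;

  public class StaleDocRemover {
    public static void remove(String oldIndex, String[] updatedIds)
        throws IOException {
      IndexReader reader = IndexReader.open(oldIndex);
      try {
        for (int i = 0; i < updatedIds.length; i++) {
          reader.delete(new Term("id", updatedIds[i]));  // just sets a bit
        }
      } finally {
        reader.close();  // writes the updated .del bit vector
      }
    }
  }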
Re: Why doesn't Document use a HashSet instead of a LinkedList (DocumentFieldList)
Kevin A. Burton wrote: It looks like Document.java uses its own implementation of a LinkedList. Why not use a HashMap to enable O(1) lookup... right now field lookup is O(N), which is certainly no fun. Was this benchmarked? Perhaps there's the assumption that since documents often have few fields the object overhead and hashcode overhead would have been less this way. I have never benchmarked this but would be surprised if it makes a measurable difference in any real application. A linked list is used because it naturally supports multiple entries with the same key. A home-grown linked list was used because, when Lucene was first written, java.util.LinkedList did not exist. Please feel free to benchmark this against a HashMap of LinkedList of Field. This would be slower to construct, which may offset any increased access speed. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: speeding up queries (MySQL faster)
Yonik Seeley wrote: Setup info Stats: - 4.3M documents, 12 keyword fields per document, 11 [ ... ] field1:4 AND field2:188453 AND field3:1 field1:4 done alone selects around 4.2M records field2:188453 done alone selects around 1.6M records field3:1 done alone selects around 1K records The whole query normally selects less than 50 records Only the first 10 are returned (or whatever range the client selects). The field1:4 clause is probably dominating the cost of query execution. Clauses which match large portions of the collection are slow to evaluate. If there are not too many different such clauses then you can optimize this by re-using a Filter in place of such clauses, typically a QueryFilter. For example, Nutch automatically translates such clauses into QueryFilters. See: http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/src/java/net/nutch/searcher/LuceneQueryOptimizer.java?view=markup Note that this only converts clauses whose boost is zero. Since filters do not affect ranking we can only safely convert clauses which do not contribute to the score, i.e., those whose boost is zero. Scores might still be different in the filtered results because of Similarity.coord(). But, in Nutch, Similarity.coord() is overridden to always return 1.0, so that the replacement of clauses with filters does not alter the final scores at all. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
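A hand-rolled version of what the Nutch optimizer does automatically; an untested sketch, with the field names taken from the stats above:

  import java.io.IOException;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.Hits;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.QueryFilter;
  import org.apache.lucene.search.Searcher;
  import org.apache.lucene.search.TermQuery;

  public class CommonClauseSearch {
    // build once and reuse: QueryFilter caches its bits per index reader
    private static final QueryFilter FIELD1_IS_4 =
        new QueryFilter(new TermQuery(new Term("field1", "4")));

    public static Hits search(Searcher searcher, Query selectiveClauses)
        throws IOException {
      // selectiveClauses carries e.g. field2:188453 AND field3:1
      return searcher.search(selectiveClauses, FIELD1_IS_4);
    }
  }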
Re: NegativeArraySizeException when creating a new IndexSearcher
Looks to me like you're using an older version of Lucene on your Linux box. The code is back-compatible: it will read old indexes, but Lucene 1.3 cannot read indexes created by Lucene 1.4, and will fail in the way you describe. Doug Sven wrote: Hi! I have a problem porting a Lucene-based knowledgebase from Windows to Linux. On Windows it works fine, whereas I get a NegativeArraySizeException on Linux when I try to initialise a new IndexSearcher to search the index. Deleting and rebuilding the index didn't help. I checked permissions, file path and lock_dir but as far as I can say they seem to be all right. As I couldn't find anyone else with the same problem I guess I've overlooked something, but I've run out of ideas. I use lucene-1.4-rc2 and tomcat 5.0.18. Can someone please help me with this or has an idea? Kind regards, Sven java.lang.NegativeArraySizeException at org.apache.lucene.index.TermInfosReader.readIndex(TermInfosReader.java:106) at org.apache.lucene.index.TermInfosReader.init(TermInfosReader.java:82) at org.apache.lucene.index.SegmentReader.init(SegmentReader.java:141) at org.apache.lucene.index.SegmentReader.init(SegmentReader.java:120) at org.apache.lucene.index.IndexReader$1.doBody(IndexReader.java:118) at org.apache.lucene.store.Lock$With.run(Lock.java:148) at org.apache.lucene.index.IndexReader.open(IndexReader.java:111) at org.apache.lucene.index.IndexReader.open(IndexReader.java:99) at org.apache.lucene.search.IndexSearcher.init(IndexSearcher.java:75) at com.sykon.knowledgebase.action.ListQueryResultAction.act(ListQueryResultAction.java:134) at org.apache.cocoon.components.treeprocessor.sitemap.ActTypeNode.invoke(ActTypeNode.java:159) at org.apache.cocoon.components.treeprocessor.sitemap.ActionSetNode.call(ActionSetNode.java:121) at org.apache.cocoon.components.treeprocessor.sitemap.ActSetNode.invoke(ActSetNode.java:98) at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:84) at org.apache.cocoon.components.treeprocessor.sitemap.PreparableMatchNode.invoke(PreparableMatchNode.java:165) at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:107) at org.apache.cocoon.components.treeprocessor.sitemap.PipelineNode.invoke(PipelineNode.java:162) at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:107) at org.apache.cocoon.components.treeprocessor.sitemap.PipelinesNode.invoke(PipelinesNode.java:136) at org.apache.cocoon.components.treeprocessor.TreeProcessor.process(TreeProcessor.java:371) at org.apache.cocoon.components.treeprocessor.TreeProcessor.process(TreeProcessor.java:312) at org.apache.cocoon.components.treeprocessor.sitemap.MountNode.invoke(MountNode.java:133) at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:84) at org.apache.cocoon.components.treeprocessor.sitemap.PreparableMatchNode.invoke(PreparableMatchNode.java:165) at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:107) at org.apache.cocoon.components.treeprocessor.sitemap.PipelineNode.invoke(PipelineNode.java:162) at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:107) at org.apache.cocoon.components.treeprocessor.sitemap.PipelinesNode.invoke(PipelinesNode.java:136) at org.apache.cocoon.components.treeprocessor.TreeProcessor.process(TreeProcessor.java:371) at org.apache.cocoon.components.treeprocessor.TreeProcessor.process(TreeProcessor.java:312) at org.apache.cocoon.Cocoon.process(Cocoon.java:656) at org.apache.cocoon.servlet.CocoonServlet.service(CocoonServlet.java:1112) at javax.servlet.http.HttpServlet.service(HttpServlet.java:856) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:284) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:204) at org.apache.catalina.core.ApplicationDispatcher.invoke(ApplicationDispatcher.java:742) at org.apache.catalina.core.ApplicationDispatcher.processRequest(ApplicationDispatcher.java:506) at org.apache.catalina.core.ApplicationDispatcher.doForward(ApplicationDispatcher.java:443) at org.apache.catalina.core.ApplicationDispatcher.forward(ApplicationDispatcher.java:359) at org.apache.jasper.runtime.PageContextImpl.doForward(PageContextImpl.java:712) at org.apache.jasper.runtime.PageContextImpl.forward(PageContextImpl.java:682) at org.apache.jsp.knowlegebase.controller_jsp._jspService(controller_jsp.java:844) at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:133) at javax.servlet.http.HttpServlet.service(HttpServlet.java:856) at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:311) at
Re: Debian build problem with 1.4.1
I can successfully use gcc 3.4.0 with Lucene as follows: ant jar jar-demo gcj -O3 build/lucene-1.5-rc1-dev.jar build/lucene-demos-1.5-rc1-dev.jar -o indexer --main=org.apache.lucene.demo.IndexHTML ./indexer -create docs It runs pretty snappy too! However I don't know if there's much mileage in packaging Lucene as a native library. It's easy enough for folks to compile Lucene this way, and applications built this way are pretty small. The big thing to install is libgcj. Doug Jeff Breidenbach wrote: Ok, Lucene 1.4.1 has been uploaded to Debian. Hopefully it will have enough time to percolate before the sarge release. Now that that is taken care of, I'm curious about the status of gcj compilation. Packaging Lucene as a native library might be useful for projects such as PyLucene, and it is also advantageous for license reasons, i.e. avoiding the non-free JVM dependency. What's the current gcj compilation recipe? The best I could find on Google (below) seems a little bit stale. http://www.mail-archive.com/[EMAIL PROTECTED]/msg04131.html Cheers, Jeff - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Hit Score [ Between ]
You could instead use a HitCollector to gather only documents with scores in that range. Doug Karthik N S wrote: Hi Apologies If I want to get all the hits with scores between 0.5f and 0.8f, I usually use query = QueryParser.parse(srchkey, Fields, analyzer); Hits hits = searcher.search(query); int tothits = hits.length(); for (int i = 0; i < tothits; i++) { Document docs = hits.doc(i); float Score = hits.score(i); if ((Score > 0.5f) && (Score < 0.8f)) { System.out.println("FileName : " + docs.get("filename")); } } Is there any other way to do this? Please advise me. Thx. WITH WARM REGARDS HAVE A NICE DAY [ N.S.KARTHIK] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
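Doug's suggestion, as an untested sketch (collecting document numbers only; loading stored fields inside collect() would slow the search down):

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;
  import org.apache.lucene.search.HitCollector;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.Searcher;

  public class ScoreRangeCollector {
    public static List collect(Searcher searcher, Query query)
        throws IOException {
      final List matches = new ArrayList();
      searcher.search(query, new HitCollector() {
        public void collect(int doc, float score) {
          if (score > 0.5f && score < 0.8f) {  // the requested range
            matches.add(new Integer(doc));
          }
        }
      });
      return matches;
    }
  }

Note that HitCollector sees raw scores, not the top-score-normalized values reported by Hits, so the 0.5/0.8 cutoffs may need adjusting.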
Re: Split an existing index into smaller segments without a re-index?
Kevin A. Burton wrote: Is it possible to take an existing index (say 1G) and break it up into a number of smaller indexes (say 10 100M indexes)... I don't think there's currently an API for this but it's certainly possible (I think). Yes, it is theoretically possible but not yet implemented. An easy way to implement it would be to subclass FilterIndexReader to return a subset of documents, then use IndexWriter.addIndexes() to write out each subset as a new index. Subsets could be ranges of document numbers, and one could use TermPositions.skipTo() to accelerate the TermPositions subset implementation, but this still wouldn't be quite as fast as an index splitter that only reads each TermPositions once. If we added a lower-level index writing API then one could use that to implement this... Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
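To make the FilterIndexReader idea concrete, here is an untested sketch that exposes a document-number range by reporting everything outside it as deleted; since merging skips deleted documents, IndexWriter.addIndexes(IndexReader[]) would then write just that slice. This is simpler, and slower, than the TermPositions.skipTo() variant Doug outlines:

  import org.apache.lucene.index.FilterIndexReader;
  import org.apache.lucene.index.IndexReader;

  public class SubsetReader extends FilterIndexReader {
    private final int lo, hi;  // document-number range [lo, hi)

    public SubsetReader(IndexReader in, int lo, int hi) {
      super(in);
      this.lo = lo;
      this.hi = hi;
    }

    public boolean hasDeletions() { return true; }

    public boolean isDeleted(int n) {
      return n < lo || n >= hi || in.isDeleted(n);
    }

    public int numDocs() {
      int count = 0;  // count the live documents inside the range
      for (int i = lo; i < hi && i < in.maxDoc(); i++) {
        if (!in.isDeleted(i)) count++;
      }
      return count;
    }
  }

Writing one slice would then be writer.addIndexes(new IndexReader[] { new SubsetReader(reader, 0, 100000) }), repeated per range.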
Re: Caching of TermDocs
John Patterson wrote: I would like to hold a significant amount of the index in memory but use the disk index as a spill-over. Obviously the best situation is to hold in memory only the information that is likely to be used again soon. It seems that caching TermDocs would allow popular search terms to be searched more efficiently while the less common terms would need to be read from disk. The operating system already caches recent disk i/o. So what you'd save primarily would be the overhead of parsing the data. However the parsed form, a sequence of docNo and freq ints, is nearly eight times as large as its compressed size in the index. So your cache would consume a lot of memory. Whether this provides much overall speedup depends on the distribution of common terms in your query traffic. If you have a few terms that are searched very frequently then it might pay off. In my experience with general-purpose search engines this is not usually the case: folks seem to use rarer words in queries than they do in ordinary text. But in some search applications perhaps the traffic is more skewed. Only some experiments would tell for sure. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Logic of score method in hits class
Lucene scores are not percentages. They really only make sense compared to other scores for the same query. If you like percentages, you can divide all scores by the first score and multiply by 100. Doug lingaraju wrote: Dear All How does the score method (logic) work in the Hits class? For a 100% match the score returned is only 69%. Thanks and regards Raju - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
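Doug's percentage trick, as a small untested sketch (the "filename" field is just for illustration):

  import java.io.IOException;
  import org.apache.lucene.search.Hits;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.Searcher;

  public class PercentScores {
    public static void print(Searcher searcher, Query query) throws IOException {
      Hits hits = searcher.search(query);
      float top = hits.length() > 0 ? hits.score(0) : 1.0f;
      for (int i = 0; i < hits.length(); i++) {
        // scale so the best hit of this query reads as 100%
        int percent = Math.round(100.0f * hits.score(i) / top);
        System.out.println(hits.doc(i).get("filename") + ": " + percent + "%");
      }
    }
  }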
Re: Boosting documents
Rob Clews wrote: I want to do the same, set a boost for a field containing a date that lowers as the date is further from now, is there any way I could do this? You could implement Similarity.idf(Term, Searcher) to, when Term.field().equals("date"), return a value that is greater for more recent dates. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
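An untested sketch of that idea; it assumes dates were indexed as yyyyMMdd strings in a "date" field, and the decay curve is arbitrary:

  import java.io.IOException;
  import java.text.ParseException;
  import java.text.SimpleDateFormat;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.DefaultSimilarity;
  import org.apache.lucene.search.Searcher;

  public class RecencySimilarity extends DefaultSimilarity {
    public float idf(Term term, Searcher searcher) throws IOException {
      if ("date".equals(term.field())) {
        // newer dates get larger weights; the curve here is arbitrary
        return (float) (1.0 / Math.log(2.0 + daysSince(term.text())));
      }
      return super.idf(term, searcher);
    }

    private static long daysSince(String yyyymmdd) {
      try {
        long then = new SimpleDateFormat("yyyyMMdd").parse(yyyymmdd).getTime();
        return Math.max(0L, (System.currentTimeMillis() - then) / 86400000L);
      } catch (ParseException e) {
        return 0L;  // unparseable date: treat as current
      }
    }
  }

Install it at search time with searcher.setSimilarity(new RecencySimilarity()).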
Re: over 300 GB to index: feasability and performance issue
Vincent Le Maout wrote: I have to index a huge, huge amount of data: about 10 million documents making up about 300 GB. Is there any technical limitation in Lucene that could prevent me from processing such an amount (I mean, of course, apart from the external limits induced by the hardware: RAM, disks, the system, whatever)? Lucene is in theory able to support up to 2B documents in a single index. Folks have successfully built indexes with several hundred million documents. 10 million should not be a problem. If possible, does anyone have an idea of the amount of resources needed: RAM, CPU time, size of indexes, access time on such a collection? If not, is it possible to extrapolate an estimate from previous benchmarks? For simple 2-3 term queries, with average sized documents (~10k of text) you should get decent performance (1 second / query) on a 10M document index. An index typically requires around 35% of the plain text size. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Sort: 1.4-rc3 vs. 1.4-final
The key in the WeakHashMap should be the IndexReader, not the Entry. I think this should become a two-level cache, a WeakHashMap of HashMaps, the WeakHashMap keyed by IndexReader, the HashMap keyed by Entry. I think the Entry class can also be changed to not include an IndexReader field. Does this make sense? Would someone like to construct a patch and submit it to the developer list? Doug Aviran wrote: I think I found the problem. FieldCacheImpl uses WeakHashMap to store the cached objects, but since there is no other reference to this cache it is getting released. Switching to HashMap solves it. The only problem is that I don't see anywhere where the cached object will get released if you open a new IndexReader. Aviran -Original Message- From: Greg Gershman [mailto:[EMAIL PROTECTED] Sent: Wednesday, July 21, 2004 13:13 PM To: Lucene Users List Subject: RE: Sort: 1.4-rc3 vs. 1.4-final I've done a bit more snooping around; it seems that in FieldSortedHitQueue.getCachedComparator (line 153), calls to look up a stored comparator in the cache always return null. This occurs even for the built-in sort types (I tested it on integers and my code for longs). The comparators don't even appear to be being stored in the HashMap to begin with. Any ideas? Greg --- Aviran [EMAIL PROTECTED] wrote: Since I had to implement sorting in lucene 1.2 I had to write my own sorting using something similar to a lucene contribution called SortField. Yesterday I did some tests, trying to use lucene 1.4 Sort objects, and I realized that my old implementation works 40% faster than Lucene's implementation. My guess is that you are right and there is a problem with the cache, although I couldn't find what that is yet. Aviran -Original Message- From: Greg Gershman [mailto:[EMAIL PROTECTED] Sent: Wednesday, July 21, 2004 9:22 AM To: [EMAIL PROTECTED] Subject: Sort: 1.4-rc3 vs. 1.4-final When rc3 came out, I modified the classes used for sorting to, in addition to Integer, Float and String-based sort keys, use Long values. All I did was add extra statements in 2 classes (SortField and FieldSortedHitQueue) that made a special case for longs, and created a LongSortedHitQueue identical to the IntegerSortedHitQueue, only using longs. This worked as expected; Long values converted to strings and stored in Field.Keyword type fields would be sorted according to Long order. The initial query would take a while, to build the sorted array, but subsequent queries would take little to no time at all. I went back to look at 1.4 final, and noticed the Sort implementation has changed quite a bit. I tried the same type of modifications to the existing source files, but was unable to achieve similar results. Each subsequent query seems to take a significant amount of time, as if the sorted array is being rebuilt each time. Also, I tried sorting on an Integer field and got similar results, which leads me to believe there might be a caching problem somewhere. Has anyone else seen this in 1.4-final? Also, I would like it if Long sorted fields could become a part of the API; it makes sorting by date a breeze. Thanks! Greg Gershman __ Do you Yahoo!? New and Improved Yahoo! Mail - Send 10MB messages! http://promotions.yahoo.com/new_mail - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] __ Do you Yahoo!? Vote for the stars of Yahoo!'s next ad campaign!
http://advision.webevents.yahoo.com/yahoo/votelifeengine/ - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
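A sketch of the two-level structure Doug proposes, kept generic since the actual FieldCacheImpl types aren't shown here; per-reader entries disappear automatically once an IndexReader is no longer referenced:

  import java.util.HashMap;
  import java.util.Map;
  import java.util.WeakHashMap;

  public class TwoLevelCache {
    // outer map weakly keyed by IndexReader; inner map keyed by Entry
    private final Map byReader = new WeakHashMap();

    public synchronized Object get(Object reader, Object entry) {
      Map inner = (Map) byReader.get(reader);
      return (inner == null) ? null : inner.get(entry);
    }

    public synchronized void put(Object reader, Object entry, Object value) {
      Map inner = (Map) byReader.get(reader);
      if (inner == null) {
        inner = new HashMap();
        byReader.put(reader, inner);
      }
      inner.put(entry, value);
    }
  }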
Re: Weighting database fields
Ernesto De Santis wrote: If a field has a boost value set at index time, and at search time the query has another boost value for this field, what happens? Which value is used for the boost? The two boosts are both multiplied into the score. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Post-sorted inverted index?
You can define a subclass of FilterIndexReader that re-sorts documents in TermPositions(Term) and document(int), then use IndexWriter.addIndexes() to write this in Lucene's standard format. I have done this in Nutch, with the (as yet unused) IndexOptimizer. http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/src/java/net/nutch/indexer/IndexOptimizer.java?view=markup Doug Aphinyanaphongs, Yindalon wrote: I gather from reading the documentation that the scores for each document hit are computed at query time. I have an application that, due to the complexity of the function, cannot compute scores at query time. Would it be possible for me to store the documents in pre-sorted order in the inverted index? (i.e. after the initial index is created, to have a post processing step to sort and reindex the final documents). For example: Document A - score 0.2 Document B - score 0.4 Document C - score 0.6 Thus for the word 'the', the stored order in the index would be C,B,A. Thanks! - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Scoring without normalization!
Have you looked at: http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Similarity.html in particular, at: http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Similarity.html#lengthNorm(java.lang.String,%20int) http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Similarity.html#queryNorm(float) Doug Jones G wrote: Sadly, I am still running into problems. Explain shows the following after the modification. Rank: 1 ID: 11285358 Score: 5.5740864E8 5.5740864E8 = product of: 8.3611296E8 = sum of: 8.3611296E8 = product of: 6.6889037E9 = weight(title:iron in 1235940), product of: 0.12621856 = queryWeight(title:iron), product of: 7.0507255 = idf(docFreq=10816) 0.017901499 = queryNorm 5.2994613E10 = fieldWeight(title:iron in 1235940), product of: 1.0 = tf(termFreq(title:iron)=1) 7.0507255 = idf(docFreq=10816) 7.5161928E9 = fieldNorm(field=title, doc=1235940) 0.125 = coord(1/8) 2.7106019E-8 = product of: 1.08424075E-7 = sum of: 5.7318403E-9 = weight(abstract:an in 1235940), product of: 0.03711049 = queryWeight(abstract:an), product of: 2.073038 = idf(docFreq=1569960) 0.017901499 = queryNorm 1.5445337E-7 = fieldWeight(abstract:an in 1235940), product of: 1.0 = tf(termFreq(abstract:an)=1) 2.073038 = idf(docFreq=1569960) 7.4505806E-8 = fieldNorm(field=abstract, doc=1235940) 1.0269223E-7 = weight(abstract:iron in 1235940), product of: 0.111071706 = queryWeight(abstract:iron), product of: 6.2046037 = idf(docFreq=25209) 0.017901499 = queryNorm 9.24558E-7 = fieldWeight(abstract:iron in 1235940), product of: 2.0 = tf(termFreq(abstract:iron)=4) 6.2046037 = idf(docFreq=25209) 7.4505806E-8 = fieldNorm(field=abstract, doc=1235940) 0.25 = coord(2/8) 0.667 = coord(2/3) Rank: 2 ID: 8157438 Score: 2.7870432E8 2.7870432E8 = product of: 8.3611296E8 = product of: 6.6889037E9 = weight(title:iron in 159395), product of: 0.12621856 = queryWeight(title:iron), product of: 7.0507255 = idf(docFreq=10816) 0.017901499 = queryNorm 5.2994613E10 = fieldWeight(title:iron in 159395), product of: 1.0 = tf(termFreq(title:iron)=1) 7.0507255 = idf(docFreq=10816) 7.5161928E9 = fieldNorm(field=title, doc=159395) 0.125 = coord(1/8) 0.3334 = coord(1/3) Rank: 3 ID: 10543103 Score: 2.7870432E8 2.7870432E8 = product of: 8.3611296E8 = product of: 6.6889037E9 = weight(title:iron in 553967), product of: 0.12621856 = queryWeight(title:iron), product of: 7.0507255 = idf(docFreq=10816) 0.017901499 = queryNorm 5.2994613E10 = fieldWeight(title:iron in 553967), product of: 1.0 = tf(termFreq(title:iron)=1) 7.0507255 = idf(docFreq=10816) 7.5161928E9 = fieldNorm(field=title, doc=553967) 0.125 = coord(1/8) 0.3334 = coord(1/3) Rank: 4 ID: 8753559 Score: 2.7870432E8 2.7870432E8 = product of: 8.3611296E8 = product of: 6.6889037E9 = weight(title:iron in 2563152), product of: 0.12621856 = queryWeight(title:iron), product of: 7.0507255 = idf(docFreq=10816) 0.017901499 = queryNorm 5.2994613E10 = fieldWeight(title:iron in 2563152), product of: 1.0 = tf(termFreq(title:iron)=1) 7.0507255 = idf(docFreq=10816) 7.5161928E9 = fieldNorm(field=title, doc=2563152) 0.125 = coord(1/8) 0.3334 = coord(1/3) I would like to get rid of all normalizations and just have TF and IDF. What am I missing? On Thu, 15 Jul 2004 Anson Lau wrote : If you don't mind hacking the source: In Hits.java In method getMoreDocs() // Comment out the following //float scoreNorm = 1.0f; //if (length > 0 && scoreDocs[0].score > 1.0f) { // scoreNorm = 1.0f / scoreDocs[0].score; //} // And just set scoreNorm to 1. float scoreNorm = 1.0f; I don't know if you can do it without going to the src. Anson -Original Message- From: Jones G [mailto:[EMAIL PROTECTED] Sent: Thursday, July 15, 2004 6:52 AM To: [EMAIL PROTECTED] Subject: Scoring without normalization! How do I remove document normalization from scoring in Lucene? I just want to stick to TF and IDF. Thanks. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
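Rather than patching Hits, the two methods Doug points at can be overridden in a Similarity subclass; an untested sketch (note that lengthNorm() is baked into the norm files at index time, so the index must be rebuilt with this Similarity set on the IndexWriter for it to take full effect):

  import org.apache.lucene.search.DefaultSimilarity;

  public class PlainTfIdfSimilarity extends DefaultSimilarity {
    public float lengthNorm(String fieldName, int numTerms) {
      return 1.0f;  // ignore document/field length
    }
    public float queryNorm(float sumOfSquaredWeights) {
      return 1.0f;  // don't normalize by query weight
    }
  }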
Re: Token or not Token, PerFieldAnalyzer
Florian Sauvin wrote: Everywhere in the documentation (and it seems logical) you say to use the same analyzer for indexing and querying... how is this handled on untokenized fields? Imperfectly. The QueryParser knows nothing about the index, so it does not know which fields were tokenized and which were not. Moreover, even the index does not know this, since you can freely intermix tokenized and untokenized values in a single field. In my case, I have certain fields on which I want the tokenization and analysis and everything to happen... but on other fields, I just want to index the content as it is (no alterations at all) and not analyze at query time... is that possible? It is very possible. A good way to handle this is to use PerFieldAnalyzerWrapper. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
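For the untokenized fields, an analyzer that passes the raw value through as a single token can be plugged into PerFieldAnalyzerWrapper. An untested sketch; the "id" field name is made up:

  import java.io.IOException;
  import java.io.Reader;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
  import org.apache.lucene.analysis.Token;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;

  public class UntokenizedAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, final Reader reader) {
      return new TokenStream() {
        private boolean done = false;
        public Token next() throws IOException {
          if (done) return null;
          done = true;
          StringBuffer text = new StringBuffer();  // read the whole value
          char[] buffer = new char[256];
          for (int n = reader.read(buffer); n != -1; n = reader.read(buffer)) {
            text.append(buffer, 0, n);
          }
          return new Token(text.toString(), 0, text.length());
        }
      };
    }

    public static Analyzer wrap() {
      PerFieldAnalyzerWrapper wrapper =
          new PerFieldAnalyzerWrapper(new StandardAnalyzer());
      wrapper.addAnalyzer("id", new UntokenizedAnalyzer());
      return wrapper;
    }
  }

Using the same wrapper for both indexing and query parsing keeps the two sides consistent, which is the point of the advice above.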
Re: release migration plan
fp235-5 wrote: I am looking at the code to implement setIndexInterval() in IndexWriter. I'd like to have your opinion on the best way to do it. Currently the creation of an instance of TermInfosWriter requires the following steps: ... IndexWriter.addDocument(Document) IndexWriter.addDocument(Document, Analyzer) DocumentWriter.addDocument(String, Document) DocumentWriter.writePostings(Posting[], String) TermInfosWriter.init To give a different value to indexInterval in TermInfosWriter, we need to add a variable holding this value into IndexWriter and DocumentWriter and modify the constructors for DocumentWriter and TermInfosWriter. (quite heavy changes) I think this is the best approach. I would replace other parameters in these constructors which can be derived from an IndexWriter with the IndexWriter. That way, if we add more parameters like this, they can also be passed in through the IndexWriter. All of the parameters to the DocumentWriter constructor are fields of IndexWriter. So one can instead simply pass a single parameter, an IndexWriter, then access its directory, analyzer, similarity and maxFieldLength in the DocumentWriter constructor. A public getDirectory() method would also need to be added to IndexWriter for this to work. Similarly, two of SegmentMerger's constructor parameters could be replaced with an IndexWriter, the directory and boolean useCompoundFile. In SegmentMerger I would replace the directory parameter with IndexWriter. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: AW: Understanding TooManyClauses-Exception and Query-RAM-size
[EMAIL PROTECTED] wrote: What I really would like to see are some best practices or some advice from some users who are working with really large indices how they handle this situation, or why they don't have to care about it or maybe why I am completely missing the point ;-)) Many folks with really large indexes just don't permit things like wildcard and range searches. For example, Google supports no wildcards and has only recently added limited numeric range searching. Yahoo! supports neither. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene Search has poor cpu utilization on a 4-CPU machine
Aviran wrote: First let me explain what I found out. I'm running Lucene on a 4 CPU server. While doing some stress tests I've noticed (by doing a full thread dump) that searching threads are blocked on the method: public FieldInfo fieldInfo(int fieldNumber) This causes significant cpu idle time. What version of Lucene are you running? Also, can you please send the stack traces of the blocked threads, or at least a description of them? I'd be interested to see what context this happens in. In particular, which IndexReader and Searcher/Scorer/Weight methods does it happen under? I noticed that the class org.apache.lucene.index.FieldInfos uses private class members Vector byNumber and Hashtable byName, both of which are synchronized objects. By changing the Vector byNumber to ArrayList byNumber I was able to get a 110% improvement in performance (number of searches per second). That's impressive! Good job finding a bottleneck! My question is: do the fields byNumber and byName have to be synchronized, and what can happen if I change them to be ArrayList and HashMap, which are not synchronized? Can this corrupt the index or the integrity of the results? I think that is a safe change. FieldInfos is only modified by DocumentWriter and SegmentMerger, and there is no possibility of other threads accessing those instances. Please submit a patch to the developer mailing list. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene Search has poor cpu utilization on a 4-CPU machine
Aviran wrote: I use Lucene 1.4 final Here is the thread dump for one blocked thread (If you want a full thread dump for all threads I can do that too) Thanks. I think I get the point. I recently removed a synchronization point higher in the stack, so that now this one shows up! Whether or not you submit a patch, please file a bug report in Bugzilla with your proposed change, so that we don't lose track of this issue. Thanks, Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Why is Field.java final?
Kevin A. Burton wrote: I was going to create a new IDField class which just calls super( name, value, false, true, false) but noticed I was prevented because Field.java is final? You don't need to subclass to do this, just a static method somewhere. Why is this? I can't see any harm in making it non-final... Field and Document are not designed to be extensible. They are persisted in such a way that added methods are not available when the field is restored. In other words, when a field is read, it always constructs an instance of Field, not a subclass. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
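The static-method alternative is tiny; a sketch using the same constructor arguments the subclass would have passed to super():

  import org.apache.lucene.document.Field;

  public class Fields {
    /** Not stored, indexed, not tokenized — i.e. new Field(name, value,
     *  false, true, false), as in the message above. */
    public static Field idField(String name, String value) {
      return new Field(name, value, false, true, false);
    }
  }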
Re: Field.java - STORED, NOT_STORED, etc...
Kevin A. Burton wrote: So I added a few constants to my class: new Field( name, value, NOT_STORED, INDEXED, NOT_TOKENIZED ); which IMO is a lot easier to maintain. Why not add these constants to Field.java: public static final boolean STORED = true; public static final boolean NOT_STORED = false; public static final boolean INDEXED = true; public static final boolean NOT_INDEXED = false; public static final boolean TOKENIZED = true; public static final boolean NOT_TOKENIZED = false; Of course you still have to remember the order but this becomes a lot easier to maintain. It would be best to get the compiler to check the order. If we change this, why not use type-safe enumerations: http://www.javapractices.com/Topic1.cjp The calls would look like: new Field(name, value, Stored.YES, Indexed.NO, Tokenized.YES); Stored could be implemented as the nested class: public final class Stored { private Stored() {} public static final Stored YES = new Stored(); public static final Stored NO = new Stored(); } and the compiler would check the order of arguments. How's that? Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Field.java - STORED, NOT_STORED, etc...
Doug Cutting wrote: The calls would look like: new Field(name, value, Stored.YES, Indexed.NO, Tokenized.YES); Stored could be implemented as the nested class: public final class Stored { private Stored() {} public static final Stored YES = new Stored(); public static final Stored NO = new Stored(); } Actually, while we're at it, Indexed and Tokenized are confounded. A single entry would be better, something like: public final class Index { private Index() {} public static final Index NO = new Index(); public static final Index TOKENIZED = new Index(); public static final Index UN_TOKENIZED = new Index(); } then calls would look like just: new Field(name, value, Store.YES, Index.TOKENIZED); BTW, I think Stored would be better named Store too. BooleanQuery's required and prohibited flags could get the same treatment, with the addition of a nested class like: public final class Occur { private Occur() {} public static final Occur MUST_NOT = new Occur(); public static final Occur SHOULD = new Occur(); public static final Occur MUST = new Occur(); } and adding a boolean clause would look like: booleanQuery.add(new TermQuery(...), Occur.MUST); Then we can deprecate the old methods. Comments? Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Way to repair an index broking during 1/2 optimize?
Kevin A. Burton wrote: With the typical handful of fields, one should never see more than hundreds of files. We only have 13 fields... Though to be honest I'm worried that even if I COULD do the optimize that it would run out of file handles. Optimization doesn't open all files at once. The most files that are ever opened by an IndexWriter is just: 4 + (5 + numIndexedFields) * (mergeFactor-1) This includes during optimization. However, when searching, an IndexReader must keep most files open. In particular, the maximum number of files an unoptimized, non-compound IndexReader can have open is: (5 + numIndexedFields) * (mergeFactor-1) * (log_base_mergeFactor(numDocs/minMergeDocs)) A compound IndexReader, on the other hand, should open at most, just: (mergeFactor-1) * (log_base_mergeFactor(numDocs/minMergeDocs)) An optimized, non-compound IndexReader will open just (5 + numIndexedFields) files. And an optimized, compound IndexReader should only keep one file open. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene shouldn't use java.io.tmpdir
Armbrust, Daniel C. wrote: The problem I ran into the other day with the new lock location is that Person A had started an index, ran into problems, erased the index and asked me to look at it. I tried to rebuild the index (in the same place on a Solaris machine) and found out that A) her locks still existed, B) I didn't have a clue where it put the locks on the Solaris machine (since no full path was given with the error - has this been fixed?) and C) I didn't have permission to remove her locks. I think these problems have been fixed. When an index is created, all old locks are first removed. And when a lock cannot be obtained, its full pathname is printed. Can you replicate this with 1.4-final? I think the locks should go back in the index, and we should fall back or give an option to put them elsewhere for the case of the read-only index. Changing the lock location is risky. Code which writes an index would not be required to alter the lock location, but code which reads it would be. This can easily lead to uncoordinated access. So it is best if the default lock location works well in most cases. We try to use a temporary directory writable by all users, and attempt to handle situations like those you describe above. Please tell me if you continue to have problems with locking. Thanks, Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Way to repair an index broking during 1/2 optimize?
Kevin A. Burton wrote: So is it possible to fix this index now? Can I just delete the most recent segment that was created? I can find this by ls -alt Sorry, I forgot to answer your question: this should work fine. I don't think you should even have to delete that segment. Also, to elaborate on my previous comment, a mergeFactor of 5000 not only delays the work until the end, but it also makes the disk workload more seek-dominated, which is not optimal. So I suspect a smaller merge factor, together with a larger minMergeDocs, will be much faster overall, including the final optimize(). Please tell us how it goes. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: problem running lucene 1.4 demo on a solaris machine (permission denied)
MATL (Mats Lindberg) wrote: When i copied the lucene jar file to the solaris machine from the windows machine i used a ftp program. FTP probably mangled the file. You need to use FTP's binary mode. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Way to repair an index broking during 1/2 optimize?
Kevin A. Burton wrote: No... I changed the mergeFactor back to 10 as you suggested. Then I am confused about why it should take so long. Did you by chance set the IndexWriter.infoStream to something, so that it logs merges? If so, it would be interesting to see that output, especially the last entry. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene shouldn't use java.io.tmpdir
Kevin A. Burton wrote: This is why I think it makes more sense to use our own java.io.tmpdir to be on the safe side. I think the bug is that Tomcat changes java.io.tmpdir. I thought that the point of the system property java.io.tmpdir was to have a portable name for /tmp on unix, c:\windows\tmp on Windows, etc. Tomcat breaks that. So must Lucene have its own way of finding the platform-specific temporary directory that everyone can write to? Perhaps, but it seems a shame, since Java already has a standard mechanism for this, which Tomcat abuses... Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: indexing help
John Wang wrote: Just for my education, can you maybe elaborate on using the implement an IndexReader that delivers a synthetic index approach? IndexReader is an abstract class. It has few data fields, and few non-static methods that are not implemented in terms of abstract methods. So, in effect, it is an interface. When Lucene indexes a token stream it creates a single-document index that is then merged with other single- and multi-document indexes to form an index that is searched. You could bypass the first step of this (indexing a token stream) by instead directly implementing all of IndexReader's abstract methods to return the same thing as the single-document index that Lucene would create. This would be marginally faster, as no Token objects would be created at all. But, since IndexReader has a lot of abstract methods, it would be a lot of work. I didn't really mean it as a practical suggestion. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Way to repair an index broking during 1/2 optimize?
Kevin A. Burton wrote: During an optimize I assume Lucene starts writing to a new segment and leaves all others in place until everything is done and THEN deletes them? That's correct. The only settings I uses are: targetIndex.mergeFactor=10; targetIndex.minMergeDocs=1000; the resulting index has 230k files in it :-/ Something sounds very wrong for there to be that many files. The maximum number of files should be around: (7 + numIndexedFields) * (mergeFactor-1) * (log_base_mergeFactor(numDocs/minMergeDocs)) With 14M documents, log_10(14M/1000) is 4, which gives, for you: (7 + numIndexedFields) * 36 = 230k 7*36 + numIndexedFields*36 = 230k numIndexedFields = (230k - 7*36) / 36 =~ 6k So you'd have to have around 6k unique field names to get 230k files. Or something else must be wrong. Are you running on win32, where file deletion can be difficult? With the typical handful of fields, one should never see more than hundreds of files. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Most efficient way to index 14M documents (out of memory/file handles)
A mergeFactor of 5000 is a bad idea. If you want to index faster, try increasing minMergeDocs instead. If you have lots of memory this can probably be 5000 or higher. Also, why do you optimize before you're done? That only slows things. Perhaps you have to do it because you've set mergeFactor to such an extreme value? I do not recommend a merge factor higher than 100. Doug Kevin A. Burton wrote: I'm trying to burn an index of 14M documents. I have two problems. 1. I have to run optimize() every 50k documents or I run out of file handles. this takes TIME and of course is linear to the size of the index so it just gets slower by the time I complete. It starts to crawl at about 3M documents. 2. I eventually will run out of memory in this configuration. I KNOW this has been covered before but for the life of me I can't find it in the archives, the FAQ or the wiki. I'm using an IndexWriter with a mergeFactor of 5k and then optimizing every 50k documents. Does it make sense to just create a new IndexWriter for every 50k docs and then do one big optimize() at the end? Kevin - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
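Doug's advice, expressed as writer settings (in Lucene 1.4 these are public fields on IndexWriter; the path and values are illustrative):

  import java.io.IOException;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriter;

  public class BulkIndexerSetup {
    public static IndexWriter open() throws IOException {
      IndexWriter writer =
          new IndexWriter("/index/big", new StandardAnalyzer(), true);
      writer.mergeFactor = 10;     // keep the default; huge values risk file-handle trouble
      writer.minMergeDocs = 5000;  // buffer ~5000 docs in RAM per on-disk segment
      return writer;               // add everything, then optimize() once at the end
    }
  }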
Re: Most efficient way to index 14M documents (out of memory/file handles)
Julien, Thanks for the excellent explanation. I think this thread points to a documentation problem. We should improve the javadoc for these parameters to make it easier for folks to use them correctly. In particular, the javadoc for mergeFactor should mention that very large values (>100) are not recommended, since they can run into file handle limitations with FSDirectory. The maximum number of open files while merging is around mergeFactor * (5 + number of indexed fields). Perhaps mergeFactor should be tagged an Expert parameter to discourage folks playing with it, as it is such a common source of problems. The javadoc should instead encourage using minMergeDocs to increase indexing speed by using more memory. This parameter is unfortunately poorly named. It should really be called something like maxBufferedDocs. Doug Julien Nioche wrote: It is not surprising that you run out of file handles with such a large mergeFactor. Before trying more complex strategies involving RAMDirectories and/or splitting your indexing across several machines, I reckon you should try simple things like using a low mergeFactor (eg: 10) combined with a higher minMergeDocs (ex: 1000) and optimize only at the end of the process. By setting a higher value to minMergeDocs, you'll index and merge with a RAMDirectory. When the limit is reached (ex 1000) a segment is written in the FS. MergeFactor controls the number of segments to be merged, so when you have 10 segments on the FS (which is already 10x1000 docs), the IndexWriter will merge them all into a single segment. This is equivalent to an optimize I think. The process continues like that until it's finished. Combining these parameters should be enough to achieve good performance. The good point of using minMergeDocs is that you make heavy use of the RAMDirectory used by your IndexWriter (== fast) without having to be too careful with the RAM (which would be the case with RAMDirectory). At the same time keeping your mergeFactor low limits the risk of the too-many-file-handles problem. - Original Message - From: Kevin A. Burton [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Wednesday, July 07, 2004 7:44 AM Subject: Most efficient way to index 14M documents (out of memory/file handles) I'm trying to burn an index of 14M documents. I have two problems. 1. I have to run optimize() every 50k documents or I run out of file handles. this takes TIME and of course is linear to the size of the index so it just gets slower by the time I complete. It starts to crawl at about 3M documents. 2. I eventually will run out of memory in this configuration. I KNOW this has been covered before but for the life of me I can't find it in the archives, the FAQ or the wiki. I'm using an IndexWriter with a mergeFactor of 5k and then optimizing every 50k documents. Does it make sense to just create a new IndexWriter for every 50k docs and then do one big optimize() at the end? Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A.
Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: indexing help
John Wang wrote: While lucene tokenizes the words in the document, it counts the frequency and figures out the position; we are trying to bypass this stage: For each document, I have a set of words with a known frequency, e.g. java (5), lucene (6) etc. (I don't care about the position, so it can always be 0.) What I can do now is to create a dummy document, e.g. "java java java java java lucene lucene lucene lucene lucene" and pass it to lucene. This seems hacky and cumbersome. Is there a better alternative? I browsed around in the source code, but couldn't find anything. Write an analyzer that returns terms with the appropriate distribution. For example: public class VectorTokenStream extends TokenStream { private String[] terms; private int[] freqs; private int term = -1; private int freq = 0; public VectorTokenStream(String[] terms, int[] freqs) { this.terms = terms; this.freqs = freqs; } public Token next() { if (freq == 0) { term++; if (term >= terms.length) return null; freq = freqs[term]; } freq--; return new Token(terms[term], 0, 0); } } Document doc = new Document(); doc.add(Field.Text("content", "")); indexWriter.addDocument(doc, new Analyzer() { public TokenStream tokenStream(String field, Reader reader) { return new VectorTokenStream(new String[] {"java","lucene"}, new int[] {5,6}); } }); Too bad the Field class is final, otherwise I can derive from it and do something on that line... Extending Field would not help. That's why it's final. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Running OutOfMemory while optimizing and searching
What do your queries look like? The memory required for a query can be computed by the following equation: 1 Byte * Number of fields in your query * Number of docs in your index So if your query searches on all 50 fields of your 3.5 Million document index then each search would take about 175MB. If your 3-4 searches run concurrently then that's about 525MB to 700MB chewed up at once. That's not quite right. If you use the same IndexSearcher (or IndexReader) for all of the searches, then only 175MB are used. The arrays in question (the norms) are read-only and can be shared by all searches. In general, the amount of memory required is: 1 byte * Number of searchable fields in your index * Number of docs in your index plus 1k bytes * number of terms in query plus 1k bytes * number of phrase terms in query The latter are for i/o buffers. There are a few other things, but these are the major ones. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: problems with lucene in multithreaded environment
Jayant Kumar wrote: Thanks for the patch. It helped in increasing the search speed to a good extent. Good. I'll commit it. Thanks for testing it. But when we tried to give about 100 queries in 10 seconds, then again we found that after about 15 seconds, the response time per query increased. This still sounds very slow to me. Is your index optimized? What JVM are you using? You might also consider ramping up your benchmark more slowly, to warm the filesystem's cache. So, when you first launch the server, give it a few queries at a lower rate, then, after those have completed, try a higher rate. We were able to simplify the searches further by consolidating the fields in the index, but that resulted in increasing the index size to 2.5 GB, as we required fields 2-5 and fields 1-7 in different searches. That will slow updates a bit, but searching should be faster. How about your range searches? Do you know how many terms they match? The easiest way to determine this might be to insert a print statement in RangeQuery.rewrite() that shows the query before it is returned. Our indexes are on the local disk, therefore there is no network i/o involved. It does look like file i/o is now your bottleneck. The traces below show that you're using the compound file format, which combines many files into one. When two threads try to read two logically different files (.prx and .frq below) they must synchronize when the compound format is used. But if your application did not use the compound format, this synchronization would not be required. So you should try rebuilding your index with the compound format turned off; a sketch follows this message. (The fastest way to do this is simply to add and/or delete a single document, then re-optimize the index with compound format turned off. This will cause the index to be re-written in non-compound format.) Is this on Linux? If so, please try running 'iostat -x 1' while you perform your benchmark (iostat is installed by the 'sysstat' package). What percentage of the time is the disk utilized (%util)? What is the percentage of idle CPU (%idle)? What is the rate of data read (rkB/s)? If things really are i/o bound then you might consider spreading the data over multiple disks, e.g., with lvm striping or a RAID controller. If you have a lot of RAM, then you could also consider moving certain files of the index onto a ramfs-based drive. For example, moving the .tis, .frq and .prx files can greatly improve performance. Also, having these files in RAM means that the cache does not need to be warmed. Hope this helps!
Doug

"Thread-23" prio=1 tid=0x08169f38 nid=0x2867 waiting for monitor entry [69bd4000..69bd48c8]
    at org.apache.lucene.index.CompoundFileReader$CSInputStream.readInternal(CompoundFileReader.java:217)
    - waiting to lock <0x46f1b828> (a org.apache.lucene.store.FSInputStream)
    at org.apache.lucene.store.InputStream.refill(InputStream.java:158)
    at org.apache.lucene.store.InputStream.readByte(InputStream.java:43)
    at org.apache.lucene.store.InputStream.readVInt(InputStream.java:83)
    at org.apache.lucene.index.SegmentTermPositions.next(SegmentTermPositions.java:58)

"Thread-22" prio=1 tid=0x08159f78 nid=0x2866 waiting for monitor entry [69b53000..69b538c8]
    at org.apache.lucene.index.CompoundFileReader$CSInputStream.readInternal(CompoundFileReader.java:217)
    - waiting to lock <0x46f1b828> (a org.apache.lucene.store.FSInputStream)
    at org.apache.lucene.store.InputStream.refill(InputStream.java:158)
    at org.apache.lucene.store.InputStream.readByte(InputStream.java:43)
    at org.apache.lucene.store.InputStream.readVInt(InputStream.java:86)
    at org.apache.lucene.index.SegmentTermDocs.read(SegmentTermDocs.java:126)

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
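One way to carry out the rebuild recipe Doug describes above. This is a hedged sketch: it assumes the Lucene 1.4 IndexWriter.setUseCompoundFile() setter and a placeholder index path, and the dummy document added here would afterwards need to be deleted (e.g., via IndexReader):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

public class DecompoundIndex {
  public static void main(String[] args) throws Exception {
    IndexWriter writer = new IndexWriter("/path/to/index",             // path is hypothetical
                                         new StandardAnalyzer(), false);  // open existing index
    writer.setUseCompoundFile(false);    // newly written segments use one file per extension
    writer.addDocument(new Document());  // touch the index so optimize() has something to rewrite
    writer.optimize();                   // rewrites everything as a single non-compound segment
    writer.close();
  }
}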
Re: problems with lucene in multithreaded environment
Jayant Kumar wrote: Please find enclosed jvmdump.txt, which contains a dump of our search program after about 20 seconds of starting the program. Also enclosed is the file queries.txt, which contains a few sample search queries. Thanks for the data. This is exactly what I was looking for.

"Thread-14" prio=1 tid=0x080a7420 nid=0x468e waiting for monitor entry [4d61a000..4d61ac18]
    at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:112)
    - waiting to lock <0x44c95228> (a org.apache.lucene.index.TermInfosReader)

"Thread-12" prio=1 tid=0x080a58e0 nid=0x468e waiting for monitor entry [4d51a000..4d51ad18]
    at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:112)
    - waiting to lock <0x44c95228> (a org.apache.lucene.index.TermInfosReader)

These are all stuck looking terms up in the dictionary (TermInfos). Things would be much faster if your queries didn't have so many terms. Query: ( ( ( ( ( FIELD1: proof OR FIELD2: proof OR FIELD3: proof OR FIELD4: proof OR FIELD5: proof OR FIELD6: proof OR FIELD7: proof ) AND ( FIELD1: george bush OR FIELD2: george bush OR FIELD3: george bush OR FIELD4: george bush OR FIELD5: george bush OR FIELD6: george bush OR FIELD7: george bush ) ) AND ( FIELD1: script OR FIELD2: script OR FIELD3: script OR FIELD4: script OR FIELD5: script OR FIELD6: script OR FIELD7: script ) ) AND ( ( FIELD1: san OR FIELD2: san OR FIELD3: san OR FIELD4: san OR FIELD5: san OR FIELD6: san OR FIELD7: san ) OR ( ( FIELD1: war OR FIELD2: war OR FIELD3: war OR FIELD4: war OR FIELD5: war OR FIELD6: war OR FIELD7: war ) OR ( ( FIELD1: gulf OR FIELD2: gulf OR FIELD3: gulf OR FIELD4: gulf OR FIELD5: gulf OR FIELD6: gulf OR FIELD7: gulf ) OR ( ( FIELD1: laden OR FIELD2: laden OR FIELD3: laden OR FIELD4: laden OR FIELD5: laden OR FIELD6: laden OR FIELD7: laden ) OR ( ( FIELD1: ttouristeat OR FIELD2: ttouristeat OR FIELD3: ttouristeat OR FIELD4: ttouristeat OR FIELD5: ttouristeat OR FIELD6: ttouristeat OR FIELD7: ttouristeat ) OR ( ( FIELD1: pow OR FIELD2: pow OR FIELD3: pow OR FIELD4: pow OR FIELD5: pow OR FIELD6: pow OR FIELD7: pow ) OR ( FIELD1: bin OR FIELD2: bin OR FIELD3: bin OR FIELD4: bin OR FIELD5: bin OR FIELD6: bin OR FIELD7: bin ) ) ) ) ) ) ) ) ) AND RANGE: ([ 0800 TO 1100 ]) AND ( S_IDa: (7 OR 8 OR 9 OR 10 OR 11 OR 12 OR 13 OR 14 OR 15 OR 16 OR 17 ) or S_IDb: (2 ) ) All your queries look for terms in fields 1-7. If you instead combined the contents of fields 1-7 in a single field, and searched that field, then your searches would contain far fewer terms and be much faster (a sketch follows this message). Also, I don't know how many terms your RANGE queries match, but they could also be introducing large numbers of terms, which would slow things down too. But, still, you have identified a bottleneck: TermInfosReader caches a TermEnum, and hence access to it must be synchronized. Caching the enum greatly speeds sequential access to terms, e.g., when merging, performing range or prefix queries, etc. Perhaps, however, the caching should be done through a ThreadLocal, giving each thread its own cache and obviating the need for synchronization... Please tell me if you are able to simplify your queries and whether that speeds things up. I'll look into a ThreadLocal-based solution too. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
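One way to apply the combine-the-fields advice at index time. The field names, the catch-all "ALL" field, and the decision to keep the originals are assumptions for illustration, not details of Jayant's setup:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class CombinedFieldDoc {
  // Build a document whose 1-7 fields are also copied into one catch-all field,
  // so a query can hit a single field instead of OR-ing seven clauses per word.
  public static Document build(String[] fieldValues) {
    Document doc = new Document();
    StringBuffer all = new StringBuffer();
    for (int i = 0; i < fieldValues.length; i++) {
      doc.add(Field.Text("FIELD" + (i + 1), fieldValues[i]));  // keep originals if still needed
      all.append(fieldValues[i]).append(' ');
    }
    doc.add(Field.Text("ALL", all.toString()));  // "ALL" is a hypothetical name
    return doc;
  }
}

The query above would then shrink from seven clauses per word to one, e.g. ALL:proof instead of the FIELD1..FIELD7 disjunction.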
Re: problems with lucene in multithreaded environment
Doug Cutting wrote: Please tell me if you are able to simplify your queries and whether that speeds things up. I'll look into a ThreadLocal-based solution too. I've attached a patch that should help with the thread contention, although I've not tested it extensively. I still don't fully understand why your searches are so slow, though. Are the indexes stored on the local disk of the machine? Indexes accessed over the network can be very slow. Anyway, give this patch a try. Also, if anyone else can try this and report back whether it makes multi-threaded searching faster, or anything else slower, or is buggy, that would be great. Thanks, Doug

Index: src/java/org/apache/lucene/index/TermInfosReader.java
===================================================================
RCS file: /home/cvs/jakarta-lucene/src/java/org/apache/lucene/index/TermInfosReader.java,v
retrieving revision 1.6
diff -u -u -r1.6 TermInfosReader.java
--- src/java/org/apache/lucene/index/TermInfosReader.java	20 May 2004 11:23:53 -0000	1.6
+++ src/java/org/apache/lucene/index/TermInfosReader.java	4 Jun 2004 21:45:15 -0000
@@ -29,7 +29,8 @@
   private String segment;
   private FieldInfos fieldInfos;
 
-  private SegmentTermEnum enumerator;
+  private ThreadLocal enumerators = new ThreadLocal();
+  private SegmentTermEnum origEnum;
   private long size;
 
   TermInfosReader(Directory dir, String seg, FieldInfos fis)
@@ -38,19 +39,19 @@
     segment = seg;
     fieldInfos = fis;
 
-    enumerator = new SegmentTermEnum(directory.openFile(segment + ".tis"),
-                                     fieldInfos, false);
-    size = enumerator.size;
+    origEnum = new SegmentTermEnum(directory.openFile(segment + ".tis"),
+                                   fieldInfos, false);
+    size = origEnum.size;
     readIndex();
   }
 
   public int getSkipInterval() {
-    return enumerator.skipInterval;
+    return origEnum.skipInterval;
   }
 
   final void close() throws IOException {
-    if (enumerator != null)
-      enumerator.close();
+    if (origEnum != null)
+      origEnum.close();
   }
 
   /** Returns the number of term/value pairs in the set. */
@@ -58,6 +59,15 @@
     return size;
   }
 
+  private SegmentTermEnum getEnum() {
+    SegmentTermEnum enum = (SegmentTermEnum)enumerators.get();
+    if (enum == null) {
+      enum = terms();
+      enumerators.set(enum);
+    }
+    return enum;
+  }
+
   Term[] indexTerms = null;
   TermInfo[] indexInfos;
   long[] indexPointers;
@@ -102,16 +112,17 @@
   }
 
   private final void seekEnum(int indexOffset) throws IOException {
-    enumerator.seek(indexPointers[indexOffset],
-                    (indexOffset * enumerator.indexInterval) - 1,
+    getEnum().seek(indexPointers[indexOffset],
+                   (indexOffset * getEnum().indexInterval) - 1,
                     indexTerms[indexOffset], indexInfos[indexOffset]);
   }
 
   /** Returns the TermInfo for a Term in the set, or null. */
-  final synchronized TermInfo get(Term term) throws IOException {
+  TermInfo get(Term term) throws IOException {
     if (size == 0) return null;
 
-    // optimize sequential access: first try scanning cached enumerator w/o seeking
+    // optimize sequential access: first try scanning cached enum w/o seeking
+    SegmentTermEnum enumerator = getEnum();
     if (enumerator.term() != null                 // term is at or past current
         && ((enumerator.prev != null && term.compareTo(enumerator.prev) > 0)
             || term.compareTo(enumerator.term()) >= 0)) {
@@ -128,6 +139,7 @@
 
   /** Scans within block for matching term. */
   private final TermInfo scanEnum(Term term) throws IOException {
+    SegmentTermEnum enumerator = getEnum();
     while (term.compareTo(enumerator.term()) > 0 && enumerator.next()) {}
     if (enumerator.term() != null && term.compareTo(enumerator.term()) == 0)
       return enumerator.termInfo();
@@ -136,10 +148,12 @@
   }
 
   /** Returns the nth term in the set. */
-  final synchronized Term get(int position) throws IOException {
+  final Term get(int position) throws IOException {
     if (size == 0) return null;
 
-    if (enumerator != null && enumerator.term() != null && position >= enumerator.position
+    SegmentTermEnum enumerator = getEnum();
+    if (enumerator != null && enumerator.term() != null &&
+        position >= enumerator.position
         && position < (enumerator.position + enumerator.indexInterval))
       return scanEnum(position);               // can avoid seek
@@ -148,6 +162,7 @@
   }
 
   private final Term scanEnum(int position) throws IOException {
+    SegmentTermEnum enumerator = getEnum();
     while(enumerator.position < position)
       if (!enumerator.next())
         return null;
@@ -156,12 +171,13 @@
   }
 
   /** Returns the position of a Term in the set or -1. */
-  final synchronized long getPosition(Term term) throws IOException {
+  final long getPosition(Term term) throws IOException {
     if (size == 0) return -1;
 
     int indexOffset = getIndexOffset(term);
     seekEnum(indexOffset);
+    SegmentTermEnum enumerator = getEnum();
     while
Re: problems with lucene in multithreaded environment
Jayant Kumar wrote: We recently tested lucene with an index size of 2 GB, which has about 1,500,000 documents, each document having about 25 fields. The frequency of search was about 20 queries per second. This resulted in an average response time of about 20 seconds per search. That sounds slow, unless your queries are very complex. What are your queries like? What we observed was that lucene queues the queries and does not release them until the results are found, so the queries that come in later take about 500 seconds. Please let us know whether there is a technique to optimize lucene in such circumstances. Multiple queries executed from different threads using a single searcher should not queue, but should run in parallel. A technique to find out where threads are queueing is to get a thread dump and see where all of the threads are stuck. On Solaris and Linux, sending the JVM a SIGQUIT will produce a thread dump. On Windows, use Control-Break. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
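A minimal sketch of the single-shared-searcher pattern Doug is describing; the class, method names, and index path are hypothetical:

import java.io.IOException;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class SearchService {
  private static IndexSearcher searcher;  // one searcher shared by all request threads
  static {
    try {
      searcher = new IndexSearcher("/path/to/index");  // path is hypothetical
    } catch (IOException e) {
      throw new RuntimeException(e.toString());
    }
  }
  public static Hits search(Query query) throws IOException {
    return searcher.search(query);  // safe to call concurrently from many threads
  }
}

Creating a new IndexSearcher per request, by contrast, re-reads the norms each time and is a common cause of both slowness and memory growth.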
Re: Memory usage
James Dunn wrote: Also I search across about 50 fields but I don't use wildcard or range queries. Lucene uses one byte of RAM per document per searched field, to hold the normalization values. So if you search a 10M document collection with 50 fields, then you'll end up using 500MB of RAM. If you're using unanalyzed fields, then an easy workaround to reduce the number of fields is to combine many in a single field. So, instead of, e.g., using an f1 field with value abc, and an f2 field with value efg, use a single field named f with values 1_abc and 2_efg. We could optimize this in Lucene. If no values of an indexed field are analyzed, then we could store no norms for the field and hence read none into memory. This wouldn't be too hard to implement... Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
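A sketch of the prefix workaround Doug describes above for unanalyzed fields; the names and values echo his example, and Field.Keyword / TermQuery are the 1.4-era calls one would presumably use:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.TermQuery;

public class PrefixedKeywords {
  public static Document build() {
    Document doc = new Document();
    // instead of doc.add(Field.Keyword("f1", "abc")) and doc.add(Field.Keyword("f2", "efg")):
    doc.add(Field.Keyword("f", "1_abc"));  // prefix encodes the old field name
    doc.add(Field.Keyword("f", "2_efg"));
    return doc;
  }
  public static TermQuery queryF1Abc() {
    return new TermQuery(new Term("f", "1_abc"));  // was: new Term("f1", "abc")
  }
}

Only the single f field then carries norms, so the 50-fields-times-10M-docs cost collapses to one byte per document.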
Re: Memory usage
It is cached by the IndexReader and lives until the index reader is garbage collected. 50-70 searchable fields is a *lot*. How many are analyzed text, and how many are simply keywords? Doug James Dunn wrote: Doug, Thanks! I just asked a question regarding how to calculate the memory requirements for a search. Does this memory get used only during the search operation itself, or is it referenced by the Hits object or anything else after the actual search completes? Thanks again, Jim - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]