Re: knowing which field contributed the search result
Hi David: Can you explain further which calls specifically would solve my problem? Thanks -John

On Mon, 21 Feb 2005 12:20:15 -0800, David Spencer [EMAIL PROTECTED] wrote:

John Wang wrote: Does anyone have any thoughts on this?

Does this help?
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Searchable.html#explain(org.apache.lucene.search.Query,%20int)

Thanks -John

On Wed, 16 Feb 2005 14:39:52 -0800, John Wang [EMAIL PROTECTED] wrote:

Hi: Is there a way, given a hit from a search, to find out which fields contributed to the hit? e.g. If my search is contents1=brown fox OR contents2=black bear, can the document found by this query also carry information on whether it was found via contents1, contents2, or both? Thanks -John
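For reference, a minimal sketch of the explain() approach David points to, assuming an IndexSearcher and the contents1/contents2 query are already built (the index path is hypothetical). Explanation.toString() itemizes the score per clause, so a field whose clause contributed nothing shows up with a zero score:

    // Sketch: inspect which clauses of an OR query matched a given hit.
    IndexSearcher searcher = new IndexSearcher("/path/to/index");
    Hits hits = searcher.search(query);
    for (int i = 0; i < hits.length(); i++) {
        // explain() re-scores one document and breaks the score down by clause
        Explanation exp = searcher.explain(query, hits.id(i));
        System.out.println(exp.toString());
    }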
Re: knowing which field contributed the search result
Does anyone have any thoughts on this? Thanks -John

On Wed, 16 Feb 2005 14:39:52 -0800, John Wang [EMAIL PROTECTED] wrote:

Hi: Is there a way, given a hit from a search, to find out which fields contributed to the hit? e.g. If my search is contents1=brown fox OR contents2=black bear, can the document found by this query also carry information on whether it was found via contents1, contents2, or both? Thanks -John
knowing which field contributed the search result
Hi: Is there a way, given a hit from a search, to find out which fields contributed to the hit? e.g. If my search is contents1=brown fox OR contents2=black bear, can the document found by this query also carry information on whether it was found via contents1, contents2, or both? Thanks -John
Re: google mini? who needs it when Lucene is there
I think the Google mini also includes crawling and a server wrapper, so it is not entirely a 1-to-1 comparison. Of course, extending Lucene to have those features is not at all difficult anyway. -John

On Thu, 27 Jan 2005 16:04:54 -0800 (PST), Xiaohong Yang (Sharon) [EMAIL PROTECTED] wrote:

Hi, I agree that the Google mini is quite expensive. It might be similar to the desktop version in quality. Does anyone know Google's ratio of index to text? Is it true that Lucene's index is about 500 times the original text size (not including image size)? I don't have one installed, so I cannot measure. Best, Sharon

jian chen [EMAIL PROTECTED] wrote:

Hi, I was searching using Google and just found that there was a new feature called Google mini. Initially I thought it was another free service for small companies. Then I realized that it costs quite some money ($4,995) for the hardware and software. (I guess the proprietary software costs a whole lot more than the actual hardware.) The catch is that you can only index up to 50,000 documents at this price. If you need to index more, sorry, send in the check...

It seems to me that any small business will be ripped off if they install this Google mini thing, compared to using Lucene to implement an easy-to-use search application, which could search up to whatever number of documents you could imagine.

I hope the Lucene project gets more exposure in the enterprise, so that people know they have not only cheaper but, more importantly, BETTER alternatives. Jian
lucene2.0 and transaction support
Hi: When is Lucene 2.0 scheduled to be released? Is there a javadoc somewhere so we can check out the new APIs?

Is there a plan to add transaction support to Lucene? This is something we need; if we do implement it ourselves, is it too large a change for a patch? Thanks -John
Re: reading fields selectively
Thanks guys for the info! After looking at the patch code I have two problems:

1) The patch implementation doesn't help with performance. It still reads the data for every field in the document; it just doesn't store all of them. So this implementation helps if there are memory restrictions, but not if you are after performance.

2) We are bundling Lucene in our application, and we are trying very hard not to change Lucene code and thus diverge from the Lucene code base. This patch implementation requires changes to SegmentReader.java, which I am hoping to avoid.

Any ideas? Thanks -John

On Fri, 7 Jan 2005 08:59:25 +0000 (GMT), mark harwood [EMAIL PROTECTED] wrote:

There is no API for this, but I recall somebody talking about adding support for it a few months back. See http://marc.theaimsgroup.com/?l=lucene-dev&m=109485996612177&w=2

This implementation was working on a version of Lucene before compression was introduced, so things may have changed a little. Cheers, Mark
Re: setting Similarity at search time
Hi Chuck: Trying to follow up on this thread. Do you know if this feature will be incorporated into the next Lucene release? How would someone find out which patches will go into the next release? Thanks -John

On Mon, 15 Nov 2004 13:05:36 -0800, Chuck Williams [EMAIL PROTECTED] wrote:

Take a look at this: http://issues.apache.org/bugzilla/show_bug.cgi?id=31841

Not my initial patch, but the latest patch from Wolf Siberski. I haven't used it yet, but it looks like what you are looking for, and something I want to use too. Chuck

-----Original Message-----
From: Ken McCracken [mailto:[EMAIL PROTECTED]]
Sent: Monday, November 15, 2004 11:31 AM
To: Lucene Users List
Subject: setting Similarity at search time

Hi, Is there a way to set the Similarity at search(...) time, rather than just setting it on the (Index)Searcher object itself? I'd like to be able to specify different similarities in different threads searching concurrently, using the same IndexSearcher instance. In my use case, the choice of Similarity is a parameter of the search request, and hence may be different for each request. Can such a method be added to override the search(...) method? Thanks, -Ken
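Until such a patch lands, one workaround (a sketch, not tested) is to share a single IndexReader and wrap it in a cheap per-request IndexSearcher, since setSimilarity() is a per-searcher setting; "requestSimilarity" below stands for whatever Similarity the request supplies:

    // Sketch: per-request Similarity over one shared IndexReader.
    IndexReader reader = IndexReader.open("/path/to/index"); // opened once, shared

    // per search request:
    IndexSearcher searcher = new IndexSearcher(reader);  // lightweight wrapper
    searcher.setSimilarity(requestSimilarity);           // Similarity chosen by this request
    Hits hits = searcher.search(query);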
Re: multi-threaded thru-put in lucene
I actually ran a few tests, but I am seeing similar behavior. After removing all the possible variations, this is what I used:

1 index, doc count 15,000, using FSDirectory (e.g. new IndexSearcher(String path); by default I think it uses FSDirectory).

Each thread is doing 100 iterations of search, e.g.:

    for (int i = 0; i < 100; ++i) {
        idxSearcher.search(q);
    }

For each thread and each iteration, I am using the same query. I am timing them the following way:

    long start = System.currentTimeMillis();
    for (int i = 0; i < threadCount; ++i) {
        thread[i].start();
    }
    for (int i = 0; i < threadCount; ++i) {
        thread[i].join();
    }
    long duration = System.currentTimeMillis() - start;

The duration numbers I am getting are:

1 thread: 445 ms.
2 threads: 870 ms.
5 threads: 2200 ms.

Pretty much the same numbers you'd get if you ran them sequentially. Any ideas? Am I doing something wrong? Thanks in advance for all your help -John

On Thu, 6 Jan 2005 00:06:09 -0800 (PST), Chris Hostetter [EMAIL PROTECTED] wrote:

: This is what we found:
: 1 thread, search takes 20 ms.
: 2 threads, search takes 40 ms.
: 5 threads, search takes 100 ms.

How big is your index? What are the term frequencies like in your index? How many different queries did you try? What was the structure of your query objects like? Were you using a RAMDirectory or an FSDirectory? What hardware were you running on? Is your test application small enough that you can post it to the list?

I haven't done a lot of performance testing of Lucene, but from what limited testing I have done I'm a little surprised at those numbers; you'd get results just as good if you ran the queries sequentially. -Hoss
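For anyone trying to reproduce this, the harness being described is roughly the following (a sketch; the index path and query construction are assumptions, and the enclosing method is assumed to declare throws Exception):

    // Sketch of the multi-threaded search benchmark described above.
    final IndexSearcher idxSearcher = new IndexSearcher("/path/to/index");
    final Query q = QueryParser.parse("brown fox", "contents", new SimpleAnalyzer());
    Thread[] thread = new Thread[threadCount];
    for (int i = 0; i < threadCount; ++i) {
        thread[i] = new Thread() {
            public void run() {
                try {
                    for (int j = 0; j < 100; ++j) {
                        idxSearcher.search(q);   // shared searcher, same query
                    }
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        };
    }
    long start = System.currentTimeMillis();
    for (int i = 0; i < threadCount; ++i) thread[i].start();
    for (int i = 0; i < threadCount; ++i) thread[i].join();
    System.out.println((System.currentTimeMillis() - start) + " ms");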
Re: multi-threaded thru-put in lucene
Is the operation IndexSearcher.search I/O or CPU bound if I am doing 100's of searches on the same query? Thanks -John

On Thu, 06 Jan 2005 10:31:49 -0800, Doug Cutting [EMAIL PROTECTED] wrote:

John Wang wrote:
1 thread: 445 ms.
2 threads: 870 ms.
5 threads: 2200 ms.
Pretty much the same numbers you'd get if you ran them sequentially. Any ideas? Am I doing something wrong?

If you're performing compute-bound work on a single-processor machine, then threading should give you no better performance than sequential execution, perhaps a bit worse. If you're performing io-bound work on a single-disk machine, then threading should again provide no improvement. If the task is evenly compute- and i/o-bound, then you could achieve at best a 2x speedup on a single-CPU system with a single disk. If you're compute-bound on an N-CPU system, then threading should optimally be able to provide a factor of N speedup.

Java's scheduling of compute-bound threads when no threads call Thread.sleep() can also be very unfair. Doug
Re: multi-threaded thru-put in lucene
Thanks Doug! You are right: adding a Thread.sleep() helped greatly. Mysteries of Java...

Another Java threading question. With 1 thread and 100 search iterations, it took about 850 ms. After adding a Thread.sleep(10) in the loop (which should add about 100 x 10 = 1000 ms), it is taking about 2200 ms. So there is 2200 - 1850 = 350 ms unaccounted for. Is that due to thread scheduling/context switching? Thanks -John

On Thu, 6 Jan 2005 10:36:12 -0800, John Wang [EMAIL PROTECTED] wrote:

Is the operation IndexSearcher.search I/O or CPU bound if I am doing 100's of searches on the same query? Thanks -John

On Thu, 06 Jan 2005 10:31:49 -0800, Doug Cutting [EMAIL PROTECTED] wrote:

John Wang wrote:
1 thread: 445 ms.
2 threads: 870 ms.
5 threads: 2200 ms.
Pretty much the same numbers you'd get if you ran them sequentially. Any ideas? Am I doing something wrong?

If you're performing compute-bound work on a single-processor machine, then threading should give you no better performance than sequential execution, perhaps a bit worse. If you're performing io-bound work on a single-disk machine, then threading should again provide no improvement. If the task is evenly compute- and i/o-bound, then you could achieve at best a 2x speedup on a single-CPU system with a single disk. If you're compute-bound on an N-CPU system, then threading should optimally be able to provide a factor of N speedup.

Java's scheduling of compute-bound threads when no threads call Thread.sleep() can also be very unfair. Doug
reading fields selectively
Hi: Is there some way to read only one field value from an index given a docID? With the current API, in order to get a field given a docID, I would call IndexSearcher.document(docID), which in turn reads in all fields from the disk.

Here is my problem: after the search, I have a set of docIDs. For each document, I have a unique string identifier. At this point I only need these identifiers, but with the above API I am forced to read the entire set of stored fields for each document in the search result, which in my case can be very large.

Is there an alternative? I am thinking along the lines of a call:

    Field[] getFields(int docID, String fieldName);

Thanks -John
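One workaround that avoids stored fields entirely (a sketch, assuming the identifier is indexed as a Field.Keyword named "key"): walk the term dictionary for that field once and build a docID-to-identifier table, since TermEnum and TermDocs never touch the stored-field file:

    // Sketch: map docID -> identifier without loading any stored fields.
    IndexReader reader = IndexReader.open("/path/to/index");
    String[] keyByDoc = new String[reader.maxDoc()];
    TermEnum terms = reader.terms(new Term("key", "")); // positioned at the first "key" term
    try {
        while (terms.term() != null && "key".equals(terms.term().field())) {
            TermDocs docs = reader.termDocs(terms.term());
            try {
                while (docs.next()) {
                    keyByDoc[docs.doc()] = terms.term().text();
                }
            } finally {
                docs.close();
            }
            if (!terms.next()) break;
        }
    } finally {
        terms.close();
    }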
multi-threaded thru-put in lucene
Hi folks: We are trying to measure the throughput of Lucene in a multi-threaded environment. This is what we found:

1 thread, search takes 20 ms.
2 threads, search takes 40 ms.
5 threads, search takes 100 ms.

It seems that under a multi-threaded scenario, throughput isn't good: performance is not any better than with 1 thread. I tried sharing an IndexSearcher amongst all threads as well as having an IndexSearcher per thread. Both yield the same numbers. Is this consistent with what you'd expect? Thanks -John
Re: Remotely Index
One way is to create a reader from a URL to your file (assuming the file is hosted somewhere reachable by a URL):

    Reader r = new InputStreamReader(url.openStream());
    Document doc = new Document();
    doc.add(Field.Keyword("url", url.toString()));
    doc.add(Field.Text("contents", r));
    iw.addDocument(doc);

-John

On Thu, 16 Dec 2004 16:07:57 +0530, Natarajan.T [EMAIL PROTECTED] wrote:

Hi All, How do I index remotely? For example, I have some documents on machine A and the Lucene indexing and searching server on machine B. How can I do the indexing? Regards, Natarajan.
File locking using java.nio.channels.FileLock
Hi: When is Lucene planning on moving to Java 1.4+? I see there are some problems caused by the current lock file implementation, e.g. Bug# 32171. These would be easily fixed by using the java.nio.channels.FileLock class. Thanks -John
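For reference, the java.nio locking being suggested looks roughly like this (a sketch; the lock-file path is passed as an argument here, and error handling is minimal):

    import java.io.RandomAccessFile;
    import java.nio.channels.FileChannel;
    import java.nio.channels.FileLock;

    public class NioLockDemo {
        public static void main(String[] args) throws Exception {
            RandomAccessFile raf = new RandomAccessFile(args[0], "rw");
            FileChannel channel = raf.getChannel();
            FileLock lock = channel.lock(); // blocks; tryLock() returns null instead
            try {
                // ... critical section: modify the index ...
            } finally {
                lock.release(); // the OS also releases the lock if the process dies
                channel.close();
            }
        }
    }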
Re: finalize delete without optimize
Hi Otis: Thanks for your reply. I am looking for more of an API call than a tool, e.g. IndexWriter.finalizeDelete(). If I implement this, how would I go about submitting a patch? Thanks -John

On Mon, 13 Dec 2004 22:24:12 -0800 (PST), Otis Gospodnetic [EMAIL PROTECTED] wrote:

Hello John, I believe you didn't get any replies to this. What you are describing cannot be done using the public API, but may (no source code on this machine, so I can't double-check) be doable if you use some of the 'internal' methods. I don't have the need for this, but others might, so it may be worth developing a tool that purges Documents marked as deleted without the expensive segment merging, if that is possible. If you put this tool under the appropriate org.apache.lucene... package, you'll get access to the 'internal' methods, of course. If you end up creating this, we could stick it in the Sandbox, where we should really create a new section for handy command-line tools that manipulate the index. Otis

--- John Wang [EMAIL PROTECTED] wrote:

Hi: Is there a way to finalize deletes, i.e. actually remove them from the segments and make sure the docIDs are contiguous again? The only explicit way to do this is by calling IndexWriter.optimize(), but that call does a lot more (it also merges all the segments), and hence is very expensive. Is there a way to simply finalize the deletes without having to merge all the segments? If not, I'd be glad to submit an implementation of this feature if the Lucene devs agree it is useful. Thanks -John
Re: Lucene Vs Ixiasoft
I thought Lucene implements the Boolean model. -John

On Thu, 9 Dec 2004 00:19:21 +0100, Nicolas Maisonneuve [EMAIL PROTECTED] wrote:

Hi, think first of the relevance of the ranking model in these two search engines for XML document retrieval. Lucene is a classic full-text search engine using the vector space model. This model is efficient for indexing unstructured documents (like plain text files) but is not made for structured documents like XML. There is an XML demo in the Lucene sandbox, but it's not really very efficient, because it doesn't take advantage of the document structure in the indexing and the ranking model, so it loses semantic information and relevance. I don't know Ixiasoft; check the information to see how it indexes and ranks XML documents. Nicolas

On Wed, 8 Dec 2004 14:20:45 -0500, Praveen Peddi [EMAIL PROTECTED] wrote:

Does anyone know about the Ixiasoft server? It's an XML repository/search engine. If anyone knows about it, how does it compare to Lucene? Which is faster? Praveen
finalize delete without optimize
Hi: Is there a way to finalize deletes, i.e. actually remove them from the segments and make sure the docIDs are contiguous again? The only explicit way to do this is by calling IndexWriter.optimize(), but that call does a lot more (it also merges all the segments), and hence is very expensive. Is there a way to simply finalize the deletes without having to merge all the segments? If not, I'd be glad to submit an implementation of this feature if the Lucene devs agree it is useful. Thanks -John
Re: Recommended values for mergeFactor, minMergeDocs, maxMergeDocs
We've found something interesting about mergeFactor. We are indexing a million documents in batches of 1000. We first set the mergeFactor to 1000. What we found is that at every 10th commit we see a significant spike in indexing time: the indexer is merging segments at every 10th commit (i.e. every 10 * mergeFactor documents), and since the mergeFactor is large, the merge time is also long. The example given in the previous email thread indexes identical documents, so merging is very fast (no new terms are introduced as indexing proceeds), which may hide this overhead. We found mergeFactor=100 worked well for our application. Cheers -John

On Fri, 3 Dec 2004 16:38:34 -0800 (PST), Otis Gospodnetic [EMAIL PROTECTED] wrote:

In my experiments with mergeFactor I found the point of diminishing/no returns. If I remember correctly, I hit the limit at a mergeFactor of 50. But here is something from Lucene in Action that you can use to play with the various index tuning factors and see their effect on indexing performance. It's simple, and if you want to test all 3 of your scenarios, you will have to modify it.

    package lia.indexing;

    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    /**
     * Demo of the effect of mergeFactor, maxMergeDocs and minMergeDocs.
     */
    public class IndexTuningDemo {

      public static void main(String[] args) throws Exception {
        int docsInIndex = Integer.parseInt(args[0]);

        // create an index called 'index-dir' in a temp directory
        Directory dir = FSDirectory.getDirectory(
            System.getProperty("java.io.tmpdir", "tmp") +
            System.getProperty("file.separator") + "index-dir", true);
        Analyzer analyzer = new SimpleAnalyzer();
        IndexWriter writer = new IndexWriter(dir, analyzer, true);

        // set variables that affect speed of indexing
        writer.mergeFactor = Integer.parseInt(args[1]);
        writer.maxMergeDocs = Integer.parseInt(args[2]);
        writer.minMergeDocs = Integer.parseInt(args[3]);
        writer.infoStream = System.out;

        System.out.println("Merge factor:   " + writer.mergeFactor);
        System.out.println("Max merge docs: " + writer.maxMergeDocs);
        System.out.println("Min merge docs: " + writer.minMergeDocs);

        long start = System.currentTimeMillis();
        for (int i = 0; i < docsInIndex; i++) {
          Document doc = new Document();
          doc.add(Field.Text("fieldname", "Bibamus"));
          writer.addDocument(doc);
        }
        writer.close();
        long stop = System.currentTimeMillis();
        System.out.println("Time: " + (stop - start) + " ms");
      }
    }

Otis

--- Chuck Williams [EMAIL PROTECTED] wrote:

I'm wondering what values of mergeFactor, minMergeDocs and maxMergeDocs people have found to yield the best performance for different configurations. Is there a repository of this information anywhere? I've got about 30k documents and have 3 indexing scenarios:

1. Full indexing and optimize
2. Incremental indexing and optimize
3. Parallel incremental indexing without optimize

Search performance is critical. For both cases 1 and 2, I'd like the fastest possible indexing time. For case 3, I'd like minimal pauses and no noticeable degradation in search performance.

Based on reading the code (including the javadoc comments), I'm thinking of values along these lines:

mergeFactor: 1000 during full indexing and during optimize (for both cases 1 and 2); 10 during incremental indexing (cases 2 and 3)
minMergeDocs: 1000 during full indexing, 10 during incremental indexing
maxMergeDocs: Integer.MAX_VALUE during full indexing, 1000 during incremental indexing

Do these values seem reasonable? Are there better settings to try before I start experimenting? Since mergeFactor is used in both addDocument() and optimize(), I'm thinking of using two different values in case 2: 10 during the incremental indexing, and then 1000 during the optimize. Is changing the value like this going to cause a problem? Thanks for any advice, Chuck
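For what it's worth, the IndexTuningDemo above takes four arguments (document count, mergeFactor, maxMergeDocs, minMergeDocs), so a pair of runs comparing two merge factors might look like this (the numbers are arbitrary):

    java lia.indexing.IndexTuningDemo 100000 10 2147483647 10
    java lia.indexing.IndexTuningDemo 100000 1000 2147483647 1000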
Re: URGENT: Help indexing large document set
Hi Chuck: Thanks for your help and the info. By some experimentation, I found that when calling addIndexes(ramDirectory) on the FS-based IndexWriter, it actually performs a merge with the existing index. So doing 2000 batches of 500, as the index grows after each batch, the time to do each merge increases. I guess in this implementation, doing it this way is not optimal. Thanks -John

On Sat, 27 Nov 2004 13:14:31 -0800, Chuck Williams [EMAIL PROTECTED] wrote:

Hi John, I don't use a RAMDirectory and so don't have the answer for you. There have been a number of messages about RAMDirectory performance on lucene-user, including some reported benchmarks. Some people have reported a significant benefit from RAMDirectory's, but most others have seen little or no benefit. I'm not sure which factors determine the nature or magnitude of the impact. You sent the message below just to me -- you might want to post the question on lucene-user. I've included a couple of messages on the subject that I saved. Chuck

Included messages:

-----Original Message-----
From: Jonathan Hager [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, November 24, 2004 2:27 PM
To: Lucene Users List
Subject: Re: Index in RAM - is it really worthy?

When comparing RAMDirectory and FSDirectory it is important to mention what OS you are using. When using Linux, it will cache the most recent disk accesses in memory. Here is a good article that describes its strategy: http://forums.gentoo.org/viewtopic.php?t=175419

The 2% difference you are seeing is the memory copy. With other OSes you may see a speedup when using the RAMDirectory, because not all OSes maintain a disk cache in memory and so must access the disk to read the index.

Another consideration: there is currently a 2GB limitation on the size of a RAMDirectory. An index over 2GB causes an overflow in the int used to create the buffer [see int len = (int) is.length(); in RAMDirectory].

I ended up using a RAMDirectory for a very different reason. The index is 1 to 2 MB and is rebuilt every few hours. It takes 3 to 4 minutes to query the database and rebuild the index, but the search should be available 100% of the time. Since the index is so small, I do the following:

on server startup:
- look for the semaphore; if it is there, delete the index
- if there is no index, build it in an FSDirectory
- load the index from the FSDirectory into a RAMDirectory

on reindex:
- create the semaphore
- rebuild the index in the FSDirectory
- delete the semaphore
- load the index from the FSDirectory into a RAMDirectory

to search:
- search the RAMDirectory

The RAMDirectory could be replaced by a regular FSDirectory, but it seemed silly to copy the index from disk to disk when it ultimately needs to be in memory. The FSDirectory could be replaced by a RAMDirectory, but that would make the server take 3 to 4 minutes longer to start up every time. By persisting the index, this time is only necessary if indexing was interrupted. Jonathan

On Mon, 22 Nov 2004 12:39:07 -0800, Kevin A. Burton [EMAIL PROTECTED] wrote:

Otis Gospodnetic wrote: For the Lucene book I wrote some test cases that compare FSDirectory and RAMDirectory. What I found was that with certain settings FSDirectory was almost as fast as RAMDirectory. Personally, I would push FSDirectory and hope that the OS and the filesystem do their share of work and caching for me before looking for ways to optimize my code.

Yes... I performed the same benchmark, and in my situation RAMDirectory for searches was about 2% slower. I'm willing to bet that it has to do with the fact that it's a Hashtable and not a HashMap (which isn't synchronized). Also, adding a constructor for the term size could make loading a RAMDirectory faster, since you could prevent a rehash. If you're on a modern machine, your filesystem cache will end up buffering your disk anyway, which I'm sure was happening in my situation. Kevin
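Jonathan's "load the index from the FSDirectory into a RAMDirectory" step maps onto RAMDirectory's copy constructor, if memory serves (a sketch; the path is an assumption):

    // Sketch: rebuild on disk, serve searches from RAM.
    Directory fsDir = FSDirectory.getDirectory("/path/to/index", false);
    Directory ramDir = new RAMDirectory(fsDir);         // copies the on-disk index into RAM
    IndexSearcher searcher = new IndexSearcher(ramDir); // searches never touch the disk copy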
Re: URGENT: Help indexing large document set
Thanks Paul! Using your suggestion, I have changed the update-check code to use only the IndexReader:

    try {
        localReader = IndexReader.open(path);
        while (keyIter.hasNext()) {
            key = (String) keyIter.next();
            term = new Term("key", key);
            TermDocs tDocs = localReader.termDocs(term);
            if (tDocs != null) {
                try {
                    while (tDocs.next()) {
                        localReader.delete(tDocs.doc());
                    }
                } finally {
                    tDocs.close();
                }
            }
        }
    } finally {
        if (localReader != null) {
            localReader.close();
        }
    }

Unfortunately it didn't seem to make any dramatic difference. I also see the CPU is only 30-50% busy, so I am guessing it's spending a lot of time in I/O. Any way of making the CPU work harder? Is a batch size of 500 too small for 1 million documents? Currently I am seeing a linear speed degradation of 0.3 milliseconds per document. Thanks -John

On Wed, 24 Nov 2004 09:05:39 +0100, Paul Elschot [EMAIL PROTECTED] wrote:

On Wednesday 24 November 2004 00:37, John Wang wrote:

Hi: I am trying to index 1M documents, in batches of 500. Each document has a unique text key, which is added as a Field.Keyword(name, value). For each batch of 500, I need to make sure I am not adding a document with a key that is already in the current index. To do this, I am calling IndexSearcher.docFreq for each document and deleting the document currently in the index with the same key:

    while (keyIter.hasNext()) {
        String objectID = (String) keyIter.next();
        term = new Term("key", objectID);
        int count = localSearcher.docFreq(term);
        ...
    }

To speed this up a bit, make sure that the iterator gives the terms in sorted order. I'd use an index reader instead of a searcher, but that will probably not make a difference.

Adding the documents can be done with multiple threads. Last time I checked, there was a moderate speedup using three threads instead of one on a single-CPU machine. Tuning the values of minMergeDocs and maxMergeDocs may also help to increase the performance of adding documents. Regards, Paul Elschot
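A side note on Paul's sorted-order tip: if the keys do not already arrive sorted, a TreeSet is a cheap way to feed them to the loop in term-dictionary order ("keys" here is a hypothetical Collection of the ID strings):

    // Sketch: iterate deletion keys in sorted order, matching the term dictionary.
    Iterator keyIter = new TreeSet(keys).iterator();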
Re: Too many open files issue
I have also seen this problem. In the Lucene code, I don't see where the Reader specified when creating a field is closed; that holds on to the file. I am looking at DocumentWriter.invertDocument(). Thanks -John

On Mon, 22 Nov 2004 16:21:35 -0600, Chris Lamprecht [EMAIL PROTECTED] wrote:

A useful resource for increasing the number of file handles on various operating systems is the Volano Report: http://www.volano.com/report/

I had requested help on an issue we have been facing with the "Too many open files" exception garbling the search indexes and crashing the search on the web site.
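A workaround for the unclosed-Reader issue John describes is to own the Reader yourself and close it once the document has been added; the field's content is tokenized during addDocument(), so closing afterwards is safe (a sketch; "file" and "writer" are assumed to exist):

    // Sketch: close the content Reader explicitly instead of relying on Lucene.
    Reader contents = new FileReader(file);
    try {
        Document doc = new Document();
        doc.add(Field.Text("contents", contents));
        writer.addDocument(doc); // reads and tokenizes from the reader here
    } finally {
        contents.close(); // releases the file handle even if addDocument throws
    }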
URGENT: Help indexing large document set
Hi: I am trying to index 1M documents, in batches of 500. Each document has a unique text key, which is added as a Field.Keyword(name, value). For each batch of 500, I need to make sure I am not adding a document with a key that is already in the current index. To do this, I am calling IndexSearcher.docFreq for each document and deleting the document currently in the index with the same key:

    while (keyIter.hasNext()) {
        String objectID = (String) keyIter.next();
        term = new Term("key", objectID);
        int count = localSearcher.docFreq(term);
        if (count != 0) {
            localReader.delete(term);
        }
    }

Then I proceed with adding the documents. This turns out to be extremely expensive. I looked into the code and I see that TermInfosReader.get(Term term) is doing a linear lookup for each term, so as the index grows, the above operation degrades at a linear rate. And for each commit, we are doing a docFreq for 500 documents.

I also tried to create a BooleanQuery composed of 500 TermQueries and do one search per batch, and the performance didn't get better. And if the batch size increases to, say, 50,000, creating a BooleanQuery composed of 50,000 TermQuery instances may introduce huge memory costs.

Is there a better way to do this? Can TermInfosReader.get(Term term) be optimized to do a binary lookup instead of a linear walk? Of course that depends on whether the terms are stored in sorted order; are they?

This is very urgent, thanks in advance for all your help. -John
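One detail that may simplify this: if memory serves, IndexReader.delete(Term) returns the number of documents it deleted, so the docFreq() probe per key can be dropped entirely (a sketch):

    // Sketch: delete by term directly; the return value says how many docs went away.
    while (keyIter.hasNext()) {
        String objectID = (String) keyIter.next();
        int deleted = localReader.delete(new Term("key", objectID));
    }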
Re: URGENT: Help indexing large document set
Thanks Chuck! I missed the call getIndexOffset. I am profiling it again to pinpoint where the performance problem is. -John

On Tue, 23 Nov 2004 16:13:22 -0800, Chuck Williams [EMAIL PROTECTED] wrote:

Are you sure you have a performance problem with TermInfosReader.get(Term)? It looks to me like it scans sequentially only within a small buffer window (of size SegmentTermEnum.indexInterval) and that it uses binary search otherwise. See TermInfosReader.getIndexOffset(Term). Chuck

-----Original Message-----
From: John Wang [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, November 23, 2004 3:38 PM
To: [EMAIL PROTECTED]
Subject: URGENT: Help indexing large document set

Hi: I am trying to index 1M documents, in batches of 500. Each document has a unique text key, which is added as a Field.Keyword(name, value). For each batch of 500, I need to make sure I am not adding a document with a key that is already in the current index. To do this, I am calling IndexSearcher.docFreq for each document and deleting the document currently in the index with the same key:

    while (keyIter.hasNext()) {
        String objectID = (String) keyIter.next();
        term = new Term("key", objectID);
        int count = localSearcher.docFreq(term);
        if (count != 0) {
            localReader.delete(term);
        }
    }

Then I proceed with adding the documents. This turns out to be extremely expensive. I looked into the code and I see that TermInfosReader.get(Term term) is doing a linear lookup for each term, so as the index grows, the above operation degrades at a linear rate. And for each commit, we are doing a docFreq for 500 documents.

I also tried to create a BooleanQuery composed of 500 TermQueries and do one search per batch, and the performance didn't get better. And if the batch size increases to, say, 50,000, creating a BooleanQuery composed of 50,000 TermQuery instances may introduce huge memory costs.

Is there a better way to do this? Can TermInfosReader.get(Term term) be optimized to do a binary lookup instead of a linear walk? Of course that depends on whether the terms are stored in sorted order; are they?

This is very urgent, thanks in advance for all your help. -John
indexing benchmark
Hi folks: Is there an indexing benchmark somewhere? I see a search benchmark on the Lucene home site. Thanks -John
Re: Index in RAM - is it really worthy?
In my test, I have 12,900 documents. Each document is small: a few discrete fields (Keyword type) and one Text field containing only one sentence.

With both mergeFactor and maxMergeDocs set to 1000:
- using RAMDirectory, the indexing job took about 9.2 seconds
- not using RAMDirectory, the indexing job took about 122 seconds

I am not calling optimize. This is on Windows XP running Java 1.5. Is there something very wrong or different in my setup to cause such a big difference? Thanks -John

On Mon, 22 Nov 2004 09:23:40 -0800 (PST), Otis Gospodnetic [EMAIL PROTECTED] wrote:

For the Lucene book I wrote some test cases that compare FSDirectory and RAMDirectory. What I found was that with certain settings FSDirectory was almost as fast as RAMDirectory. Personally, I would push FSDirectory and hope that the OS and the filesystem do their share of work and caching for me before looking for ways to optimize my code. Otis

--- [EMAIL PROTECTED] wrote:

I did the following test: I created a RAM folder on my Red Hat box and copied c. 1GB of indexes there. I expected the queries to run much quicker. In reality it was sometimes even slower (sic!). Lucene has its own RAM disk functionality. If I implement it, would it bring any benefits? Thanks in advance, J.
lucene file locking question
Hi folks: My application builds a super-index around the Lucene index, i.e. it stores some additional information outside of Lucene. I am using my own locking outside of the Lucene index via the FileLock object in the JDK 1.4 nio package. My code does the following:

    FileLock lock = null;
    try {
        lock = myLockFileChannel.lock();
        // index into lucene
        // index additional information
    } finally {
        try {
            // commit lucene index by closing the IndexWriter instance
        } finally {
            if (lock != null) {
                lock.release();
            }
        }
    }

Now here is the weird thing: say I terminate the process in the middle of indexing and run the program again. I get a "Lock obtain timed out" exception, and as long as I delete the stale lock file, the index remains uncorrupted.

However, if I turn Lucene file locking off, since I have a lock outside it anyway (by doing:

    static {
        System.setProperty("disableLuceneLocks", "true");
    }

) and do the same thing, I instead get an unrecoverably corrupted index.

Does the Lucene lock really guarantee index integrity under this kind of abuse, or am I just getting lucky? If so, can someone shed some light on how? Thanks in advance -John
Re: lucene customized indexing
Hi Erik and Grant: Thanks for the replies; this is certainly encouraging. As suggested, I will post further such discussions to the dev list. Thanks -John

On Tue, 20 Jul 2004 15:37:35 -0400, Grant Ingersoll [EMAIL PROTECTED] wrote:

It seems to me the answer to this is not necessarily to open up the API, but to provide a mechanism for adding Writers and Readers to the indexing/searching process at the application level. These readers and writers could be passed to Lucene and used to read and write to separate files (thus not harming the index file format). They could be used to read/write an arbitrary amount of metadata at the term, document and/or index level without affecting the core Lucene index. Furthermore, previous versions would still work, because they would just ignore the new files, and the indexes could be used by other applications as well.

This is just a thought in its infancy, but it seems like it would solve the problem. Of course, the trick is figuring out how it fits into the API (or maybe it becomes a part of 2.0). Not sure if it is even feasible, but it seems like you could define interfaces for Readers and Writers that meet the requirements to do this. This may be better discussed on the dev list.

[EMAIL PROTECTED] 07/20/04 11:28AM

Hi: I am trying to store some database-like field values in Lucene. I have my own way of storing field values in a customized format. I guess my question is whether we can make the Reader/Writer classes, e.g. FieldReader, FieldWriter, DocumentReader/Writer, non-final? I have asked to make the Lucene API less restrictive many, many times but got no replies. Is this request feasible? Thanks -John
Re: speeding up lucene search
In general, yes. By splitting up a large index into smaller indices, you spread the search work across them; furthermore, that allows you to make your search distributable. -John

On Wed, 21 Jul 2004 13:00:28 +1000, Anson Lau [EMAIL PROTECTED] wrote:

Hello guys, What are some general techniques for making Lucene search faster? I'm thinking about splitting up the index. My current index has approx 1.8 million documents (small documents) and the index size is about 550MB. Am I likely to get much gain out of splitting it up and using a ParallelMultiSearcher? Most of my search queries search on 5-10 fields. Are there other things I should look at? Thanks to all, Anson
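A sketch of the split-index setup being discussed, using the ParallelMultiSearcher Anson mentions (the paths and the three-way split are assumptions; the constructor and search can throw IOException):

    // Sketch: search several smaller indices in parallel.
    Searchable[] parts = new Searchable[] {
        new IndexSearcher("/indexes/part1"),
        new IndexSearcher("/indexes/part2"),
        new IndexSearcher("/indexes/part3")
    };
    Searcher searcher = new ParallelMultiSearcher(parts); // one thread per sub-index
    Hits hits = searcher.search(query);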
lucene customized indexing
Hi: I am trying to store some database-like field values in Lucene. I have my own way of storing field values in a customized format. I guess my question is whether we can make the Reader/Writer classes, e.g. FieldReader, FieldWriter, DocumentReader/Writer, non-final? I have asked to make the Lucene API less restrictive many, many times but got no replies. Is this request feasible? Thanks -John
Re: lucene customized indexing
Hi Daniel: There are a few things I want to be able to customize in Lucene:

1) to be able to plug in a different similarity model (e.g. Bayesian, vector space, etc.)

2) to be able to store certain fields in their own format and provide corresponding readers. I may not want to store every field in the lexicon/inverted-index structure; I may have fields for which it doesn't make sense to store position or frequency information.

3) to be able to customize analyzers to add more information to the Token while doing tokenization.

Oleg mentioned the Haystack project. In the Haystack source code, they had to modify many Lucene classes to make them non-final in order to customize, and they make sure that during deployment their versions get loaded before the same classes in the Lucene .jar. It is cumbersome, but it is a Lucene restriction they had to live with. I believe there are many other users who feel the same way.

If I write some classes that derive from the Lucene API and they break, then it is my responsibility to fix them. I don't understand why it would add burden to the Lucene developers. Thanks -John

On Tue, 20 Jul 2004 17:56:26 +0200, Daniel Naber [EMAIL PROTECTED] wrote:

On Tuesday 20 July 2004 17:28, John Wang wrote: I have asked to make the Lucene API less restrictive many many many times but got no replies.

I suggest you just change it in your source and see if it works. Then you can still explain what exactly you did and why it's useful. From the developers' point of view, having things non-final means more stuff is exposed and making changes is more difficult (unless one accepts that derived classes may break with the next update). Regards, Daniel
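On point 1, the scoring model is one place Lucene does already expose an extension point: Similarity can be subclassed and installed globally or per searcher. A minimal sketch, assuming the 1.4-era API where DefaultSimilarity is the stock implementation:

    // Sketch: a flat-tf Similarity, closer to a pure boolean model.
    public class FlatSimilarity extends DefaultSimilarity {
        public float tf(float freq) {
            return freq > 0 ? 1.0f : 0.0f; // ignore how often a term occurs
        }
    }

    // installed globally:
    Similarity.setDefault(new FlatSimilarity());
    // or on one searcher:
    searcher.setSimilarity(new FlatSimilarity());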
Re: lucene customized indexing
On Tue, 20 Jul 2004 13:40:28 -0400, Erik Hatcher [EMAIL PROTECTED] wrote:

On Jul 20, 2004, at 12:12 PM, John Wang wrote: There are a few things I want to be able to customize in Lucene: [...] 3) to be able to customize analyzers to add more information to the Token while doing tokenization.

I have already provided my opinion on this one - I think it would be fine to allow Token to be public. I'll let others respond to the additional requests you've made.

Great, what processes need to be in place before this gets into the code base?

Oleg mentioned the Haystack project. In the Haystack source code, they had to modify many Lucene classes to make them non-final in order to customize. They make sure during deployment their versions get loaded before the same classes in the Lucene .jar. It is cumbersome, but it is a Lucene restriction they had to live with.

Wow - I didn't realize that they've made local changes. Did they post with requests for opening things up as you have? Did they submit patches with their local changes?

I believe there are many other users who feel the same way.

Then they should speak up :)

Well, I AM speaking up. So have some other people in earlier emails. But, like me, they are getting ignored. The Haystack changes were needed specifically because many classes are declared final and are not extensible.

If I write some classes that derive from the Lucene API and they break, then it is my responsibility to fix them. I don't understand why it would add burden to the Lucene developers.

Making things extensible for no good reason is asking for maintenance trouble later, when you need more control internally. Lucene has been well designed from the start, with extensibility only where it was needed. It has evolved to be more open in very specific areas after the performance impact was carefully weighed. Breaking is not really the concern with extensibility, I don't think. Real-world use cases are needed to show that changes need to be made.

I thought I gave many real-world use cases in the previous email, and they evidently also apply to the Haystack project. What other information do we need to provide? I don't want to diverge from the Lucene codebase like Haystack has done, but I may not have a choice. Thanks -John

Erik
Re: lucene customized indexing
That is exactly what they did, and that's probably what I will have to do. But that means we are diverging from the Lucene code base, and future fixes and enhancements will need to be synchronized; that may be a pain. -John

On Tue, 20 Jul 2004 20:03:05 +0200, Daniel Naber [EMAIL PROTECTED] wrote:

On Tuesday 20 July 2004 18:12, John Wang wrote: They make sure during deployment their versions get loaded before the same classes in the Lucene .jar.

I don't see why people cannot just make their own lucene.jar: just remove the final and recompile. After all, Lucene is Open Source. Regards, Daniel -- http://www.danielnaber.de
Re: Why is Field.java final?
I was running into similar problems with Lucene classes being final, in my case the Token class. I sent out an email but no one responded :( -John

On Sat, 10 Jul 2004 15:50:28 -0700, Kevin A. Burton [EMAIL PROTECTED] wrote:

I was going to create a new IDField class which just calls super(name, value, false, true, false), but noticed I was prevented because Field.java is final. Why is this? I can't see any harm in making it non-final... Kevin
Re: indexing help
Hi Grant: Thanks for the options. How likely are the Lucene file formats to change? Are there really no more options? :( Thanks -John

On Thu, 08 Jul 2004 08:50:44 -0400, Grant Ingersoll [EMAIL PROTECTED] wrote:

Hi John, The source code is available from CVS; make it non-final and do what you need to do. Of course, you may have a hard time finding help later if you aren't using something everyone else is and your solution doesn't work... :-)

If I understand correctly what you are trying to do, you already know all of the answers for indexing; you just want Lucene to do the retrieval side of the coin, correct? I suppose a crazy idea might be to write a program that took your info and output it in the Lucene file format, but that seems a bit like overkill. -Grant

[EMAIL PROTECTED] 07/07/04 07:37PM

Hi Doug: Thanks for the response! The solution you proposed is still a derivative of creating a dummy document stream. Taking the same example, java (5), lucene (6), VectorTokenStream would create a total of 11 Tokens whereas only 2 are necessary. Given many documents with many terms and frequencies, it would create many extra Token instances.

The reason I was looking at deriving from the Field class is that I could directly manipulate the FieldInfo by setting the frequency. But the class is final... Any other suggestions? Thanks -John

On Wed, 07 Jul 2004 14:20:24 -0700, Doug Cutting [EMAIL PROTECTED] wrote:

John Wang wrote: While Lucene tokenizes the words in the document, it counts the frequency and figures out the position; we are trying to bypass this stage. For each document, I have a set of words with a known frequency, e.g. java (5), lucene (6), etc. (I don't care about the position, so it can always be 0.) What I can do now is create a dummy document, e.g. "java java java java java lucene lucene lucene lucene lucene lucene", and pass it to Lucene. This seems hacky and cumbersome. Is there a better alternative? I browsed around in the source code but couldn't find anything.

Write an analyzer that returns terms with the appropriate distribution. For example:

    public class VectorTokenStream extends TokenStream {
      private String[] terms;
      private int[] freqs;
      private int term = -1;
      private int freq;

      public VectorTokenStream(String[] terms, int[] freqs) {
        this.terms = terms;
        this.freqs = freqs;
      }

      public Token next() {
        if (freq == 0) {
          term++;
          if (term >= terms.length)
            return null;
          freq = freqs[term];
        }
        freq--;
        return new Token(terms[term], 0, 0);
      }
    }

    Document doc = new Document();
    doc.add(Field.Text("content", ""));
    indexWriter.addDocument(doc, new Analyzer() {
      public TokenStream tokenStream(String field, Reader reader) {
        return new VectorTokenStream(new String[] {"java", "lucene"},
                                     new int[] {5, 6});
      }
    });

Too bad the Field class is final, otherwise I could derive from it and do something along those lines...

Extending Field would not help. That's why it's final. Doug
Re: indexing help
Hi Grant: I have something that extracts only the important words from a document along with their importance; furthermore, these important words may not be physically in the document, as they could be synonyms of some of the words in the document. So the output of this process for a document is a list of word/importance pairs, and I want to be able to query the document using only these words. I don't think Lucene has such a capability. Can you suggest what I can do with the analyzer process to achieve this without replicating words/tokens? Thanks -John

On Thu, 08 Jul 2004 11:10:07 -0400, Grant Ingersoll [EMAIL PROTECTED] wrote:

Hey John, Those are just options; I didn't say they were good ones! :-) I guess the real question is: what is the background of what you are trying to do? Presumably you have some other program that is generating frequencies for you; do you really need it in its current form? Can't the Lucene indexing engine act as a stand-in for this process, since your end result _should_ be the same? The Lucene Analyzer process is quite flexible; I bet you could even find a way to hook your existing tools into the Analyzer process. -Grant

[EMAIL PROTECTED] 07/08/04 10:42AM

Hi Grant: Thanks for the options. How likely are the Lucene file formats to change? Are there really no more options? :( Thanks -John

On Thu, 08 Jul 2004 08:50:44 -0400, Grant Ingersoll [EMAIL PROTECTED] wrote:

Hi John, The source code is available from CVS; make it non-final and do what you need to do. Of course, you may have a hard time finding help later if you aren't using something everyone else is and your solution doesn't work... :-)

If I understand correctly what you are trying to do, you already know all of the answers for indexing; you just want Lucene to do the retrieval side of the coin, correct? I suppose a crazy idea might be to write a program that took your info and output it in the Lucene file format, but that seems a bit like overkill. -Grant

[EMAIL PROTECTED] 07/07/04 07:37PM

Hi Doug: Thanks for the response! The solution you proposed is still a derivative of creating a dummy document stream. Taking the same example, java (5), lucene (6), VectorTokenStream would create a total of 11 Tokens whereas only 2 are necessary. Given many documents with many terms and frequencies, it would create many extra Token instances.

The reason I was looking at deriving from the Field class is that I could directly manipulate the FieldInfo by setting the frequency. But the class is final... Any other suggestions? Thanks -John

On Wed, 07 Jul 2004 14:20:24 -0700, Doug Cutting [EMAIL PROTECTED] wrote:

John Wang wrote: While Lucene tokenizes the words in the document, it counts the frequency and figures out the position; we are trying to bypass this stage. For each document, I have a set of words with a known frequency, e.g. java (5), lucene (6), etc. (I don't care about the position, so it can always be 0.) What I can do now is create a dummy document, e.g. "java java java java java lucene lucene lucene lucene lucene lucene", and pass it to Lucene. This seems hacky and cumbersome. Is there a better alternative? I browsed around in the source code but couldn't find anything.

Write an analyzer that returns terms with the appropriate distribution. For example:

    public class VectorTokenStream extends TokenStream {
      private String[] terms;
      private int[] freqs;
      private int term = -1;
      private int freq;

      public VectorTokenStream(String[] terms, int[] freqs) {
        this.terms = terms;
        this.freqs = freqs;
      }

      public Token next() {
        if (freq == 0) {
          term++;
          if (term >= terms.length)
            return null;
          freq = freqs[term];
        }
        freq--;
        return new Token(terms[term], 0, 0);
      }
    }

    Document doc = new Document();
    doc.add(Field.Text("content", ""));
    indexWriter.addDocument(doc, new Analyzer() {
      public TokenStream tokenStream(String field, Reader reader) {
        return new VectorTokenStream(new String[] {"java", "lucene"},
                                     new int[] {5, 6});
      }
    });

Too bad the Field class is final, otherwise I could derive from it and do something along those lines...

Extending Field would not help. That's why it's final. Doug
Re: indexing help
Thanks Doug. I will do just that. Just for my education, can you maybe elaborate on the "implement an IndexReader that delivers a synthetic index" approach? Thanks in advance -John

On Thu, 08 Jul 2004 10:01:59 -0700, Doug Cutting [EMAIL PROTECTED] wrote:

John Wang wrote: The solution you proposed is still a derivative of creating a dummy document stream. Taking the same example, java (5), lucene (6), VectorTokenStream would create a total of 11 Tokens whereas only 2 are necessary.

That's easy to fix. We just need to reuse the token:

    public class VectorTokenStream extends TokenStream {
      private String[] terms;
      private int[] freqs;
      private int term = -1;
      private int freq = 0;
      private Token token;

      public VectorTokenStream(String[] terms, int[] freqs) {
        this.terms = terms;
        this.freqs = freqs;
      }

      public Token next() {
        if (freq == 0) {
          term++;
          if (term >= terms.length)
            return null;
          token = new Token(terms[term], 0, 0);
          freq = freqs[term];
        }
        freq--;
        return token;
      }
    }

Then only two tokens are created, as you desire.

If you for some reason don't want to create a dummy document stream, then you could instead implement an IndexReader that delivers a synthetic index for a single document. Then use IndexWriter.addIndexes() to turn this into a real, FSDirectory-based index. However, that would be a lot more work and only very marginally faster, so I'd stick with the approach I've outlined above. (Note: this code has not been compiled or run. It may have bugs.) Doug
Re: indexing help
Hi Doug: Thanks for the response! The solution you proposed is still a derivative of creating a dummy document stream. Taking the same example, java (5), lucene (6), VectorTokenStream would create a total of 11 Tokens whereas only 2 are necessary. Given many documents with many terms and frequencies, it would create many extra Token instances.

The reason I was looking at deriving from the Field class is that I could directly manipulate the FieldInfo by setting the frequency. But the class is final... Any other suggestions? Thanks -John

On Wed, 07 Jul 2004 14:20:24 -0700, Doug Cutting [EMAIL PROTECTED] wrote:

John Wang wrote: While Lucene tokenizes the words in the document, it counts the frequency and figures out the position; we are trying to bypass this stage. For each document, I have a set of words with a known frequency, e.g. java (5), lucene (6), etc. (I don't care about the position, so it can always be 0.) What I can do now is create a dummy document, e.g. "java java java java java lucene lucene lucene lucene lucene lucene", and pass it to Lucene. This seems hacky and cumbersome. Is there a better alternative? I browsed around in the source code but couldn't find anything.

Write an analyzer that returns terms with the appropriate distribution. For example:

    public class VectorTokenStream extends TokenStream {
      private String[] terms;
      private int[] freqs;
      private int term = -1;
      private int freq;

      public VectorTokenStream(String[] terms, int[] freqs) {
        this.terms = terms;
        this.freqs = freqs;
      }

      public Token next() {
        if (freq == 0) {
          term++;
          if (term >= terms.length)
            return null;
          freq = freqs[term];
        }
        freq--;
        return new Token(terms[term], 0, 0);
      }
    }

    Document doc = new Document();
    doc.add(Field.Text("content", ""));
    indexWriter.addDocument(doc, new Analyzer() {
      public TokenStream tokenStream(String field, Reader reader) {
        return new VectorTokenStream(new String[] {"java", "lucene"},
                                     new int[] {5, 6});
      }
    });

Too bad the Field class is final, otherwise I could derive from it and do something along those lines...

Extending Field would not help. That's why it's final. Doug