IndexWriter.addIndexes(Directory[] dirs)
IndexWriter's addIndexes(Directory[] dirs) method optimizes the index before and after the operation. Some notes about this:

1) Adding sub-indexes to a large index can take a long time because of the double optimization.
2) It breaks the IndexWriter.maxMergeDocs logic, because optimize() merges all data into a single-segment index.

I suggest adding a new method with a boolean parameter to optionally specify whether the index should be optimized. There is a similar method, addIndexes(IndexReader[] readers), in IndexWriter that takes an array of IndexReaders, but I don't know how it could be modified to provide the same optional behavior.

Patch attached here to discuss it first (should I post it directly to JIRA?)

--
regards,
Volodymyr Bychkoviak

Index: D:/programming/projects/componence/lucene-dev/src/java/org/apache/lucene/index/IndexWriter.java
===================================================================
--- D:/programming/projects/componence/lucene-dev/src/java/org/apache/lucene/index/IndexWriter.java (revision 327185)
+++ D:/programming/projects/componence/lucene-dev/src/java/org/apache/lucene/index/IndexWriter.java (working copy)
@@ -519,17 +519,21 @@
   }
 
   /** Merges all segments from an array of indexes into this index.
-   *
-   * This may be used to parallelize batch indexing. A large document
-   * collection can be broken into sub-collections. Each sub-collection can be
-   * indexed in parallel, on a different thread, process or machine. The
-   * complete index can then be created by merging sub-collection indexes
-   * with this method.
-   *
-   * After this completes, the index is optimized. */
-  public synchronized void addIndexes(Directory[] dirs)
+   *
+   * This may be used to parallelize batch indexing. A large document
+   * collection can be broken into sub-collections. Each sub-collection can be
+   * indexed in parallel, on a different thread, process or machine. The
+   * complete index can then be created by merging sub-collection indexes
+   * with this method.
+   *
+   * Also optionally index can be optimized before and
+   * after adding new data.
+   */
+  public synchronized void addIndexes(Directory[] dirs, boolean optimize)
     throws IOException {
-    optimize();          // start with zero or 1 seg
+    if (optimize) {
+      optimize();        // start with zero or 1 seg
+    }
 
     int start = segmentInfos.size();
 
@@ -550,7 +554,25 @@
       }
     }
 
-    optimize();          // final cleanup
+    if (optimize) {
+      optimize();        // final cleanup
+    } else {
+      maybeMergeSegments();
+    }
+  }
+
+  /** Merges all segments from an array of indexes into this index.
+   *
+   * This may be used to parallelize batch indexing. A large document
+   * collection can be broken into sub-collections. Each sub-collection can be
+   * indexed in parallel, on a different thread, process or machine. The
+   * complete index can then be created by merging sub-collection indexes
+   * with this method.
+   *
+   * After this completes, the index is optimized. */
+  public synchronized void addIndexes(Directory[] dirs)
+    throws IOException {
+    addIndexes(dirs, false);
+  }
 
   /** Merges the provided indexes into this index.

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Character encoding per index.
Hello list,

I'm looking for a way to change the character encoding per index. It feels silly to store Chinese characters in 3 bytes using UTF-8 when it is possible to do it in 2 bytes using UTF-16.

By quickly hacking IndexInput and IndexOutput I got it all running in UTF-16, but this is not good enough, since I have other indexes that are better off encoded in UTF-8. The character encoding in Lucene today is quite static: in order to select an encoding it seems I would have to do some major refactoring of the project, passing a character codec from my analyzer (or perhaps IndexWriter/Reader) all the way down to IndexInput/Output via TermVector/Info, etc.

Can someone think of a better way to set the character encoding per index? Or perhaps some other thought?

-- karl
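Karl's size claim is easy to check from plain Java: CJK characters in the Basic Multilingual Plane take 3 bytes in UTF-8 but 2 in UTF-16. A minimal sketch (UTF-16BE is used so the 2-byte byte-order mark that the plain "UTF-16" charset prepends doesn't skew the counts):

```java
import java.io.UnsupportedEncodingException;

public class EncodingSize {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String chinese = "\u4e2d\u6587"; // two CJK characters ("Chinese")
        System.out.println(chinese.getBytes("UTF-8").length);    // 6 -- 3 bytes per char
        System.out.println(chinese.getBytes("UTF-16BE").length); // 4 -- 2 bytes per char

        // The flip side: ASCII-heavy terms double in size under UTF-16.
        String ascii = "lucene";
        System.out.println(ascii.getBytes("UTF-8").length);      // 6
        System.out.println(ascii.getBytes("UTF-16BE").length);   // 12
    }
}
```

This is why a single global encoding is a poor fit: the better choice genuinely differs per index.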
Re: Character encoding per index.
12 dec 2005 kl. 16.40 skrev karl wettin:

> Hello list,
>
> I'm looking for a way to change the character encoding per index. It feels
> silly to store Chinese characters in 3 bytes using UTF-8 when it is possible
> to do it in 2 bytes using UTF-16.
>
> By quickly hacking IndexInput and IndexOutput I got it all running in UTF-16,
> but this is not good enough, since I have other indexes that are better off
> encoded in UTF-8. The character encoding in Lucene today is quite static: in
> order to select an encoding it seems I would have to do some major
> refactoring of the project, passing a character codec from my analyzer (or
> perhaps IndexWriter/Reader) all the way down to IndexInput/Output via
> TermVector/Info, etc.
>
> Can someone think of a better way to set the character encoding per index?
> Or perhaps some other thought?

My current thought is to extend Directory (CharacterEncodingAwareDirectory or so) and all implementations of it, intercepting the create/openFile methods to add a character encoding strategy to the IndexInput/Output.

Is there a reason for the write/readCharacters methods in IndexInput/Output to be final?

-- karl
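Intercepting file creation to attach an encoding strategy is essentially the decorator pattern. The sketch below uses hypothetical stand-in interfaces — it does not reproduce the real Directory/IndexOutput signatures — purely to show where a per-index charset could plug in:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

// Hypothetical stand-ins -- NOT the real Lucene interfaces -- reduced to
// the one call path the idea touches.
interface CharOutput {
    void writeChars(String s) throws IOException;
}

interface Store {
    CharOutput createFile(String name) throws IOException;
}

// A Store whose files encode characters with a charset chosen per index,
// instead of one encoding hard-wired into the I/O layer.
class EncodingAwareStore implements Store {
    private final String charset;
    final ByteArrayOutputStream lastFile = new ByteArrayOutputStream();

    EncodingAwareStore(String charset) {
        this.charset = charset;
    }

    public CharOutput createFile(String name) throws IOException {
        lastFile.reset();
        return new CharOutput() {
            public void writeChars(String s) throws IOException {
                // The encoding strategy is applied here, at file-creation
                // time, so callers above this layer never see it.
                lastFile.write(s.getBytes(charset));
            }
        };
    }
}

public class PerIndexEncoding {
    public static void main(String[] args) throws IOException {
        EncodingAwareStore utf8 = new EncodingAwareStore("UTF-8");
        EncodingAwareStore utf16 = new EncodingAwareStore("UTF-16BE");
        utf8.createFile("terms").writeChars("\u4e2d\u6587");
        utf16.createFile("terms").writeChars("\u4e2d\u6587");
        System.out.println(utf8.lastFile.size());  // 6 bytes under UTF-8
        System.out.println(utf16.lastFile.size()); // 4 bytes under UTF-16BE
    }
}
```

The point of routing the charset through createFile rather than through the analyzer is that an index's on-disk encoding stays a property of its Directory, matching Karl's "per index" requirement.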
Re: Character encoding per index.
On Dec 12, 2005, at 10:04 AM, karl wettin wrote:

> 12 dec 2005 kl. 16.40 skrev karl wettin:
>
>> I'm looking for a way to change the character encoding per index. It feels
>> silly to store Chinese characters in 3 bytes using UTF-8 when it is
>> possible to do it in 2 bytes using UTF-16.
>>
>> By quickly hacking IndexInput and IndexOutput I got it all running in
>> UTF-16, but this is not good enough, since I have other indexes that are
>> better off encoded in UTF-8. The character encoding in Lucene today is
>> quite static: in order to select an encoding it seems I would have to do
>> some major refactoring of the project, passing a character codec from my
>> analyzer (or perhaps IndexWriter/Reader) all the way down to
>> IndexInput/Output via TermVector/Info, etc.

On a side note, this is another issue that I believe can be addressed by using a byte count instead of a char count at the head of Lucene's Strings. A byte-based TermBuffer needn't care what encoding the Strings are in.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
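The byte-count idea is worth spelling out: if the length prefix counts bytes rather than chars, a reader can copy or skip a string without decoding it, so the encoding becomes invisible to that layer. A toy sketch (a plain int prefix stands in for the variable-length int Lucene actually uses):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

public class ByteCountPrefix {

    // Write s as <byte count><encoded bytes>. The count describes the bytes
    // that follow, whatever the charset, so readers need not decode.
    static void writeString(DataOutput out, String s, String charset)
            throws IOException {
        byte[] b = s.getBytes(charset);
        out.writeInt(b.length); // toy stand-in for Lucene's VInt
        out.write(b);
    }

    // Skip one string without decoding it -- the payoff of a byte count;
    // with a char count the bytes would have to be decoded to find the end.
    static void skipString(DataInput in) throws IOException {
        in.skipBytes(in.readInt());
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        writeString(out, "\u4e2d\u6587", "UTF-8"); // 4-byte count + 6 bytes
        writeString(out, "elephant", "UTF-8");     // 4-byte count + 8 bytes

        DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(buf.toByteArray()));
        skipString(in);                     // hop over the CJK term untouched
        byte[] b = new byte[in.readInt()];
        in.readFully(b);
        System.out.println(new String(b, "UTF-8")); // prints "elephant"
    }
}
```

A byte-based buffer built this way would work identically whether the index were UTF-8 or UTF-16, which is exactly Marvin's point.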
RE: IndexWriter.addIndexes(Directory[] dirs)
Volodymyr,

I tried this patch out, and unfortunately it doesn't appear to work for me. Is there something I missed? I'll try attaching my JUnit test case: it passes when the code is unpatched, but fails on the final assertion expecting 2 hits (on line 63) when I use the patched IndexWriter.java.

Thanks,
Kevin

-----Original Message-----
From: Volodymyr Bychkoviak [mailto:[EMAIL PROTECTED]
Sent: Monday, December 12, 2005 5:51 AM
To: java-dev@lucene.apache.org
Subject: IndexWriter.addIndexes(Directory[] dirs)

IndexWriter's addIndexes(Directory[] dirs) method optimizes the index before and after the operation. Some notes about this:

1) Adding sub-indexes to a large index can take a long time because of the double optimization.
2) It breaks the IndexWriter.maxMergeDocs logic, because optimize() merges all data into a single-segment index.

I suggest adding a new method with a boolean parameter to optionally specify whether the index should be optimized. There is a similar method, addIndexes(IndexReader[] readers), in IndexWriter that takes an array of IndexReaders, but I don't know how it could be modified to provide the same optional behavior.

Patch attached here to discuss it first (should I post it directly to JIRA?)

--
regards,
Volodymyr Bychkoviak
RE: IndexWriter.addIndexes(Directory[] dirs)
I see it stripped my attachment off. Here's the code:

import junit.framework.TestCase;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class AddIndexesTest extends TestCase {

  public AddIndexesTest(String name) {
    super(name);
  }

  public void testAddIndexes() throws Exception {
    {
      Directory dir1 = FSDirectory.getDirectory("/dev/searchdata/addIndexesTest1", true);
      IndexWriter writer1 = new IndexWriter(dir1, new StandardAnalyzer(), true);
      Document doc1 = new Document();
      doc1.add(Field.UnIndexed("ID", "id1"));
      doc1.add(Field.UnStored("f", "some words"));
      writer1.addDocument(doc1);
      writer1.close();
      dir1.close();

      IndexSearcher searcher = new IndexSearcher("/dev/searchdata/addIndexesTest1");
      Hits hits = searcher.search(new TermQuery(new Term("f", "words")));
      assertEquals(1, hits.length());
      searcher.close();
    }
    {
      Directory dir2 = FSDirectory.getDirectory("/dev/searchdata/addIndexesTest2", true);
      IndexWriter writer2 = new IndexWriter(dir2, new StandardAnalyzer(), true);
      Document doc1 = new Document();
      doc1.add(Field.UnIndexed("ID", "id2"));
      doc1.add(Field.UnStored("f", "some other words"));
      writer2.addDocument(doc1);
      writer2.close();
      dir2.close();

      IndexSearcher searcher = new IndexSearcher("/dev/searchdata/addIndexesTest2");
      Hits hits = searcher.search(new TermQuery(new Term("f", "words")));
      assertEquals(1, hits.length());
      searcher.close();
    }

    Directory dir = FSDirectory.getDirectory("/dev/searchdata/addIndexesTest1", false);
    IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), false);
    writer.addIndexes(new Directory[] {
        FSDirectory.getDirectory("/dev/searchdata/addIndexesTest2", false) });
    writer.close();
    dir.close();

    IndexSearcher searcher = new IndexSearcher("/dev/searchdata/addIndexesTest1");
    Hits hits = searcher.search(new TermQuery(new Term("f", "words")));
    assertEquals(2, hits.length());
    searcher.close();
  }
}

-----Original Message-----
From: Kevin Oliver
Sent: Monday, December 12, 2005 2:53 PM
To: java-dev@lucene.apache.org
Subject: RE: IndexWriter.addIndexes(Directory[] dirs)

Volodymyr,

I tried this patch out, and unfortunately it doesn't appear to work for me. Is there something I missed? I'll try attaching my JUnit test case: it passes when the code is unpatched, but fails on the final assertion expecting 2 hits (on line 63) when I use the patched IndexWriter.java.

Thanks,
Kevin

-----Original Message-----
From: Volodymyr Bychkoviak [mailto:[EMAIL PROTECTED]
Sent: Monday, December 12, 2005 5:51 AM
To: java-dev@lucene.apache.org
Subject: IndexWriter.addIndexes(Directory[] dirs)

IndexWriter's addIndexes(Directory[] dirs) method optimizes the index before and after the operation. Some notes about this:

1) Adding sub-indexes to a large index can take a long time because of the double optimization.
2) It breaks the IndexWriter.maxMergeDocs logic, because optimize() merges all data into a single-segment index.

I suggest adding a new method with a boolean parameter to optionally specify whether the index should be optimized. There is a similar method, addIndexes(IndexReader[] readers), in IndexWriter that takes an array of IndexReaders, but I don't know how it could be modified to provide the same optional behavior.

Patch attached here to discuss it first (should I post it directly to JIRA?)

--
regards,
Volodymyr Bychkoviak
Directory Implementation: Java Content Repository
Hi,

I've implemented a Directory (org.apache.lucene.store.Directory) using the Java Content Repository API (http://www.jcp.org/en/jsr/detail?id=170). With it, indexes can be stored in any persistence technology supported by a Java Content Repository implementation. For example, Jackrabbit (the reference implementation - http://incubator.apache.org/jackrabbit/) currently supports relational databases (JDBC, Hibernate, OJB), the file system, Berkeley DB Java Edition, and more.

I wish to contribute the code to the community (if someone is interested in it). Where do I begin?

Regards,
Nicolas Bélisle
Re: Directory Implementation: Java Content Repository
Nicolas,

This is great news! The first step to getting your code into the Lucene repository is to organize it in a manner similar to the other contributed projects; I think this one fits nicely as contrib/jcr. License the code under the Apache license, get the build process working (hopefully with unit tests, using the infrastructure already in place for the contrib projects), and finally attach it to a JIRA issue as a .zip file.

Thanks!
Erik

On Dec 12, 2005, at 6:20 PM, Nicolas Belisle wrote:

> Hi,
>
> I've implemented a Directory (org.apache.lucene.store.Directory) using the
> Java Content Repository API (http://www.jcp.org/en/jsr/detail?id=170). With
> it, indexes can be stored in any persistence technology supported by a Java
> Content Repository implementation. For example, Jackrabbit (the reference
> implementation - http://incubator.apache.org/jackrabbit/) currently supports
> relational databases (JDBC, Hibernate, OJB), the file system, Berkeley DB
> Java Edition, and more.
>
> I wish to contribute the code to the community (if someone is interested in
> it). Where do I begin?
>
> Regards,
> Nicolas Bélisle
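For concreteness, a contrib module of that era would be laid out roughly as follows. The directory and class names below are illustrative guesses by analogy with the existing contrib modules, not a prescribed structure:

```
contrib/jcr/
  build.xml        # hooks into the shared contrib build infrastructure
  README.txt
  src/java/org/apache/lucene/store/jcr/JCRDirectory.java
  src/test/org/apache/lucene/store/jcr/TestJCRDirectory.java
```

The patch attached to JIRA would then be a .zip of this tree, as Erik describes.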
[jira] Commented: (LUCENE-323) [PATCH] MultiFieldQueryParser and BooleanQuery do not provide adequate support for queries across multiple fields
[ http://issues.apache.org/jira/browse/LUCENE-323?page=comments#action_12360275 ]

Yonik Seeley commented on LUCENE-323:
-------------------------------------

Thanks for the changes Chuck! Your patch was backwards, BTW :-)

I haven't had a chance to run any benchmarks, but I committed this because it also fixes a bug. Since it also looks like the uses of /2 and *2 were all unsigned, I replaced them with shifts. The multiply doesn't matter much, but IDIV is horribly slow (between 20 and 80 cycles, depending on the arch and operands). Not that I thought it was a bottleneck, but I have trouble avoiding that "root of all evil", premature optimization ;-)

> [PATCH] MultiFieldQueryParser and BooleanQuery do not provide adequate
> support for queries across multiple fields
> ----------------------------------------------------------------------
>
>          Key: LUCENE-323
>          URL: http://issues.apache.org/jira/browse/LUCENE-323
>      Project: Lucene - Java
>         Type: Bug
>   Components: QueryParser
>     Versions: 1.4
>  Environment: Operating System: Windows XP
>               Platform: PC
>     Reporter: Chuck Williams
>     Assignee: Lucene Developers
>  Attachments: DisjunctionMaxQuery.java, DisjunctionMaxScorer.java,
>               TestDisjunctionMaxQuery.java, TestMaxDisjunctionQuery.java,
>               TestRanking.zip, TestRanking.zip, TestRanking.zip,
>               WikipediaSimilarity.java, WikipediaSimilarity.java,
>               WikipediaSimilarity.java, dms.tar.gz
>
> The attached test case demonstrates this problem and provides a fix:
>
> 1. Use a custom similarity to eliminate all tf and idf effects, just to
>    isolate what is being tested.
> 2. Create two documents doc1 and doc2, each with two fields title and
>    description. doc1 has "elephant" in title and "elephant" in description.
>    doc2 has "elephant" in title and "albino" in description.
> 3. Express a query for "albino elephant" against both fields. Problems:
>    a. MultiFieldQueryParser won't recognize either document as containing
>       both terms, due to the way it expands the query across fields.
>    b. Expressing the query as "title:albino description:albino
>       title:elephant description:elephant" will score both documents
>       equivalently, since each matches two query terms.
> 4. Comparison to MaxDisjunctionQuery and my method for expanding queries
>    across fields. Using notation where ( ) represents a BooleanQuery and
>    ( | ) represents a MaxDisjunctionQuery, "albino elephant" expands to:
>      ( (title:albino | description:albino)
>        (title:elephant | description:elephant) )
>    This will recognize that doc2 has both terms matched while doc1 has only
>    one term matched, scoring doc2 over doc1.
>
>    Refinement note: the actual expansion for "albino elephant" that I use is:
>      ( (title:albino | description:albino)~0.1
>        (title:elephant | description:elephant)~0.1 )
>    This causes the score of each MaxDisjunctionQuery to be the score of the
>    highest-scoring MDQ subclause plus 0.1 times the sum of the scores of the
>    other MDQ subclauses. Thus, doc1 gets some credit for also having
>    "elephant" in the description, but only 1/10 as much as doc2 gets for
>    covering another query term in its description. If doc3 has "elephant" in
>    title and both "albino" and "elephant" in the description, then with the
>    actual refined expansion it gets the highest score of all (whereas with
>    pure max, without the 0.1, it would get the same score as doc2).
>
> In real apps, tf's and idf's also come into play of course, but can affect
> these either way (i.e., mitigate this fundamental problem or exacerbate it).

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
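Chuck's refined scoring rule can be checked numerically. The sketch below is not the actual DisjunctionMaxScorer — it is just the arithmetic the issue describes, under the stated custom similarity where every term match contributes exactly 1.0:

```java
public class TiebreakDemo {

    // Score of one (title:t | description:t)~tiebreak clause: the maximum
    // subclause score, plus tiebreak times the sum of the other subscores.
    static double maxDisjunction(double tiebreak, double[] subScores) {
        double max = 0.0, sum = 0.0;
        for (int i = 0; i < subScores.length; i++) {
            max = Math.max(max, subScores[i]);
            sum += subScores[i];
        }
        return max + tiebreak * (sum - max);
    }

    public static void main(String[] args) {
        double t = 0.1;
        // Subscores per clause are {title, description}; a match scores 1.
        // doc1: "elephant" in title and description, no "albino".
        double doc1 = maxDisjunction(t, new double[] {0, 0})   // albino clause
                    + maxDisjunction(t, new double[] {1, 1});  // elephant clause
        // doc2: "elephant" in title, "albino" in description.
        double doc2 = maxDisjunction(t, new double[] {0, 1})
                    + maxDisjunction(t, new double[] {1, 0});
        // doc3: "elephant" in title, "albino" and "elephant" in description.
        double doc3 = maxDisjunction(t, new double[] {0, 1})
                    + maxDisjunction(t, new double[] {1, 1});

        System.out.println(doc1); // 1.1
        System.out.println(doc2); // 2.0
        System.out.println(doc3); // 2.1 -- doc3 > doc2 > doc1, as described
    }
}
```

With tiebreak 0, doc2 and doc3 would tie at 2.0; with tiebreak 1.0 the clause degenerates to a plain sum and doc1 ties doc2 at 2.0 - which is exactly the "albino elephant" problem the issue reports for BooleanQuery.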