IndexWriter.addIndexes(Directory[] dirs)
IndexWriter's addIndexes(Directory[] dirs) method optimizes the index before and after the operation. Some notes about this:

1) Adding sub-indexes to a large index can take a long time because of the double optimization.
2) It breaks the IndexWriter.maxMergeDocs logic, because optimize() merges all data into a single-segment index.

I suggest adding a new method with a boolean parameter to optionally specify whether the index should be optimized. There is a similar method, addIndexes(IndexReader[] readers), in IndexWriter that takes an array of IndexReaders, but I don't know how it could be modified to provide the same optional behavior.

Patch attached here to discuss it first (should I post it directly to JIRA?)

--
regards,
Volodymyr Bychkoviak

Index: D:/programming/projects/componence/lucene-dev/src/java/org/apache/lucene/index/IndexWriter.java
===================================================================
--- D:/programming/projects/componence/lucene-dev/src/java/org/apache/lucene/index/IndexWriter.java (revision 327185)
+++ D:/programming/projects/componence/lucene-dev/src/java/org/apache/lucene/index/IndexWriter.java (working copy)
@@ -519,17 +519,21 @@
   }
 
   /** Merges all segments from an array of indexes into this index.
-   *
-   * This may be used to parallelize batch indexing. A large document
-   * collection can be broken into sub-collections. Each sub-collection can be
-   * indexed in parallel, on a different thread, process or machine. The
-   * complete index can then be created by merging sub-collection indexes
-   * with this method.
-   *
-   * After this completes, the index is optimized. */
-  public synchronized void addIndexes(Directory[] dirs)
+   *
+   * This may be used to parallelize batch indexing. A large document
+   * collection can be broken into sub-collections. Each sub-collection can be
+   * indexed in parallel, on a different thread, process or machine. The
+   * complete index can then be created by merging sub-collection indexes
+   * with this method.
+   *
+   * Also optionally index can be optimized before and
+   * after adding new data.
+   */
+  public synchronized void addIndexes(Directory[] dirs, boolean optimize)
     throws IOException {
-    optimize();          // start with zero or 1 seg
+    if (optimize) {
+      optimize();        // start with zero or 1 seg
+    }
 
     int start = segmentInfos.size();
 
@@ -550,7 +554,25 @@
       }
     }
 
-    optimize();          // final cleanup
+    if (optimize) {
+      optimize();        // final cleanup
+    } else {
+      maybeMergeSegments();
+    }
+  }
+
+  /** Merges all segments from an array of indexes into this index.
+   *
+   * This may be used to parallelize batch indexing. A large document
+   * collection can be broken into sub-collections. Each sub-collection can be
+   * indexed in parallel, on a different thread, process or machine. The
+   * complete index can then be created by merging sub-collection indexes
+   * with this method.
+   *
+   * After this completes, the index is optimized. */
+  public synchronized void addIndexes(Directory[] dirs)
+    throws IOException {
+    addIndexes(dirs, false);
+  }
 
   /** Merges the provided indexes into this index.

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Character encoding per index.
Hello list,

I'm looking for a way to change the character encoding per index. It feels silly to store Chinese characters in 3 bytes using UTF-8 when it is possible to do it in 2 bytes using UTF-16.

By quickly hacking IndexInput and IndexOutput I got it all running in UTF-16, but this is not good enough, since I have other indexes that are better off encoded in UTF-8. The character encoding in Lucene today is quite static: in order to select an encoding it seems I would have to do some major refactoring of the project, passing a character codec from my analyzer (or perhaps IndexWriter/Reader) all the way down to IndexInput/Output via TermVector/Info, etc.

Can someone think of a better way to set the character encoding per index? Or perhaps some other thought?

-- karl
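Karl's size claim is easy to check from plain Java: CJK characters in the Basic Multilingual Plane take 3 bytes in UTF-8 but 2 in UTF-16. A minimal sketch (UTF-16BE is used so the 2-byte byte-order mark that the plain "UTF-16" charset prepends doesn't skew the counts):

```java
import java.io.UnsupportedEncodingException;

public class EncodingSize {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String chinese = "\u4e2d\u6587"; // two CJK characters ("Chinese")
        System.out.println(chinese.getBytes("UTF-8").length);    // 6 -- 3 bytes per char
        System.out.println(chinese.getBytes("UTF-16BE").length); // 4 -- 2 bytes per char

        // The flip side: ASCII-heavy terms double in size under UTF-16.
        String ascii = "lucene";
        System.out.println(ascii.getBytes("UTF-8").length);      // 6
        System.out.println(ascii.getBytes("UTF-16BE").length);   // 12
    }
}
```

This is why a single global encoding is a poor fit: the better choice genuinely differs per index.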
Re: Character encoding per index.
12 dec 2005 kl. 16.40 skrev karl wettin:

> Hello list,
>
> I'm looking for a way to change the character encoding per index. It feels
> silly to store Chinese characters in 3 bytes using UTF-8 when it is possible
> to do it in 2 bytes using UTF-16.
>
> By quickly hacking IndexInput and IndexOutput I got it all running in UTF-16,
> but this is not good enough, since I have other indexes that are better off
> encoded in UTF-8. The character encoding in Lucene today is quite static: in
> order to select an encoding it seems I would have to do some major
> refactoring of the project, passing a character codec from my analyzer (or
> perhaps IndexWriter/Reader) all the way down to IndexInput/Output via
> TermVector/Info, etc.
>
> Can someone think of a better way to set the character encoding per index?
> Or perhaps some other thought?

My current thought is to extend Directory (CharacterEncodingAwareDirectory or so) and all implementations of it, intercepting the create/openFile methods to add a character encoding strategy to the IndexInput/Output.

Is there a reason for the write/readCharacters methods in IndexInput/Output to be final?

-- karl
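Intercepting file creation to attach an encoding strategy is essentially the decorator pattern. The sketch below uses hypothetical stand-in interfaces — it does not reproduce the real Directory/IndexOutput signatures — purely to show where a per-index charset could plug in:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

// Hypothetical stand-ins -- NOT the real Lucene interfaces -- reduced to
// the one call path the idea touches.
interface CharOutput {
    void writeChars(String s) throws IOException;
}

interface Store {
    CharOutput createFile(String name) throws IOException;
}

// A Store whose files encode characters with a charset chosen per index,
// instead of one encoding hard-wired into the I/O layer.
class EncodingAwareStore implements Store {
    private final String charset;
    final ByteArrayOutputStream lastFile = new ByteArrayOutputStream();

    EncodingAwareStore(String charset) {
        this.charset = charset;
    }

    public CharOutput createFile(String name) throws IOException {
        lastFile.reset();
        return new CharOutput() {
            public void writeChars(String s) throws IOException {
                // The encoding strategy is applied here, at file-creation
                // time, so callers above this layer never see it.
                lastFile.write(s.getBytes(charset));
            }
        };
    }
}

public class PerIndexEncoding {
    public static void main(String[] args) throws IOException {
        EncodingAwareStore utf8 = new EncodingAwareStore("UTF-8");
        EncodingAwareStore utf16 = new EncodingAwareStore("UTF-16BE");
        utf8.createFile("terms").writeChars("\u4e2d\u6587");
        utf16.createFile("terms").writeChars("\u4e2d\u6587");
        System.out.println(utf8.lastFile.size());  // 6 bytes under UTF-8
        System.out.println(utf16.lastFile.size()); // 4 bytes under UTF-16BE
    }
}
```

The point of routing the charset through createFile rather than through the analyzer is that an index's on-disk encoding stays a property of its Directory, matching Karl's "per index" requirement.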
Re: Character encoding per index.
On Dec 12, 2005, at 10:04 AM, karl wettin wrote:

> 12 dec 2005 kl. 16.40 skrev karl wettin:
>
>> I'm looking for a way to change the character encoding per index. It feels
>> silly to store Chinese characters in 3 bytes using UTF-8 when it is
>> possible to do it in 2 bytes using UTF-16.
>>
>> By quickly hacking IndexInput and IndexOutput I got it all running in
>> UTF-16, but this is not good enough, since I have other indexes that are
>> better off encoded in UTF-8. The character encoding in Lucene today is
>> quite static: in order to select an encoding it seems I would have to do
>> some major refactoring of the project, passing a character codec from my
>> analyzer (or perhaps IndexWriter/Reader) all the way down to
>> IndexInput/Output via TermVector/Info, etc.

On a side note, this is another issue that I believe can be addressed by using a byte count instead of a char count at the head of Lucene's Strings. A byte-based TermBuffer needn't care what encoding the Strings are in.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
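The byte-count idea is worth spelling out: if the length prefix counts bytes rather than chars, a reader can copy or skip a string without decoding it, so the encoding becomes invisible to that layer. A toy sketch (a plain int prefix stands in for the variable-length int Lucene actually uses):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

public class ByteCountPrefix {

    // Write s as <byte count><encoded bytes>. The count describes the bytes
    // that follow, whatever the charset, so readers need not decode.
    static void writeString(DataOutput out, String s, String charset)
            throws IOException {
        byte[] b = s.getBytes(charset);
        out.writeInt(b.length); // toy stand-in for Lucene's VInt
        out.write(b);
    }

    // Skip one string without decoding it -- the payoff of a byte count;
    // with a char count the bytes would have to be decoded to find the end.
    static void skipString(DataInput in) throws IOException {
        in.skipBytes(in.readInt());
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        writeString(out, "\u4e2d\u6587", "UTF-8"); // 4-byte count + 6 bytes
        writeString(out, "elephant", "UTF-8");     // 4-byte count + 8 bytes

        DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(buf.toByteArray()));
        skipString(in);                     // hop over the CJK term untouched
        byte[] b = new byte[in.readInt()];
        in.readFully(b);
        System.out.println(new String(b, "UTF-8")); // prints "elephant"
    }
}
```

A byte-based buffer built this way would work identically whether the index were UTF-8 or UTF-16, which is exactly Marvin's point.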
RE: IndexWriter.addIndexes(Directory[] dirs)
Volodymyr,

I tried this patch out, and unfortunately it doesn't appear to work for me. Is there something I missed? I'll try attaching my JUnit test case: it passes when the code is unpatched, but fails on the final assertion expecting 2 hits (on line 63) when I use the patched IndexWriter.java.

Thanks,
Kevin

-----Original Message-----
From: Volodymyr Bychkoviak [mailto:[EMAIL PROTECTED]
Sent: Monday, December 12, 2005 5:51 AM
To: java-dev@lucene.apache.org
Subject: IndexWriter.addIndexes(Directory[] dirs)

IndexWriter's addIndexes(Directory[] dirs) method optimizes the index before and after the operation. Some notes about this:

1) Adding sub-indexes to a large index can take a long time because of the double optimization.
2) It breaks the IndexWriter.maxMergeDocs logic, because optimize() merges all data into a single-segment index.

I suggest adding a new method with a boolean parameter to optionally specify whether the index should be optimized. There is a similar method, addIndexes(IndexReader[] readers), in IndexWriter that takes an array of IndexReaders, but I don't know how it could be modified to provide the same optional behavior.

Patch attached here to discuss it first (should I post it directly to JIRA?)

--
regards,
Volodymyr Bychkoviak
RE: IndexWriter.addIndexes(Directory[] dirs)
I see it stripped my attachment off. Here's the code:

import junit.framework.TestCase;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class AddIndexesTest extends TestCase {

  public AddIndexesTest(String name) {
    super(name);
  }

  public void testAddIndexes() throws Exception {
    {
      Directory dir1 = FSDirectory.getDirectory("/dev/searchdata/addIndexesTest1", true);
      IndexWriter writer1 = new IndexWriter(dir1, new StandardAnalyzer(), true);
      Document doc1 = new Document();
      doc1.add(Field.UnIndexed("ID", "id1"));
      doc1.add(Field.UnStored("f", "some words"));
      writer1.addDocument(doc1);
      writer1.close();
      dir1.close();

      IndexSearcher searcher = new IndexSearcher("/dev/searchdata/addIndexesTest1");
      Hits hits = searcher.search(new TermQuery(new Term("f", "words")));
      assertEquals(1, hits.length());
      searcher.close();
    }
    {
      Directory dir2 = FSDirectory.getDirectory("/dev/searchdata/addIndexesTest2", true);
      IndexWriter writer2 = new IndexWriter(dir2, new StandardAnalyzer(), true);
      Document doc1 = new Document();
      doc1.add(Field.UnIndexed("ID", "id2"));
      doc1.add(Field.UnStored("f", "some other words"));
      writer2.addDocument(doc1);
      writer2.close();
      dir2.close();

      IndexSearcher searcher = new IndexSearcher("/dev/searchdata/addIndexesTest2");
      Hits hits = searcher.search(new TermQuery(new Term("f", "words")));
      assertEquals(1, hits.length());
      searcher.close();
    }

    Directory dir = FSDirectory.getDirectory("/dev/searchdata/addIndexesTest1", false);
    IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), false);
    writer.addIndexes(new Directory[] {
        FSDirectory.getDirectory("/dev/searchdata/addIndexesTest2", false) });
    writer.close();
    dir.close();

    IndexSearcher searcher = new IndexSearcher("/dev/searchdata/addIndexesTest1");
    Hits hits = searcher.search(new TermQuery(new Term("f", "words")));
    assertEquals(2, hits.length());
    searcher.close();
  }
}

-----Original Message-----
From: Kevin Oliver
Sent: Monday, December 12, 2005 2:53 PM
To: java-dev@lucene.apache.org
Subject: RE: IndexWriter.addIndexes(Directory[] dirs)

Volodymyr,

I tried this patch out, and unfortunately it doesn't appear to work for me. Is there something I missed? I'll try attaching my JUnit test case: it passes when the code is unpatched, but fails on the final assertion expecting 2 hits (on line 63) when I use the patched IndexWriter.java.

Thanks,
Kevin

-----Original Message-----
From: Volodymyr Bychkoviak [mailto:[EMAIL PROTECTED]
Sent: Monday, December 12, 2005 5:51 AM
To: java-dev@lucene.apache.org
Subject: IndexWriter.addIndexes(Directory[] dirs)

IndexWriter's addIndexes(Directory[] dirs) method optimizes the index before and after the operation. Some notes about this:

1) Adding sub-indexes to a large index can take a long time because of the double optimization.
2) It breaks the IndexWriter.maxMergeDocs logic, because optimize() merges all data into a single-segment index.

I suggest adding a new method with a boolean parameter to optionally specify whether the index should be optimized. There is a similar method, addIndexes(IndexReader[] readers), in IndexWriter that takes an array of IndexReaders, but I don't know how it could be modified to provide the same optional behavior.

Patch attached here to discuss it first (should I post it directly to JIRA?)

--
regards,
Volodymyr Bychkoviak
Directory Implementation: Java Content Repository
Hi,

I've implemented a Directory (org.apache.lucene.store.Directory) using the Java Content Repository API (http://www.jcp.org/en/jsr/detail?id=170). With it, indexes can be stored in any persistence technology supported by a Java Content Repository implementation. For example, Jackrabbit (the reference implementation - http://incubator.apache.org/jackrabbit/) currently supports relational databases (JDBC, Hibernate, OJB), the file system, Berkeley DB Java Edition, and more.

I wish to contribute the code to the community (if someone is interested in it). Where do I begin?

Regards,
Nicolas Bélisle
Re: Directory Implementation: Java Content Repository
Nicolas,

This is great news! The first step to getting your code into the Lucene repository is to organize it in a manner similar to the other contributed projects; I think this one fits nicely as contrib/jcr. License the code under the Apache license, get the build process working (hopefully with unit tests, using the infrastructure already in place for the contrib projects), and finally attach it to a JIRA issue as a .zip file.

Thanks!
Erik

On Dec 12, 2005, at 6:20 PM, Nicolas Belisle wrote:

> Hi,
>
> I've implemented a Directory (org.apache.lucene.store.Directory) using the
> Java Content Repository API (http://www.jcp.org/en/jsr/detail?id=170). With
> it, indexes can be stored in any persistence technology supported by a Java
> Content Repository implementation. For example, Jackrabbit (the reference
> implementation - http://incubator.apache.org/jackrabbit/) currently supports
> relational databases (JDBC, Hibernate, OJB), the file system, Berkeley DB
> Java Edition, and more.
>
> I wish to contribute the code to the community (if someone is interested in
> it). Where do I begin?
>
> Regards,
> Nicolas Bélisle
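For concreteness, a contrib module of that era would be laid out roughly as follows. The directory and class names below are illustrative guesses by analogy with the existing contrib modules, not a prescribed structure:

```
contrib/jcr/
  build.xml        # hooks into the shared contrib build infrastructure
  README.txt
  src/java/org/apache/lucene/store/jcr/JCRDirectory.java
  src/test/org/apache/lucene/store/jcr/TestJCRDirectory.java
```

The patch attached to JIRA would then be a .zip of this tree, as Erik describes.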
[jira] Commented: (LUCENE-323) [PATCH] MultiFieldQueryParser and BooleanQuery do not provide adequate support for queries across multiple fields
[ http://issues.apache.org/jira/browse/LUCENE-323?page=comments#action_12360275 ]

Yonik Seeley commented on LUCENE-323:
-------------------------------------

Thanks for the changes Chuck! Your patch was backwards, BTW :-)

I haven't had a chance to run any benchmarks, but I committed this because it also fixes a bug. Since it also looks like the uses of /2 and *2 were all unsigned, I replaced them with shifts. The multiply doesn't matter much, but IDIV is horribly slow (between 20 and 80 cycles, depending on the arch and operands). Not that I thought it was a bottleneck, but I have trouble avoiding that "root of all evil", premature optimization ;-)

> [PATCH] MultiFieldQueryParser and BooleanQuery do not provide adequate
> support for queries across multiple fields
> ----------------------------------------------------------------------
>
>          Key: LUCENE-323
>          URL: http://issues.apache.org/jira/browse/LUCENE-323
>      Project: Lucene - Java
>         Type: Bug
>   Components: QueryParser
>     Versions: 1.4
>  Environment: Operating System: Windows XP
>               Platform: PC
>     Reporter: Chuck Williams
>     Assignee: Lucene Developers
>  Attachments: DisjunctionMaxQuery.java, DisjunctionMaxScorer.java,
>               TestDisjunctionMaxQuery.java, TestMaxDisjunctionQuery.java,
>               TestRanking.zip, TestRanking.zip, TestRanking.zip,
>               WikipediaSimilarity.java, WikipediaSimilarity.java,
>               WikipediaSimilarity.java, dms.tar.gz
>
> The attached test case demonstrates this problem and provides a fix:
>
> 1. Use a custom similarity to eliminate all tf and idf effects, just to
>    isolate what is being tested.
> 2. Create two documents doc1 and doc2, each with two fields title and
>    description. doc1 has "elephant" in title and "elephant" in description.
>    doc2 has "elephant" in title and "albino" in description.
> 3. Express a query for "albino elephant" against both fields. Problems:
>    a. MultiFieldQueryParser won't recognize either document as containing
>       both terms, due to the way it expands the query across fields.
>    b. Expressing the query as "title:albino description:albino
>       title:elephant description:elephant" will score both documents
>       equivalently, since each matches two query terms.
> 4. Comparison to MaxDisjunctionQuery and my method for expanding queries
>    across fields. Using notation where ( ) represents a BooleanQuery and
>    ( | ) represents a MaxDisjunctionQuery, "albino elephant" expands to:
>      ( (title:albino | description:albino)
>        (title:elephant | description:elephant) )
>    This will recognize that doc2 has both terms matched while doc1 has only
>    one term matched, scoring doc2 over doc1.
>
>    Refinement note: the actual expansion for "albino elephant" that I use is:
>      ( (title:albino | description:albino)~0.1
>        (title:elephant | description:elephant)~0.1 )
>    This causes the score of each MaxDisjunctionQuery to be the score of the
>    highest-scoring MDQ subclause plus 0.1 times the sum of the scores of the
>    other MDQ subclauses. Thus, doc1 gets some credit for also having
>    "elephant" in the description, but only 1/10 as much as doc2 gets for
>    covering another query term in its description. If doc3 has "elephant" in
>    title and both "albino" and "elephant" in the description, then with the
>    actual refined expansion it gets the highest score of all (whereas with
>    pure max, without the 0.1, it would get the same score as doc2).
>
> In real apps, tf's and idf's also come into play of course, but can affect
> these either way (i.e., mitigate this fundamental problem or exacerbate it).

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
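Chuck's refined scoring rule can be checked numerically. The sketch below is not the actual DisjunctionMaxScorer — it is just the arithmetic the issue describes, under the stated custom similarity where every term match contributes exactly 1.0:

```java
public class TiebreakDemo {

    // Score of one (title:t | description:t)~tiebreak clause: the maximum
    // subclause score, plus tiebreak times the sum of the other subscores.
    static double maxDisjunction(double tiebreak, double[] subScores) {
        double max = 0.0, sum = 0.0;
        for (int i = 0; i < subScores.length; i++) {
            max = Math.max(max, subScores[i]);
            sum += subScores[i];
        }
        return max + tiebreak * (sum - max);
    }

    public static void main(String[] args) {
        double t = 0.1;
        // Subscores per clause are {title, description}; a match scores 1.
        // doc1: "elephant" in title and description, no "albino".
        double doc1 = maxDisjunction(t, new double[] {0, 0})   // albino clause
                    + maxDisjunction(t, new double[] {1, 1});  // elephant clause
        // doc2: "elephant" in title, "albino" in description.
        double doc2 = maxDisjunction(t, new double[] {0, 1})
                    + maxDisjunction(t, new double[] {1, 0});
        // doc3: "elephant" in title, "albino" and "elephant" in description.
        double doc3 = maxDisjunction(t, new double[] {0, 1})
                    + maxDisjunction(t, new double[] {1, 1});

        System.out.println(doc1); // 1.1
        System.out.println(doc2); // 2.0
        System.out.println(doc3); // 2.1 -- doc3 > doc2 > doc1, as described
    }
}
```

With tiebreak 0, doc2 and doc3 would tie at 2.0; with tiebreak 1.0 the clause degenerates to a plain sum and doc1 ties doc2 at 2.0 - which is exactly the "albino elephant" problem the issue reports for BooleanQuery.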