Re: Using FastVectorHighlighter for snippets

2010-09-23 Thread Devshree Sane
One more observation.
The length of the snippet returned is not equal to the fragment length
specified.
Does anyone know the reason why?

On Wed, Sep 22, 2010 at 3:05 PM, Devshree Sane wrote:

> Thanks for your reply Koji.
>
> On Wed, Sep 22, 2010 at 4:51 AM, Koji Sekiguchi wrote:
>
>>  (10/09/22 3:24), Devshree Sane wrote:
>>
>>> I am a bit confused about the parameters that are passed to the
>>> FastVectorHighlighter.getBestFragments() method. One parameter is a
>>> document
>>> id and another is the maximum number of fragments. Does it mean that only
>>> the maximum number of fragments will be retrieved from document with
>>> given
>>> id (even if there are more fragments in the same document)?
>>>
>>>  Correct.
>>
>>
> I did a little experiment for this. Here are my observations: increasing
> the fragment size from 100 to 1000 characters decreased the number of
> fragments returned.
>
> Is this because the document text was covered by a few 1000-character
> fragments? If so, that means one fragment can contain more than one
> occurrence of the query term. Is that the case? If so, is there a way to
> find the number of occurrences of the query term inside a particular
> snippet/fragment?
>
> Also is there a way to get the beginning and ending positions/offsets in
> the document of the snippet/fragment being returned?
>
>
>
>
>
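[Editorial sketch] On the occurrence-counting question above: getBestFragments() returns plain strings, so one workaround is to count matches in the returned snippet after the fact. A minimal standalone sketch follows; the helper class and sample strings are hypothetical and not part of the FastVectorHighlighter API, and a raw indexOf() count is case-sensitive and not analyzer-aware:

```java
// Hypothetical helper (not part of the FastVectorHighlighter API):
// count occurrences of a query term inside a snippet string returned
// by getBestFragments(). Counts literal, case-sensitive matches only.
public class FragmentTermCount {

    static int countOccurrences(String fragment, String term) {
        int count = 0;
        int idx = fragment.indexOf(term);
        while (idx != -1) {
            count++;
            idx = fragment.indexOf(term, idx + term.length());
        }
        return count;
    }

    public static void main(String[] args) {
        // A long fragment can contain the query term more than once,
        // which is one reason fewer (but longer) fragments come back.
        String fragment = "lucene highlighting makes lucene snippets useful";
        System.out.println(countOccurrences(fragment, "lucene")); // prints 2
    }
}
```

Note this misses matches the analyzer would have found (case variants, stemming); it only illustrates why a larger fragment size can absorb several hits into one fragment.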


Problem with Numeric range query.

2010-09-23 Thread Daniel Sanders

I have a set of documents that all have a "timestamp" field which is stored as 
a long integer number. The field is indexed in my Lucene index as a number 
using NumericField with a precision step of 8: 

   Field field = new NumericField("timestamp", 8); 
   field.setLongValue( timestampValue); 

I do this so I can do numeric range queries to retrieve all documents that fall 
within a specific time range. 

The query I construct has two parts to it, a query, and a filter. I get the 
document hits as follows: 

   IndexReader reader = .. some index reader. 
   IndexSearcher searcher = new IndexSearcher(reader); 

   Filter filter = NumericRangeFilter.newLongRange("timestamp", 8, startTime, 
endTime, false, true); 
   Query query = new MatchAllDocsQuery(); 
   searcher.search( query, filter, myCollector); // myCollector is a subclass
                                                 // of Collector - saves all hits

Occasionally, I have a single document with a very specific timestamp I want to 
retrieve. Suppose that timestamp is timeX, I will create the filter as follows: 

   Filter filter = NumericRangeFilter.newLongRange("timestamp", 8, timeX-1, 
timeX, false, true); 

But with this filter, the document that should be found is never found. I have 
even tried expanding the time range as follows, but with no success: 

   Filter filter = NumericRangeFilter.newLongRange("timestamp", 8, timeX-1, 
timeX+500, false, true); 

Strangely, a filter that should NOT have found the document actually did find 
the document: 

   Filter filter = NumericRangeFilter.newLongRange("timestamp", 8, timeX, 
timeX+1000, false, true); 

This filter should NOT have found the document since the minInclusive argument 
is false. 

I have also noticed that sometimes when I have several documents with exactly 
the same timestamp, a query will return some, but not all, of the documents. 

I have also tried to use a NumericRangeQuery as follows: 

   Query query = NumericRangeQuery.newLongRange("timestamp", 8, timeX-1, timeX, 
false, true); 
   searcher.search( query, null, myCollector); 

This also fails to return my document(s). 

Am I doing something wrong here? Have I misunderstood how this is supposed to 
work? Has anyone else had problems like this? 


Thanks for any help or guidance or tips you can give me, 

-Daniel Sanders


RE: Problem with Numeric range query.

2010-09-23 Thread Uwe Schindler
Hi,

Can you provide a self-contained test case that shows your problem? In
most cases those problems are caused by not committing changes to the
IndexWriter before opening the IndexReader.

Additionally, if you only want to look for exactly one timestamp (like a
TermQuery), use a NumericRangeQuery with upper+lower inclusive = true and
use the specific value to search for as both upper and lower.

You may also hit a bug, that's already solved in SVN (it happens when the
lower bound is near Long.MAX_VALUE or the upper bound near Long.MIN_VALUE):
https://issues.apache.org/jira/browse/LUCENE-2541

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Problem with Numeric range query.

2010-09-23 Thread Daniel Sanders

Thank you for your timely response. 

It's going to take me longer to create an isolated test case you can test this 
with.  I will see what I can do. 

In the meantime, I have some follow up information in response to your other 
suggestions. 

1) I don't think my problem is that the IndexWriter has not committed the
document. Here's why:

In my test case, I first retrieve a document using a different Lucene query on
a different field. From that document I extract the value of the timestamp
field and then perform the NumericRangeQuery on that value as described below.
I was doing this as a way to create a unit test that would verify that the
NumericRangeQuery was working properly. I think the fact that the first query
found the document is evidence that the IndexWriter had committed the
document. Hence, I would expect that if I follow that query with a
NumericRangeQuery, it should be able to find the same document.

2) I also don't think my problem is values near Long.MIN_VALUE or 
Long.MAX_VALUE.  My values are all timestamps, which are positive integers that 
are not anywhere near those two extremes.  The values originally come from the 
java.util.Date.getTime() method. 

3) I will try upper and lower inclusive = true with the same value for min
and max, although I don't see how that will change anything. I have actually
debugged through the code for NumericRangeQuery: if minInclusive == false,
then min is incremented, and if maxInclusive == false, then max is
decremented. So my query:

   NumericRangeQuery.newLongRange("timestamp",8,timeX-1,timeX,false,true) 

is essentially equivalent to the query you suggest trying: 

   NumericRangeQuery.newLongRange("timestamp",8,timeX,timeX,true,true) 

right? 
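[Editorial sketch] The equivalence claim above can be checked in isolation. The block below is a standalone model of the bound normalization described (not Lucene code; the method name is made up for illustration):

```java
// Standalone model (not Lucene code) of the normalization described above:
// an exclusive lower bound is incremented and an exclusive upper bound is
// decremented, so (timeX-1, timeX] covers the same range as [timeX, timeX].
public class BoundNormalization {

    static long[] normalize(long min, long max, boolean minInclusive, boolean maxInclusive) {
        long effMin = minInclusive ? min : min + 1;
        long effMax = maxInclusive ? max : max - 1;
        return new long[] { effMin, effMax };
    }

    public static void main(String[] args) {
        long timeX = 1285200000000L; // an arbitrary Date.getTime()-style value
        long[] exclusiveLower = normalize(timeX - 1, timeX, false, true);
        long[] bothInclusive  = normalize(timeX, timeX, true, true);
        // Both forms cover exactly [timeX, timeX]
        System.out.println(exclusiveLower[0] == bothInclusive[0]
                && exclusiveLower[1] == bothInclusive[1]); // prints true
    }
}
```

If the two forms behave differently against a real index, the difference would have to come from somewhere else (e.g. what was actually indexed), not from this bound arithmetic.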

-Daniel Sanders 


>>> "Uwe Schindler"  9/23/2010 2:04 PM >>>

RE: Problem with Numeric range query.

2010-09-23 Thread Uwe Schindler
Hi,

> Thank you for your timely response.

:-)

> It's going to take me longer to create an isolated test case you can test
this
> with.  I will see what I can do.

That would be fine. Often with a simple test those errors disappear, because
they turn out to be a problem in the logic somewhere else :) But you should
try it in any case.

> In the meantime, I have some follow up information in response to your
other
> suggestions.
> 
> 1) I don't think my problem is that the IndexWriter has not committed the
> document.  Here's why:
> 
> 
> In my test case, I first retrieve a document using a different lucene
query on a
> different field.  From that document I extract the value for timestamp
field and
> then perform the NumericRangeQuery on that value as described below.  I
was
> doing as a way to create a unit test that would verify that the
> NumericRangeQuery was working properly.  I think the fact that first query
> found the document is evidence that the IndexWriter had committed the
> document.  Hence, I would expect that if I follow that query with a
> NumericRangeQuery it should be able to find the same document.

Yes. But are you sure that the timestamp is also indexed? If it's only
stored, the query would not find it. Or maybe the other way round.

> 2) I also don't think my problem is values near Long.MIN_VALUE or
> Long.MAX_VALUE.  My values are all timestamps, which are positive integers
> that are not anywhere near those two extremes.  The values originally come
> from the java.util.Date.getTime() method.
> 
> 3) I will try the upper+lower inclusive = true and using same value for
min and
> max, although I don't see how that will change anything.  I have actually
> debugged through the code for NumericRangeQuery, and if minInclusive ==
> false, then min is incremented, and if maxInclusive == false, then max is
> decremented.  So my query:
> 
>NumericRangeQuery.newLongRange("timestamp",8,timeX-1,timeX,false,true)
> 
> is essentially equivalent to the query you suggest trying:
> 
>NumericRangeQuery.newLongRange("timestamp",8,timeX,timeX,true,true)
> 
> right?

Yes, it is the same. The Lucene test
TestNumericRangeQuery64.testOneMatchQuery() verifies the upper == lower,
inclusive = true case.

Uwe





RE: Problem with Numeric range query.

2010-09-23 Thread Daniel Sanders

I'm certain the timestamp field is being indexed.  It is created as follows: 

   Document doc = new Document(); 
 
   NumericField timeField = new NumericField("timestamp", 8); // defaults to indexed = true
   timeField.setLongValue( timeX); 
   doc.add( timeField); 
... 
   indexWriter.addDocument(doc); 
... 
   indexWriter.commit(); 

-Daniel 



>>> "Uwe Schindler"  9/23/2010 3:02 PM >>>





ArrayIndexOutOfBoundsException when iterating over TermDocs

2010-09-23 Thread Shay Banon
Hi,

A user got this very strange exception, and I managed to get the index
that it happens on. Basically, iterating over the TermDocs causes an AIOOB
exception. I easily reproduced it using the FieldCache, which does exactly
that (the field in question is indexed as numeric). Here is the exception:

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 114
at org.apache.lucene.util.BitVector.get(BitVector.java:104)
 at org.apache.lucene.index.SegmentTermDocs.next(SegmentTermDocs.java:127)
at
org.apache.lucene.search.FieldCacheImpl$LongCache.createValue(FieldCacheImpl.java:501)
 at
org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:183)
at org.apache.lucene.search.FieldCacheImpl.getLongs(FieldCacheImpl.java:470)
 at TestMe.main(TestMe.java:56)

It happens on the following segment: _26t docCount: 914 delCount: 1
delFileName: _26t_1.del

And as you can see, it smells like a corner case (it fails for document
number 912, the AIOOB happens from the deleted docs). The code to recreate
it is simple:

FSDirectory dir = FSDirectory.open(new File("index"));
IndexReader reader = IndexReader.open(dir, true);

IndexReader[] subReaders = reader.getSequentialSubReaders();
for (IndexReader subReader : subReaders) {
Field field =
subReader.getClass().getSuperclass().getDeclaredField("si");
field.setAccessible(true);
SegmentInfo si = (SegmentInfo) field.get(subReader);
System.out.println("--> " + si);
if (si.getDocStoreSegment().contains("_26t")) {
// this is the problematic one...
System.out.println("problematic one...");
FieldCache.DEFAULT.getLongs(subReader, "__documentdate",
FieldCache.NUMERIC_UTILS_LONG_PARSER);
}
}

Here is the result of a check index on that segment:

  8 of 10: name=_26t docCount=914
compound=true
hasProx=true
numFiles=2
size (MB)=1.641
diagnostics = {optimize=false, mergeFactor=10,
os.version=2.6.18-194.11.1.el5.centos.plus, os=Linux, mergeDocStores=true,
lucene.version=3.0.2 953716 - 2010-06-11 17:13:53, source=merge,
os.arch=amd64, java.version=1.6.0, java.vendor=Sun Microsystems Inc.}
has deletions [delFileName=_26t_1.del]
test: open reader.OK [1 deleted docs]
test: fields..OK [32 fields]
test: field norms.OK [32 fields]
test: terms, freq, prox...ERROR [114]
java.lang.ArrayIndexOutOfBoundsException: 114
at org.apache.lucene.util.BitVector.get(BitVector.java:104)
 at org.apache.lucene.index.SegmentTermDocs.next(SegmentTermDocs.java:127)
at
org.apache.lucene.index.SegmentTermPositions.next(SegmentTermPositions.java:102)
 at org.apache.lucene.index.CheckIndex.testTermIndex(CheckIndex.java:616)
at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:509)
 at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:299)
at TestMe.main(TestMe.java:47)
test: stored fields...ERROR [114]
java.lang.ArrayIndexOutOfBoundsException: 114
at org.apache.lucene.util.BitVector.get(BitVector.java:104)
 at
org.apache.lucene.index.ReadOnlySegmentReader.isDeleted(ReadOnlySegmentReader.java:34)
at org.apache.lucene.index.CheckIndex.testStoredFields(CheckIndex.java:684)
 at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:512)
at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:299)
 at TestMe.main(TestMe.java:47)
test: term vectorsERROR [114]
java.lang.ArrayIndexOutOfBoundsException: 114
at org.apache.lucene.util.BitVector.get(BitVector.java:104)
 at
org.apache.lucene.index.ReadOnlySegmentReader.isDeleted(ReadOnlySegmentReader.java:34)
at org.apache.lucene.index.CheckIndex.testTermVectors(CheckIndex.java:721)
 at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:515)
at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:299)
 at TestMe.main(TestMe.java:47)



The creation of the index does not do anything fancy (all defaults), though
there is usage of the near-real-time aspect (IndexWriter#getReader), which
does complicate deleted-docs handling. It seems like the deleted docs got
written without matching the number of docs? Sadly, I don't have something
that recreates it from scratch, but I do have the index if someone wants to
have a look at it (mail me directly and I will provide a download link).

I will continue to investigate why this might happen; just wondering if
someone has stumbled on this exception before. Lucene 3.0.2 is used.

-shay.banon
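[Editorial sketch] The arithmetic behind the corner case above is easy to mimic standalone. A segment with 914 documents needs ceil(914/8) = 115 bytes of deletion bits; if the backing array effectively carries one byte fewer, asking about any document >= 912 reads byte 912 >> 3 = 114 and overruns the array. This toy sketch is not Lucene's BitVector, just an illustration of the bit indexing:

```java
// Toy illustration (not Lucene's BitVector): deletion flags packed one bit
// per document. With only 114 bytes backing a 914-doc segment, checking
// doc 912 reads byte index 912 >> 3 = 114 and throws AIOOB, as in the trace.
public class DeletedDocsBits {

    static boolean isDeleted(byte[] bits, int doc) {
        return (bits[doc >> 3] & (1 << (doc & 7))) != 0;
    }

    public static void main(String[] args) {
        byte[] truncated = new byte[114]; // one byte short of ceil(914/8) = 115
        boolean threw = false;
        try {
            isDeleted(truncated, 912);
        } catch (ArrayIndexOutOfBoundsException e) {
            threw = true;
        }
        System.out.println(threw); // prints true
    }
}
```

This is consistent with the observation that document 912 fails in a 914-doc segment: the deletion bits were written one byte short of the doc count.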


In lucene 2.3.2, needs to stop optimization?

2010-09-23 Thread Zhang, Lisheng
Hi,

We are using Lucene 2.3.2, and now we need to index each document as
fast as possible, so users can search it almost immediately.

So I am considering stopping IndexWriter optimization during real time;
then, in a relatively off-peak time like late night, we may call the
IndexWriter optimize method explicitly once.

What is the most efficient way to completely turn off IndexWriter merging
in Lucene 2.3.2?

Thanks very much for helps, Lisheng




Re: ArrayIndexOutOfBoundsException when iterating over TermDocs

2010-09-23 Thread Simon Willnauer
Shay,

would you mind opening a Jira issue for that?

simon

On Fri, Sep 24, 2010 at 2:53 AM, Shay Banon  wrote:

RE: In lucene 2.3.2, needs to stop optimization?

2010-09-23 Thread Zhang, Lisheng
Hi,

I read the documentation/code and did some experiments. One possibility
is to raise mergeFactor to a high value, say close to 2 billion, but then
a lot of small files are created; after >500 docs were added separately,
search speed dropped sharply.

I noticed with our current data that if I add one doc and then call
optimize(), it takes about 7s, which is too slow for real-time search.

If I keep mergeFactor at 10 and do not call optimize(), does that mean
IndexWriter will merge in the background from time to time, and when that
happens it may take a few seconds (so indexing will be delayed a few
seconds)?

Should I use a high mergeFactor and optimize once a day, or use the
default mergeFactor and not call optimize()? Maybe the latter is better,
but I am concerned about the occasional slowness.

Currently I do not plan to keep the IndexWriter constantly open, but to
open/close it for each index request.

Any suggestions for improvement would be appreciated,

Lisheng 
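[Editorial sketch] Under Lucene's logarithmic merge model, roughly speaking, every mergeFactor same-level segments merge into one segment at the next level, so after N single-document flushes the segment count is about the digit sum of N in base mergeFactor. This toy standalone model (not Lucene code) shows why a huge mergeFactor leaves about one segment per document while the default keeps the count small:

```java
// Toy model (not Lucene code) of logarithmic merging: every mergeFactor
// same-level segments collapse into one at the next level, so the segment
// count after N one-doc flushes is the digit sum of N in base mergeFactor.
public class MergeModel {

    static int segmentsAfter(int docs, int mergeFactor) {
        int segments = 0;
        for (int n = docs; n > 0; n /= mergeFactor) {
            segments += n % mergeFactor;
        }
        return segments;
    }

    public static void main(String[] args) {
        System.out.println(segmentsAfter(500, 10));         // prints 5
        System.out.println(segmentsAfter(500, 2000000000)); // prints 500
    }
}
```

This matches the observation above: with mergeFactor near 2 billion, 500 separately added docs leave roughly 500 tiny segments, and searching across that many segments slows down sharply.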
