[jira] Resolved: (LUCENE-1095) StopFilter should have option to incr positionIncrement after stop word

2007-12-31 Thread Doron Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen resolved LUCENE-1095.
-

   Resolution: Fixed
Lucene Fields: [Patch Available]  (was: [New])

Committed. (Actually committed yesterday; I was sure I had already resolved 
it...)

> StopFilter should have option to incr positionIncrement after stop word
> ---
>
> Key: LUCENE-1095
> URL: https://issues.apache.org/jira/browse/LUCENE-1095
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Hoss Man
>Assignee: Doron Cohen
> Attachments: lucene-1095-pos-incr.patch, lucene-1095-pos-incr.patch, 
> lucene-1095-pos-incr.patch
>
>
> I've seen this come up on the mailing list a few times in the last month, so 
> I'm filing a known bug/improvement around it...
> StopFilter should have an option that, if set, records how many stop words are 
> "skipped" in a row, and then sets that value as the positionIncrement on the 
> "next" token that StopFilter does return.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Created: (LUCENE-1112) Document is partially indexed on an unhandled exception

2007-12-31 Thread Michael McCandless (JIRA)
Document is partially indexed on an unhandled exception
---

 Key: LUCENE-1112
 URL: https://issues.apache.org/jira/browse/LUCENE-1112
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.3
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.3


With LUCENE-843, it's now possible for a subset of a document's
fields/terms to be indexed or stored when an exception is hit.  This
was not the case in the past (it was "all or none").

I plan to make it "all or none" again by immediately marking a
document as deleted if any exception is hit while indexing it.

Discussion leading up to this:

  http://www.gossamer-threads.com/lists/lucene/java-dev/56103
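For context, a hedged sketch of the "all or none" expectation described above 
(illustrative test code, not the actual patch): an analyzer that fails part-way 
through a document should leave nothing of that document visible in the index.

import java.io.IOException;
import java.io.Reader;

import junit.framework.TestCase;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;

public class TestAllOrNoneIndexing extends TestCase {

  // Analyzer whose token stream throws after emitting two tokens, simulating an
  // exception in the middle of indexing a document.
  static class FailingAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
      final TokenStream in = new WhitespaceAnalyzer().tokenStream(fieldName, reader);
      return new TokenStream() {
        private int count = 0;
        public Token next() throws IOException {
          if (++count > 2) {
            throw new IOException("simulated failure in the middle of a document");
          }
          return in.next();
        }
      };
    }
  }

  public void testFailedAddLeavesNothingBehind() throws Exception {
    IndexWriter writer = new IndexWriter(new RAMDirectory(), new FailingAnalyzer(), true);
    Document doc = new Document();
    doc.add(new Field("body", "one two three four five",
                      Field.Store.NO, Field.Index.TOKENIZED));
    try {
      writer.addDocument(doc);
      fail("expected the simulated failure to propagate");
    } catch (IOException expected) {
      // expected
    }
    // The intent of this issue: the failed document must not show up, not even partially.
    assertEquals(0, writer.docCount());
    writer.close();
  }
}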


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





Re: Let's release Lucene 2.3 soon?

2007-12-31 Thread Michael McCandless
I just opened a new issue, which I think should be fixed for 2.3, to
fix IndexWriter.add/updateDocument to not "partially add" a document
when an exception is hit:

  https://issues.apache.org/jira/browse/LUCENE-1112

I'll try to work out a patch by Thu but it may be tight...

Mike

Michael Busch <[EMAIL PROTECTED]> wrote:
> Michael Busch wrote:
> >
> > I think a good target would be to complete all 2.3 issues by end of this
> > year. Then we can start a code freeze beginning of next year, so that
> > we'll have 2.3 out hopefully by mid/end of January '08. I would
> > volunteer to act as the release manager again.
> >
>
> Hi Team,
>
> perfect timing! As of today all 2.3 issues are committed (thanks
> everyone!). If nobody objects I will create a 2.3 branch on Thursday,
> Jan 3rd, and we will have a code freeze on the branch for approx. 10
> days. In this time period only critical/blocking issues and
> documentation patches can be committed to the branch.
>
> -Michael




Re: DocumentsWriter.checkMaxTermLength issues

2007-12-31 Thread Michael McCandless
Doron Cohen <[EMAIL PROTECTED]> wrote:
> I like the approach of configuration of this behavior in Analysis
> (and so IndexWriter can throw an exception on such errors).
>
> It seems that this should be a property of Analyzer vs.
> just StandardAnalyzer, right?
>
> It can probably be a "policy" property, with two parameters:
> 1) maxLength, 2) action: chop/split/ignore/raiseException when
> generating too long tokens.

Agreed, this should be generic/shared to all analyzers.

But maybe for 2.3, we just truncate any too-long term to the max
allowed size, and then after 2.3 we make this a settable "policy"?

> Doron
>
> On Dec 21, 2007 10:46 PM, Michael McCandless <[EMAIL PROTECTED]>
> wrote:
>
> >
> > I think this is a good approach -- any objections?
> >
> > This way, IndexWriter is in-your-face (throws TermTooLongException on
> > seeing a massive term), but StandardAnalyzer is robust (silently
> > skips or prefixes the too-long terms).
> >
> > Mike
> >
> > Gabi Steinberg wrote:
> >
> > > How about defaulting to a max token size of 16K in
> > > StandardTokenizer, so that it never causes an IndexWriter
> > > exception, with an option to reduce that size?
> > >
> > > The backward incompatibility is limited then - tokens exceeding 16K 
> > > will NOT cause an IndexWriter exception.  In 3.0 we can reduce 
> > > that default to a useful size.
> > >
> > > The option to truncate the token can be useful, I think.  It will
> > > index the max size prefix of the long tokens.  You can still find
> > > them, pretty accurately - this becomes a prefix search, but is
> > > unlikely to return multiple values because it's a long prefix.  It
> > > allows you to choose a relatively small max, such as 32 or 64,
> > > reducing the overhead caused by junk in the documents while
> > > minimizing the chance of not finding something.
> > >
> > > Gabi.
> > >
> > > Michael McCandless wrote:
> > >> Gabi Steinberg wrote:
> > >>> On balance, I think that dropping the document makes sense.  I
> > >>> think Yonik is right in that ensuring that keys are useful - and
> > >>> indexable - is the tokenizer's job.
> > >>>
> > >>> StandardTokenizer, in my opinion, should behave similarly to a
> > >>> person looking at a document and deciding which tokens should be
> > >>> indexed.  Few people would argue that a 16K block of binary data
> > >>> is useful for searching, but it's reasonable to suggest that the
> > >>> text around it is useful.
> > >>>
> > >>> I know that one can add the LengthFilter to avoid this problem,
> > >>> but this is not really intuitive; one does not expect the
> > >>> standard tokenizer to generate tokens that IndexWriter chokes on.
> > >>>
> > >>> My vote is to:
> > >>> - drop documents with tokens longer than 16K, as Mike and Yonik
> > >>> suggested
> > > - because an uninformed user would start with StandardTokenizer, I 
> > >>> think it should limit token size to 128 bytes, and add options to
> > >>> change that size, choose between truncating or dropping longer
> > > tokens, and in no case produce tokens longer than what 
> > >>> IndexWriter can digest.
> > >> I like this idea, though we probably can't do that until 3.0 so we
> > >> don't break backwards compatibility?
> > > ...




[jira] Resolved: (LUCENE-488) adding docs with large (binary) fields of 5mb causes OOM regardless of heap size

2007-12-31 Thread Doron Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen resolved LUCENE-488.


Resolution: Fixed

This problem was resolved by LUCENE-843, after which stored fields are written 
directly into the directory (and therefore no longer accumulate in RAM). 

It is interesting that the test provided here was allocating a new byte buffer 
of 2 - 10 MB for each added doc. This by itself could eventually lead to OOMs, 
because as the program ran longer it became harder to allocate contiguous 
chunks of those sizes.  Enhancing binary fields with offset and length (?) 
would allow applications to reuse the input byte array and allocate fewer of 
those buffers.

> adding docs with large (binary) fields of 5mb causes OOM regardless of heap 
> size
> 
>
> Key: LUCENE-488
> URL: https://issues.apache.org/jira/browse/LUCENE-488
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 1.9
> Environment: Linux asimov 2.6.6.hoss1 #1 SMP Tue Jul 6 16:31:01 PDT 
> 2004 i686 GNU/Linux
>Reporter: Hoss Man
> Attachments: TestBigBinary.java
>
>
> as reported by George Washington in a message to [EMAIL PROTECTED] with 
> subject "Storing large text or binary source documents in the index and memory 
> usage" around 2006-01-21 there seems to be a problem with adding docs 
> containing really large fields.
> I'll attach a test case in a moment; note that (for me) regardless of how big 
> I make my heap size, and regardless of what value I set MIN_MB to, once it 
> starts trying to make documents containing 5mb of data, it can only add 9 
> before it rolls over and dies.
> here's the output from the code I will attach in a moment...
> [junit] Testsuite: org.apache.lucene.document.TestBigBinary
> [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 78.656 sec
> [junit] - Standard Output ---
> [junit] NOTE: directory will not be cleaned up automatically...
> [junit] Dir: 
> /tmp/org.apache.lucene.document.TestBigBinary.97856146.100iters.4mb
> [junit] iters completed: 100
> [junit] totalBytes Allocated: 419430400
> [junit] NOTE: directory will not be cleaned up automatically...
> [junit] Dir: 
> /tmp/org.apache.lucene.document.TestBigBinary.97856146.100iters.5mb
> [junit] iters completed: 9
> [junit] totalBytes Allocated: 52428800
> [junit] -  ---
> [junit] Testcase: 
> testBigBinaryFields(org.apache.lucene.document.TestBigBinary):Caused an 
> ERROR
> [junit] Java heap space
> [junit] java.lang.OutOfMemoryError: Java heap space
> [junit] Test org.apache.lucene.document.TestBigBinary FAILED

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Updated: (LUCENE-1112) Document is partially indexed on an unhandled exception

2007-12-31 Thread Doron Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen updated LUCENE-1112:


Attachment: lucene-1112-test.patch

Patch demonstrating the problem: testWickedLongTerm() modified to fail when 
numDocs grows although addDocument() throws an exception.

> Document is partially indexed on an unhandled exception
> ---
>
> Key: LUCENE-1112
> URL: https://issues.apache.org/jira/browse/LUCENE-1112
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 2.3
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.3
>
> Attachments: lucene-1112-test.patch
>
>
> With LUCENE-843, it's now possible for a subset of a document's
> fields/terms to be indexed or stored when an exception is hit.  This
> was not the case in the past (it was "all or none").
> I plan to make it "all or none" again by immediately marking a
> document as deleted if any exception is hit while indexing it.
> Discussion leading up to this:
>   http://www.gossamer-threads.com/lists/lucene/java-dev/56103

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Resolved: (LUCENE-458) Merging may create duplicates if the JVM crashes half way through

2007-12-31 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch resolved LUCENE-458.
--

Resolution: Duplicate

The problem here apparently is that when the JVM crashed, not all files were 
properly synced to the file system.
This seems to be a similar problem to LUCENE-1044. 

> Merging may create duplicates if the JVM crashes half way through
> -
>
> Key: LUCENE-458
> URL: https://issues.apache.org/jira/browse/LUCENE-458
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 1.4
> Environment: Windows XP SP2, JDK 1.5.0_04 (crash occurred in this 
> version.  We've updated to 1.5.0_05 since, but discovered this issue with an 
> older text index since.)
>Reporter: Trejkaz
>
> In the past, our indexing process crashed due to a Hotspot compiler bug on 
> SMP systems (although it could happen with any bad native code.)  Everything 
> picked up and appeared to work, but now that it's a month later I've 
> discovered an oddity in the text index.
> We have two documents which are identical in the text index.  I know we only 
> stored the document once, for two reasons.  First, we store the MD5 of every document 
> into the hash and the MD5s were the same.  Second, we store a GUID into each 
> document which is generated uniquely for each document.  The GUID and the MD5 
> hash on these two documents, as well as all other fields, are exactly the same.
> My conclusion is that a merge was occurring at the point the JVM crashed, 
> which is consistent with the time the process crashed.  Is it possible that 
> Lucene did the copy of this document to the new location, and didn't get to 
> delete the original?
> If so, I guess this issue should be prevented somehow.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Resolved: (LUCENE-1102) EnwikiDocMaker id field

2007-12-31 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll resolved LUCENE-1102.
-

   Resolution: Fixed
Lucene Fields:   (was: [New])

Committed

> EnwikiDocMaker id field
> ---
>
> Key: LUCENE-1102
> URL: https://issues.apache.org/jira/browse/LUCENE-1102
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/benchmark
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Attachments: LUCENE-1102.patch
>
>
> The EnwikiDocMaker is fairly usable outside of the benchmarking class, but it 
> would benefit from indexing the ID field of the docs.
> Patch to follow that adds an ID field.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





Re: Let's release Lucene 2.3 soon?

2007-12-31 Thread Grant Ingersoll


On Dec 30, 2007, at 1:02 PM, Michael Busch wrote:


Grant Ingersoll wrote:


On Dec 30, 2007, at 6:29 AM, Michael Busch wrote:


In this time period only critical/blocking issues and
documentation patches can be committed to the branch.



I'd add that we should make some effort to clean up old JIRA  
issues...




I think I have a deja-vu! :) I think when we released 2.2 I forgot to
mention this and you reminded us!

But I don't think that we really cleaned up that much. Shall we do it
this time in a more coordinated manner? I think it would be great if we
could work through all unresolved issues that haven't been updated in
2007. I created a private JIRA filter (not sure how I can share a
private filter or create a public one?) and 117 issues is the result.

I will go ahead and open a couple of JIRA issues with type 'Task' and
fix version '2.3', a separate one for each package. Then whoever feels
comfortable with an issue can comment on it, close it or update it.

I'll also do a bulk update of all those 117 issues and change the
priority to minor. Any objections?


Cool.  I would just add that cleaned up doesn't necessarily mean  
closed, it could just mean reviewed and left open b/c it is still  
valid.  That being said, my guess is that most of the really, really  
old ones could be marked as won't fix, although there are still some  
interesting ones in there, mostly having to do with adding features.





[jira] Created: (LUCENE-1113) fix for Document.getBoost() documentation

2007-12-31 Thread Daniel Naber (JIRA)
fix for Document.getBoost() documentation
-

 Key: LUCENE-1113
 URL: https://issues.apache.org/jira/browse/LUCENE-1113
 Project: Lucene - Java
  Issue Type: Bug
  Components: Javadocs
Affects Versions: 2.2
Reporter: Daniel Naber
Priority: Minor
 Attachments: document-getboost.diff

The attached patch fixes the javadoc to make clear that getBoost() will not 
return a useful value in most cases. I will commit this unless someone has a 
better wording or a real fix.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Updated: (LUCENE-1113) fix for Document.getBoost() documentation

2007-12-31 Thread Daniel Naber (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Naber updated LUCENE-1113:
-

Attachment: document-getboost.diff

> fix for Document.getBoost() documentation
> -
>
> Key: LUCENE-1113
> URL: https://issues.apache.org/jira/browse/LUCENE-1113
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Javadocs
>Affects Versions: 2.2
>Reporter: Daniel Naber
>Priority: Minor
> Attachments: document-getboost.diff
>
>
> The attached patch fixes the javadoc to make clear that getBoost() will not 
> return a useful value in most cases. I will commit this unless someone has a 
> better wording or a real fix.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





Fuzzy makes no sense for short tokens

2007-12-31 Thread Timo Nentwig
Hi!

it generally makes no sense to search fuzzy for short tokens, because changing 
even a single character already leaves only a low similarity (the edit distance 
is large relative to the token length). So it actually only makes sense in this 
case:

   if( token.length() > 1f / (1f - minSimilarity) )

E.g. changing one character in a 3-letter token (foo) leaves a similarity of 
only about 0.67 (1 - 1/3). And if minSimilarity (which is by default 0.5 :-) is 
higher, we can save all the expensive rewrite() logic.
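A minimal sketch of that check applied before building the query (the helper 
class is hypothetical; FuzzyQuery and TermQuery are the standard classes):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class FuzzyOrExactQuery {
  // Only use FuzzyQuery when the token is long enough that a single edit can
  // still score at or above minSimilarity; otherwise a plain TermQuery is
  // equivalent and avoids the expensive rewrite() work.
  public static Query create(String field, String token, float minSimilarity) {
    if (token.length() > 1f / (1f - minSimilarity)) {
      return new FuzzyQuery(new Term(field, token), minSimilarity);
    }
    return new TermQuery(new Term(field, token));
  }
}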




[jira] Commented: (LUCENE-1113) fix for Document.getBoost() documentation

2007-12-31 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12555110
 ] 

Doron Cohen commented on LUCENE-1113:
-

How about:
{noformat}
   Returns, at indexing time, the boost factor as set by {@link #setBoost(float)}. 

   Note that once a document is indexed this value is no longer available
   from the index.  At search time, for retrieved documents, this method always 
   returns 1. This however does not mean that the boost value set at indexing 
   time was ignored - it was just combined with other indexing time factors and 
   stored elsewhere, for better indexing and search performance. (For more 
   info see the "norm(t,d)" part of the scoring formula in 
   {@link org.apache.lucene.search.Similarity Similarity}.)
{noformat}

> fix for Document.getBoost() documentation
> -
>
> Key: LUCENE-1113
> URL: https://issues.apache.org/jira/browse/LUCENE-1113
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Javadocs
>Affects Versions: 2.2
>Reporter: Daniel Naber
>Priority: Minor
> Attachments: document-getboost.diff
>
>
> The attached patch fixes the javadoc to make clear that getBoost() will not 
> return a useful value in most cases. I will commit this unless someone has a 
> better wording or a real fix.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





Re: DocumentsWriter.checkMaxTermLength issues

2007-12-31 Thread Yonik Seeley
On Dec 31, 2007 5:53 AM, Michael McCandless <[EMAIL PROTECTED]> wrote:
> Doron Cohen <[EMAIL PROTECTED]> wrote:
> > I like the approach of configuration of this behavior in Analysis
> > (and so IndexWriter can throw an exception on such errors).
> >
> > It seems that this should be a property of Analyzer vs.
> > just StandardAnalyzer, right?
> >
> > It can probably be a "policy" property, with two parameters:
> > 1) maxLength, 2) action: chop/split/ignore/raiseException when
> > generating too long tokens.
>
> Agreed, this should be generic/shared to all analyzers.
>
> But maybe for 2.3, we just truncate any too-long term to the max
> allowed size, and then after 2.3 we make this a settable "policy"?

But we already have a nice component model for analyzers...
why not just encapsulate truncation/discarding in a TokenFilter?

-Yonik




Re: DocumentsWriter.checkMaxTermLength issues

2007-12-31 Thread Doron Cohen
On Dec 31, 2007 6:10 PM, Yonik Seeley <[EMAIL PROTECTED]> wrote:

> On Dec 31, 2007 5:53 AM, Michael McCandless <[EMAIL PROTECTED]>
> wrote:
> > Doron Cohen <[EMAIL PROTECTED]> wrote:
> > > I like the approach of configuration of this behavior in Analysis
> > > (and so IndexWriter can throw an exception on such errors).
> > >
> > > It seems that this should be a property of Analyzer vs.
> > > just StandardAnalyzer, right?
> > >
> > > It can probably be a "policy" property, with two parameters:
> > > 1) maxLength, 2) action: chop/split/ignore/raiseException when
> > > generating too long tokens.
> >
> > Agreed, this should be generic/shared to all analyzers.
> >
> > But maybe for 2.3, we just truncate any too-long term to the max
> > allowed size, and then after 2.3 we make this a settable "policy"?
>
> But we already have a nice component model for analyzers...
> why not just encapsulate truncation/discarding in a TokenFilter?


Makes sense, especially for the implementation aspect.
I'm not sure what API you have in mind:

(1) leave that for applications, to append such a
TokenFilter to their Analyzer (== no change),

(2) DocumentsWriter to create such a TokenFilter
 under the covers, to force behavior that is defined (where?), or

(3) have an IndexingTokenFilter assigned to IndexWriter,
 make the default such filter trim/ignore/whatever as discussed
 and then applications can set a different IndexingTokenFilter for
 changing the default behavior?

I think I like the 3rd option - is this what you meant?

Doron


Re: DocumentsWriter.checkMaxTermLength issues

2007-12-31 Thread Yonik Seeley
On Dec 31, 2007 11:37 AM, Doron Cohen <[EMAIL PROTECTED]> wrote:
>
> On Dec 31, 2007 6:10 PM, Yonik Seeley <[EMAIL PROTECTED]> wrote:
>
> > On Dec 31, 2007 5:53 AM, Michael McCandless <[EMAIL PROTECTED]>
> > wrote:
> > > Doron Cohen <[EMAIL PROTECTED]> wrote:
> > > > I like the approach of configuration of this behavior in Analysis
> > > > (and so IndexWriter can throw an exception on such errors).
> > > >
> > > > It seems that this should be a property of Analyzer vs.
> > > > just StandardAnalyzer, right?
> > > >
> > > > It can probably be a "policy" property, with two parameters:
> > > > 1) maxLength, 2) action: chop/split/ignore/raiseException when
> > > > generating too long tokens.
> > >
> > > Agreed, this should be generic/shared to all analyzers.
> > >
> > > But maybe for 2.3, we just truncate any too-long term to the max
> > > allowed size, and then after 2.3 we make this a settable "policy"?
> >
> > But we already have a nice component model for analyzers...
> > why not just encapsulate truncation/discarding in a TokenFilter?
>
>
> Makes sense, especially for the implementation aspect.
> I'm not sure what API you have in mind:
>
> (1) leave that for applications, to append such a
> TokenFilter to their Analyzer (== no change),
>
> (2) DocumentsWriter to create such a TokenFilter
>  under the cover, to force behavior that is defined (where?), or
>
> (3) have an IndexingTokenFilter assigned to IndexWriter,
>  make the default such filter trim/ignore/whatever as discussed
>  and then applications can set a different IndexingTokenFilter for
>  changing the default behavior?
>
> I think I like the 3'rd option - is this what you meant?

I meant (1)... it leaves the core smaller.
I don't see any reason to have logic to truncate or discard tokens in
the core indexing code (except to handle tokens >16k as an error
condition).

Most of the time you want to catch those large tokens early on in the
chain anyway (put the filter right after the tokenizer).  Doing it
later could cause exceptions or issues with other token filters that
might not be expecting huge tokens.

-Yonik
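For illustration, a minimal sketch of option (1): a custom analyzer (the class 
name here is illustrative) that appends the existing core LengthFilter right 
after the tokenizer, so over-long tokens are dropped before anything downstream 
ever sees them.

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LengthFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class LengthLimitedAnalyzer extends Analyzer {
  private final int maxTokenLength;

  public LengthLimitedAnalyzer(int maxTokenLength) {
    this.maxTokenLength = maxTokenLength;
  }

  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream stream = new StandardTokenizer(reader);
    // LengthFilter silently discards tokens outside the [1, maxTokenLength] range.
    return new LengthFilter(stream, 1, maxTokenLength);
  }
}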




Re: DocumentsWriter.checkMaxTermLength issues

2007-12-31 Thread Grant Ingersoll


On Dec 31, 2007, at 11:44 AM, Yonik Seeley wrote:


On Dec 31, 2007 11:37 AM, Doron Cohen <[EMAIL PROTECTED]> wrote:


On Dec 31, 2007 6:10 PM, Yonik Seeley <[EMAIL PROTECTED]> wrote:

I think I like the 3'rd option - is this what you meant?


I meant (1)... it leaves the core smaller.
I don't see any reason to have logic to truncate or discard tokens in
the core indexing code (except to handle tokens >16k as an error
condition).


I would agree here, with the exception that I want the option for it  
to be treated as an error.  In some cases, I would be just as happy  
for it to silently ignore the token, or to log it.





Re: DocumentsWriter.checkMaxTermLength issues

2007-12-31 Thread Yonik Seeley
On Dec 31, 2007 11:59 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
>
> On Dec 31, 2007, at 11:44 AM, Yonik Seeley wrote:
> > I meant (1)... it leaves the core smaller.
> > I don't see any reason to have logic to truncate or discard tokens in
> > the core indexing code (except to handle tokens >16k as an error
> > condition).
>
> I would agree here, with the exception that I want the option for it
> to be treated as an error.

That should also be possible via an analyzer component throwing an exception.

-Yonik




Re: DocumentsWriter.checkMaxTermLength issues

2007-12-31 Thread Grant Ingersoll


On Dec 31, 2007, at 12:11 PM, Yonik Seeley wrote:


On Dec 31, 2007 11:59 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:


On Dec 31, 2007, at 11:44 AM, Yonik Seeley wrote:

I meant (1)... it leaves the core smaller.
I don't see any reason to have logic to truncate or discard tokens in
the core indexing code (except to handle tokens >16k as an error
condition).


I would agree here, with the exception that I want the option for it
to be treated as an error.


That should also be possible via an analyzer component throwing an  
exception.




Sure, but I mean in the >16K (in other words, in the case where  
DocsWriter fails, which presumably only DocsWriter knows about) case.   
I want the option to ignore tokens larger than that instead of failing/ 
throwing an exception.  Imagine I am charged w/ indexing some data  
that I don't know anything about (i.e. computer forensics), my goal  
would be to index as much as possible in my first raw pass, so that I  
can then begin to explore the dataset.  Having it completely discard  
the document is not a good thing, but throwing away some large binary  
tokens would be acceptable (especially if I get warnings about said  
tokens) and robust.


-Grant





Re: DocumentsWriter.checkMaxTermLength issues

2007-12-31 Thread Yonik Seeley
On Dec 31, 2007 12:25 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> Sure, but I mean in the >16K (in other words, in the case where
> DocsWriter fails, which presumably only DocsWriter knows about) case.
> I want the option to ignore tokens larger than that instead of failing/
> throwing an exception.

I think the issue here is what the default behavior for IndexWriter should be.

If configuration is required because something other than the default
is desired, then one could use a TokenFilter to change the behavior
rather than changing options on IndexWriter.  Using a TokenFilter is
much more flexible.

> Imagine I am charged w/ indexing some data
> that I don't know anything about (i.e. computer forensics), my goal
> would be to index as much as possible in my first raw pass, so that I
> can then begin to explore the dataset.  Having it completely discard
> the document is not a good thing, but throwing away some large binary
> tokens would be acceptable (especially if I get warnings about said
> tokens) and robust.

-Yonik




Re: DocumentsWriter.checkMaxTermLength issues

2007-12-31 Thread Michael McCandless
I actually think indexing should try to be as robust as possible.  You
could test like crazy and never hit a massive term, go into production
(say, ship your app to lots of your customer's computers) only to
suddenly see this exception.  In general it could be a long time before
you or your users "accidentally" hit this.

So I'm thinking we should have the default behavior, in IndexWriter,
be to skip immense terms?

Then people can use TokenFilter to change this behavior if they want.

Mike

Yonik Seeley <[EMAIL PROTECTED]> wrote:
> On Dec 31, 2007 12:25 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> > Sure, but I mean in the >16K (in other words, in the case where
> > DocsWriter fails, which presumably only DocsWriter knows about) case.
> > I want the option to ignore tokens larger than that instead of failing/
> > throwing an exception.
>
> I think the issue here is what the default behavior for IndexWriter should be.
>
> If configuration is required because something other than the default
> is desired, then one could use a TokenFilter to change the behavior
> rather than changing options on IndexWriter.  Using a TokenFilter is
> much more flexible.
>
> > Imagine I am charged w/ indexing some data
> > that I don't know anything about (i.e. computer forensics), my goal
> > would be to index as much as possible in my first raw pass, so that I
> > can then begin to explore the dataset.  Having it completely discard
> > the document is not a good thing, but throwing away some large binary
> > tokens would be acceptable (especially if I get warnings about said
> > tokens) and robust.
>
> -Yonik




Re: DocumentsWriter.checkMaxTermLength issues

2007-12-31 Thread Yonik Seeley
On Dec 31, 2007 12:54 PM, Michael McCandless <[EMAIL PROTECTED]> wrote:
> I actually think indexing should try to be as robust as possible.  You
> could test like crazy and never hit a massive term, go into production
> (say, ship your app to lots of your customer's computers) only to
> suddenly see this exception.  In general it could be a long time before
> you "accidentally" our users see this.
>
> So I'm thinking we should have the default behavior, in IndexWriter,
> be to skip immense terms?
>
> Then people can use TokenFilter to change this behavior if they want.

+1

-Yonik




[jira] Commented: (LUCENE-1112) Document is partially indexed on an unhandled exception

2007-12-31 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12555119
 ] 

Michael McCandless commented on LUCENE-1112:


Thanks Doron; I'll fold this in (though, I'll move it to the 
testExceptionFromTokenStream case since it looks like we're going to no longer 
throw an exception on hitting a wicked-long-term).

> Document is partially indexed on an unhandled exception
> ---
>
> Key: LUCENE-1112
> URL: https://issues.apache.org/jira/browse/LUCENE-1112
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 2.3
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.3
>
> Attachments: lucene-1112-test.patch
>
>
> With LUCENE-843, it's now possible for a subset of a document's
> fields/terms to be indexed or stored when an exception is hit.  This
> was not the case in the past (it was "all or none").
> I plan to make it "all or none" again by immediately marking a
> document as deleted if any exception is hit while indexing it.
> Discussion leading up to this:
>   http://www.gossamer-threads.com/lists/lucene/java-dev/56103

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





Re: DocumentsWriter.checkMaxTermLength issues

2007-12-31 Thread Grant Ingersoll


On Dec 31, 2007, at 12:54 PM, Michael McCandless wrote:


I actually think indexing should try to be as robust as possible.  You
could test like crazy and never hit a massive term, go into production
(say, ship your app to lots of your customer's computers) only to
suddenly see this exception.  In general it could be a long time before
you or your users "accidentally" hit this.

So I'm thinking we should have the default behavior, in IndexWriter,
be to skip immense terms?

Then people can use TokenFilter to change this behavior if they want.


+1.  We could log it, right?




Re: Let's release Lucene 2.3 soon?

2007-12-31 Thread Michael Busch
Michael McCandless wrote:
> I just opened a new issue, which I think should be fixed for 2.3, to
> fix IndexWriter.add/updateDocument to not "partially add" a document
> when an exception is hit:
> 
>   https://issues.apache.org/jira/browse/LUCENE-1112
> 
> I'll try to work out a patch by Thu but it may be tight...
> 

No rush! I can wait a couple more days before I create the branch.
Let's say next Monday (8th)?

-Michael




[jira] Created: (LUCENE-1114) contrib/Highlighter javadoc example needs to be updated

2007-12-31 Thread Grant Ingersoll (JIRA)
contrib/Highlighter javadoc example needs to be updated
---

 Key: LUCENE-1114
 URL: https://issues.apache.org/jira/browse/LUCENE-1114
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/*
Reporter: Grant Ingersoll
Priority: Trivial


The Javadoc package.html example code is outdated, as it still uses 
QueryParser.parse.  

http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/contrib-highlighter/index.html
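For reference, a hedged sketch of the non-deprecated usage the example would 
need (field name and class name here are illustrative):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class HighlighterQueryExample {
  // Instance-based parsing replaces the deprecated static QueryParser.parse(query, field, analyzer).
  public static Query buildQuery(String text) throws ParseException {
    QueryParser parser = new QueryParser("contents", new StandardAnalyzer());
    return parser.parse(text);
  }
}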

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Commented: (LUCENE-1114) contrib/Highlighter javadoc example needs to be updated

2007-12-31 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12555145
 ] 

Grant Ingersoll commented on LUCENE-1114:
-

It also only demonstrates using the Analyzer to get the tokenStream, and not 
term vectors (TokenSources).

> contrib/Highlighter javadoc example needs to be updated
> ---
>
> Key: LUCENE-1114
> URL: https://issues.apache.org/jira/browse/LUCENE-1114
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/*
>Reporter: Grant Ingersoll
>Priority: Trivial
>
> The Javadoc package.html example code is outdated, as it still uses 
> QueryParser.parse.  
> http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/contrib-highlighter/index.html

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





Re: DocumentsWriter.checkMaxTermLength issues

2007-12-31 Thread Michael McCandless

Grant Ingersoll wrote:



On Dec 31, 2007, at 12:54 PM, Michael McCandless wrote:

I actually think indexing should try to be as robust as possible.  You
could test like crazy and never hit a massive term, go into production
(say, ship your app to lots of your customer's computers) only to
suddenly see this exception.  In general it could be a long time before
you or your users "accidentally" hit this.

So I'm thinking we should have the default behavior, in IndexWriter,
be to skip immense terms?

Then people can use TokenFilter to change this behavior if they want.


+1.  We could log it, right?


Yes, to IndexWriter's infoStream, if it's set.  I'll do that...

Mike
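A minimal sketch of the application side, assuming the skipped-term message is 
reported via infoStream as described (setup details here are illustrative):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;

public class InfoStreamSetup {
  public static void main(String[] args) throws Exception {
    IndexWriter writer = new IndexWriter(new RAMDirectory(), new StandardAnalyzer(), true);
    // Diagnostics (including, per this thread, any skipped immense terms) go here.
    writer.setInfoStream(System.out);
    // ... addDocument(...) calls ...
    writer.close();
  }
}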




Build failed in Hudson: Lucene-Nightly #321

2007-12-31 Thread hudson
See http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/321/changes

--
[...truncated 866 lines...]
A contrib/db/bdb-je/src/java
A contrib/db/bdb-je/src/java/org
A contrib/db/bdb-je/src/java/org/apache
A contrib/db/bdb-je/src/java/org/apache/lucene
A contrib/db/bdb-je/src/java/org/apache/lucene/store
A contrib/db/bdb-je/src/java/org/apache/lucene/store/je
A contrib/db/bdb-je/src/java/org/apache/lucene/store/je/File.java
A contrib/db/bdb-je/src/java/org/apache/lucene/store/je/JEDirectory.java
A 
contrib/db/bdb-je/src/java/org/apache/lucene/store/je/JEIndexInput.java
A 
contrib/db/bdb-je/src/java/org/apache/lucene/store/je/JEIndexOutput.java
A contrib/db/bdb-je/src/java/org/apache/lucene/store/je/JELock.java
A contrib/db/bdb-je/src/java/org/apache/lucene/store/je/Block.java
A contrib/db/bdb-je/build.xml
A contrib/db/bdb
A contrib/db/bdb/pom.xml.template
A contrib/db/bdb/src
A contrib/db/bdb/src/test
A contrib/db/bdb/src/test/org
A contrib/db/bdb/src/test/org/apache
A contrib/db/bdb/src/test/org/apache/lucene
A contrib/db/bdb/src/test/org/apache/lucene/store
A contrib/db/bdb/src/test/org/apache/lucene/store/db
A contrib/db/bdb/src/test/org/apache/lucene/store/db/DbStoreTest.java
AU
contrib/db/bdb/src/test/org/apache/lucene/store/db/SanityLoadLibrary.java
A contrib/db/bdb/src/java
A contrib/db/bdb/src/java/org
A contrib/db/bdb/src/java/org/apache
A contrib/db/bdb/src/java/org/apache/lucene
A contrib/db/bdb/src/java/org/apache/lucene/store
A contrib/db/bdb/src/java/org/apache/lucene/store/db
AUcontrib/db/bdb/src/java/org/apache/lucene/store/db/File.java
AUcontrib/db/bdb/src/java/org/apache/lucene/store/db/Block.java
AUcontrib/db/bdb/src/java/org/apache/lucene/store/db/DbDirectory.java
A contrib/db/bdb/src/java/org/apache/lucene/store/db/DbIndexInput.java
A contrib/db/bdb/src/java/org/apache/lucene/store/db/DbIndexOutput.java
AUcontrib/db/bdb/src/java/org/apache/lucene/store/db/DbLock.java
A contrib/db/bdb/src/java/com
A contrib/db/bdb/src/java/com/sleepycat
A contrib/db/bdb/src/java/com/sleepycat/db
AUcontrib/db/bdb/src/java/com/sleepycat/db/DbHandleExtractor.java
AUcontrib/db/bdb/build.xml
AUcontrib/db/build.xml
A contrib/similarity
A contrib/similarity/pom.xml.template
A contrib/similarity/src
A contrib/similarity/src/java
A contrib/similarity/src/java/org
A contrib/similarity/src/java/org/apache
A contrib/similarity/src/java/org/apache/lucene
A contrib/similarity/src/java/org/apache/lucene/search
A contrib/similarity/src/java/org/apache/lucene/search/similar
AU
contrib/similarity/src/java/org/apache/lucene/search/similar/package.html
AUcontrib/similarity/README.txt
AUcontrib/similarity/.cvsignore
AUcontrib/similarity/build.xml
A contrib/swing
A contrib/swing/pom.xml.template
A contrib/swing/src
A contrib/swing/src/test
A contrib/swing/src/test/org
A contrib/swing/src/test/org/apache
A contrib/swing/src/test/org/apache/lucene
A contrib/swing/src/test/org/apache/lucene/swing
A contrib/swing/src/test/org/apache/lucene/swing/models
A 
contrib/swing/src/test/org/apache/lucene/swing/models/TestSearchingList.java
A 
contrib/swing/src/test/org/apache/lucene/swing/models/BaseTableModel.java
A 
contrib/swing/src/test/org/apache/lucene/swing/models/TestUpdatingTable.java
A 
contrib/swing/src/test/org/apache/lucene/swing/models/RestaurantInfo.java
A 
contrib/swing/src/test/org/apache/lucene/swing/models/TableSearcherSimulator.java
A contrib/swing/src/test/org/apache/lucene/swing/models/DataStore.java
A 
contrib/swing/src/test/org/apache/lucene/swing/models/BaseListModel.java
A 
contrib/swing/src/test/org/apache/lucene/swing/models/TestUpdatingList.java
A 
contrib/swing/src/test/org/apache/lucene/swing/models/ListSearcherSimulator.java
A 
contrib/swing/src/test/org/apache/lucene/swing/models/TestBasicTable.java
A 
contrib/swing/src/test/org/apache/lucene/swing/models/TestSearchingTable.java
A 
contrib/swing/src/test/org/apache/lucene/swing/models/TestBasicList.java
A contrib/swing/src/java
A contrib/swing/src/java/org
A contrib/swing/src/java/org/apache
A contrib/swing/src/java/org/apache/lucene
A contrib/swing/src/java/org/apache/lucene/swing
A contrib/swing/src/java/org/apache/lucene/swing/models
A 
contrib/swing/src/java/org/apache/lucene/swing/models/TableSearcher.java
A 
contrib/swing/src/java/org/apache/luc

Re: DocumentsWriter.checkMaxTermLength issues

2007-12-31 Thread Doron Cohen
On Dec 31, 2007 7:54 PM, Michael McCandless <[EMAIL PROTECTED]>
wrote:

> I actually think indexing should try to be as robust as possible.  You
> could test like crazy and never hit a massive term, go into production
> (say, ship your app to lots of your customer's computers) only to
> suddenly see this exception.  In general it could be a long time before
> you "accidentally" our users see this.
>
> So I'm thinking we should have the default behavior, in IndexWriter,
> be to skip immense terms?
>
> Then people can use TokenFilter to change this behavior if they want.
>

+1

At first I saw this as similar to IndexWriter.setMaxFieldLength(), but that was
a wrong comparison, because #terms is a "real" indexing/search
characteristic that many applications can benefit from being able
to modify, whereas a huge token is in most cases a bug.

Just to make sure about the scenario - the only change is to skip too-long
tokens, while any other exception is still thrown (not ignored).

Also, for a skipped token I think the position increment of the
following token should be incremented.