Michael McCandless wrote:
I actually think indexing should try to be as robust as possible. You
could test like crazy and never hit a massive term, go into production
(say, ship your app to lots of your customers' computers) only to
suddenly see this exception. In general it could be a long time before
you "accidentally" […]
On Dec 31, 2007 12:25 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> Sure, but I mean in the >16K case (in other words, in the case where
> DocsWriter fails, which presumably only DocsWriter knows about).
> I want the option to ignore tokens larger than that instead of
> failing/throwing an exception. […]
On Dec 31, 2007 11:59 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
>
> On Dec 31, 2007, at 11:44 AM, Yonik Seeley wrote:
> > I meant (1)... it leaves the core smaller.
> > I don't see any reason to have logic to truncate or discard tokens in
> > the core indexing code (except to handle tokens >16K) […]
On Dec 31, 2007, at 11:44 AM, Yonik Seeley wrote:
> On Dec 31, 2007 11:37 AM, Doron Cohen <[EMAIL PROTECTED]> wrote:
> > I think I like the 3rd option - is this what you meant?
> I meant (1)... it leaves the core smaller.
> I don't see […]
Doron Cohen <[EMAIL PROTECTED]> wrote:
> I like the approach of configuration of this behavior in Analysis
> (and so IndexWriter can throw an exception on such errors).
>
> It seems that this should be a property of Analyzer vs.
> just StandardAnalyzer, right?
>
> It can probably be a "policy" property, with two parameters:
> 1) maxLength, 2) action: […]
I think this is a good approach -- any objections?
This way, IndexWriter is in-your-face (throws TermTooLongException on
seeing a massive term), but StandardAnalyzer is robust (silently
skips or prefixes the too-long terms).
Mike
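[Editor's note: Doron's "policy" property above could be sketched roughly as follows in plain Java. All names and shapes here are hypothetical illustrations, not the actual Lucene API; the token stream is modeled as a simple list of strings.]

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of the "policy" idea: a maxLength plus an action to take when a
 * token exceeds it. Hypothetical names -- not the actual Lucene API.
 */
public class TokenLengthPolicy {

    public enum Action { SKIP, TRUNCATE, THROW }

    /** Unchecked, mirroring the TermTooLongException idea in the thread. */
    public static class TermTooLongException extends RuntimeException {
        public TermTooLongException(int length, int max) {
            super("term of length " + length + " exceeds max term length " + max);
        }
    }

    /** Apply the policy to a token stream, modeled here as a list of strings. */
    public static List<String> apply(List<String> tokens, int maxLength, Action action) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (token.length() <= maxLength) {
                out.add(token);                          // normal token: pass through
            } else if (action == Action.TRUNCATE) {
                out.add(token.substring(0, maxLength));  // keep only the prefix
            } else if (action == Action.THROW) {
                throw new TermTooLongException(token.length(), maxLength);
            }
            // Action.SKIP: silently drop the over-long token
        }
        return out;
    }
}
```

Here SKIP corresponds to the "robust StandardAnalyzer" behavior, TRUNCATE to the prefix option, and THROW to the in-your-face IndexWriter behavior described above.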
Gabi Steinberg wrote:
How about defaulting to a max token size of 16K in StandardTokenizer, so
that it never causes an IndexWriter exception, with an option to reduce
that size?
The backward incompatibility is limited then - tokens exceeding 16K will
NOT cause an IndexWriter exception. In 3.0 we can reduce that default […]
OK I will take this approach... create TermTooLongException
(subclasses RuntimeException), listed in the javadocs but not the
throws clause of add/updateDocument. DW throws this if it encounters
any term >= 16383 chars in length.
Whenever that exception (or others) is thrown from within […]
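[Editor's note: a minimal sketch of that proposal -- hypothetical shape, not the committed Lucene code. Being a RuntimeException, it need not appear in the throws clause of add/updateDocument, yet callers can still catch it; the static helper models the per-term check DocumentsWriter ("DW") would run.]

```java
/**
 * Sketch of the proposed exception: unchecked, documented in javadocs but
 * deliberately absent from throws clauses. Hypothetical, not Lucene's code.
 */
public class TermTooLongException extends RuntimeException {

    /** Per the thread: any term of 16383 chars or more is rejected. */
    public static final int MAX_TERM_LENGTH = 16383;

    public TermTooLongException(int actualLength) {
        super("term of length " + actualLength
                + " exceeds max term length " + MAX_TERM_LENGTH);
    }

    /** The per-term check DocumentsWriter would perform while indexing. */
    public static void checkMaxTermLength(CharSequence term) {
        if (term.length() >= MAX_TERM_LENGTH) {
            throw new TermTooLongException(term.length());
        }
    }
}
```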
On Dec 20, 2007 2:25 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> Makes sense. I wasn't sure if declaring new exceptions to be thrown
> is violating back-compat. issues or not (even if they are runtime
> exceptions).
That's a good question... I know that declared RuntimeExceptions are
contained […]
Makes sense. I wasn't sure if declaring new exceptions to be thrown
is violating back-compat. issues or not (even if they are runtime
exceptions)
On Dec 20, 2007 1:36 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> But, I can see the value in the throw the exception
> case too, except I think the API should declare the exception is being
> thrown. It could throw an extension of IOException.
To be robust, user indexing code needs to catch […]
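[Editor's note: the defensive shape Yonik alludes to -- catching per-document RuntimeExceptions and continuing -- might look like this. The addDocument stand-in below is hypothetical, only there to make the loop runnable; it is not the Lucene API.]

```java
import java.util.List;

/**
 * Sketch of a robust indexing loop: catch per-document RuntimeExceptions
 * (from analysis components or the writer) and keep indexing the rest.
 */
public class RobustIndexingLoop {

    /** Hypothetical stand-in for IndexWriter.addDocument; rejects "long" docs. */
    static void addDocument(String doc) {
        if (doc.length() > 10) {               // toy stand-in for the 16K term limit
            throw new RuntimeException("term too long: " + doc.length());
        }
    }

    /** Index what we can; return how many documents failed. */
    public static int indexAll(List<String> docs) {
        int failures = 0;
        for (String doc : docs) {
            try {
                addDocument(doc);
            } catch (RuntimeException e) {     // skip the bad doc, keep going
                failures++;
            }
        }
        return failures;
    }
}
```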
Gabi Steinberg wrote:
On balance, I think that dropping the document makes sense. I think
Yonik is right in that ensuring that keys are useful - and indexable -
is the tokenizer's job.
StandardTokenizer, in my opinion, should behave similarly to a person
looking at a document and deciding which tokens should be in […]
On Dec 20, 2007 11:33 AM, Gabi Steinberg <[EMAIL PROTECTED]> wrote:
> It might be a bit harsh to drop the document if it has a very long token
> in it.
There are really two issues here.
For long tokens, one could either ignore them or generate an exception.
For all exceptions generated while indexing […]
Gabi Steinberg wrote:
It might be a bit harsh to drop the document if it has a very long token
in it. I can imagine documents with embedded binary data, where the
text around the binary data is still useful for search.
My feeling is that long tokens (longer than 128 or 256 bytes) are not
useful for search, and should […]
On Dec 20, 2007 11:15 AM, Michael McCandless <[EMAIL PROTECTED]> wrote:
> Though ... we could simply immediately delete the document when any
> exception occurs during its processing. So if we think whenever any
> doc hits an exception, then it should be deleted, it's not so hard to
> implement that […]
On Dec 20, 2007 9:41 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> I'm wondering if the IndexWriter should throw an explicit exception in
> this case as opposed to a RuntimeException, as it seems to me really
> long tokens should be handled more gracefully. It seems strange that
> the message says the terms were skipped (which the code does in fact
> do), but then there is a RuntimeException thrown which usually
> indicates to me the issue is not recoverable.
RuntimeExceptions can happen in analysis components during indexing
anyway, so it seems like indexing code should […]
Grant Ingersoll wrote:
I am getting the following exception when running against trunk:

java.lang.IllegalArgumentException: at least one term (length 20079)
exceeds max term length 16383; these terms were skipped
        at org.apache.lucene.index.IndexWriter.checkMaxTermLength(IndexWriter.java:1545)
        at org.apac[…]