RE: jaspq: dashed numerical values tokenized differently

2004-11-03 Thread Daniel Taurat


Best regards
Dr. Daniel Taurat
Senior Consultant

VIP ENTERPRISE 8 | THE POWER OF CONTENT AT WORK


Gauss Interprise AG  Phone:+49-40-3250-1508
Weidestr. 120 a Mobile:+49-173-2418472
D-22083 Hamburg  Germany Fax:+49-40-3250-191508

E-Mail: [EMAIL PROTECTED]
Web:http://www.gaussvip.com


> -----Original Message-----
> From: Erik Hatcher [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, 3 November 2004 16:49
> To: Lucene Users List
> Subject: Re: jaspq: dashed numerical values tokenized differently
> 
> On Nov 3, 2004, at 10:21 AM, Daniel Taurat wrote:
> > Checked with Luke on the string
> > dash\-123\-01
> >
> > and got
> >
> > dash
> > 123
> > 01
> >
> > with germanAnalyzer and standardAnalyzer
> > and
> >
> > dash
> >
> > with all the other, except for whitespaceAnalyser, of course.
> >
> >
> > This makes me think that an escaped dash is never a minus, somehow.
> 
> No built-in Analyzer considers backslash an escape character - and most
> consider it a delimiter between tokens and throw it away, as you've
> seen.  Only QueryParser has the escape character feature.
> 
>   Erik

Okay, that I understand...

But then, where do the dashes, I mean the minuses (**sigh**), anyway, where do they go?

-123 becomes 123 for some analyzers (German and Standard), is discarded completely by others
(Russian, Simple, Stop), and WhitespaceAnalyzer does its own thing again
(-123).

Ahhahh!! Now I've got it. Since

-123go

becomes

go for Russian, Stop and Simple, but
123go for German and Standard,

I guess the first group omits numbers completely, so digits effectively act as separators
(I checked that as well), while the second group only drops the leading minus (dash?).
The grouping is caused by inheritance.
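
The grouping described above can be reproduced with a minimal Python sketch. It is only a model of the observations in this message, not Lucene code: the three function names are hypothetical stand-ins for "letters only" analyzers (Russian, Stop, Simple), "letters and digits" analyzers (German, Standard) and WhitespaceAnalyzer.

```python
import re

# Hypothetical stand-ins for the three analyzer groups; not Lucene classes.
def letters_only(text):
    # Keep runs of letters; digits and punctuation act as separators.
    return re.findall(r"[^\W\d_]+", text)

def letters_and_digits(text):
    # Keep runs of letters and digits; punctuation such as '-' is dropped.
    return re.findall(r"[^\W_]+", text)

def whitespace(text):
    # Split on whitespace only, so '-' stays inside the token.
    return text.split()

for sample in ["-123", "-123go"]:
    print(sample, letters_only(sample), letters_and_digits(sample), whitespace(sample))
```

Under these assumptions "-123go" comes out as go, 123go and -123go respectively, which matches what Luke showed.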

Daniel


 
> 
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]





Re: jaspq: dashed numerical values tokenized differently

2004-11-03 Thread Erik Hatcher
On Nov 3, 2004, at 10:21 AM, Daniel Taurat wrote:
> Checked with Luke on the string
> dash\-123\-01
>
> and got
>
> dash
> 123
> 01
>
> with germanAnalyzer and standardAnalyzer
> and
>
> dash
>
> with all the other, except for whitespaceAnalyser, of course.
>
> This makes me think that an escaped dash is never a minus, somehow.

No built-in Analyzer considers backslash an escape character - and most 
consider it a delimiter between tokens and throw it away, as you've 
seen.  Only QueryParser has the escape character feature.
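
To illustrate the difference, here is a small Python sketch. Both functions are hypothetical models, not Lucene APIs: `unescape` assumes that query-syntax escaping simply makes the character after a backslash literal, and `split_on_non_alphanumerics` assumes an analyzer that treats every non-alphanumeric character, including the backslash, as a delimiter.

```python
import re

def unescape(query_text):
    # Model of query-syntax escaping: a backslash makes the next character literal.
    return re.sub(r"\\(.)", r"\1", query_text)

def split_on_non_alphanumerics(text):
    # An analyzer sees the raw text, so the backslash is just another delimiter.
    return re.findall(r"[^\W_]+", text)

raw = r"dash\-123\-01"
print(unescape(raw))                    # what a query parser would hand on
print(split_on_non_alphanumerics(raw))  # what the analyzer makes of the raw string
```

Feeding the escaped string straight into Luke's analyzer view therefore tests the second function only; the escape is never interpreted.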

Erik


RE: jaspq: dashed numerical values tokenized differently

2004-11-03 Thread Daniel Taurat


> Give me an example of a string and how you'd like it to be tokenized.
> But first, give the AnalyzerUtils (from my java.net article) a try and
> get a feel for what different analyzers do.
> 
> Keep in mind that it can be tricky (see the AnalysisParalysis page on
> the wiki and my java.net article on QueryParser) to make sense out of a
> combination of QueryParser and an Analyzer - so it's best to work with
> them independently to get what you want and then put things together.

I already used Luke. 
This is what I found (it even makes sense to me :)))
The string dash-123-01
was tokenized by the 1.2 StandardAnalyzer as
dash
123
01

and is tokenized (1.4RC4) by every analyzer other than RussianAnalyzer,
SimpleAnalyzer and StopAnalyzer (which just got dash and omitted all
numbers) as

dash-123-01

On the other hand

dash-my-string

is tokenized 

dash
my
string

by all of them except WhitespaceAnalyzer, of course.

I guess this is what happens: numerical components turn the meaning of
the preceding dash into a minus. With that, the dash is part of the token
containing the digits and no longer a separator. This holds even for mixed
terms like 123a-01. So -1andAnyOtherCharacters-evenWithDashes is a
non-separable numerical expression for Lucene.
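
A rough Python model of this guess, based only on the Luke output quoted in this thread. The function name `standard_like` is hypothetical and the rule "any digit keeps the dashed run together" is an assumption; the real StandardAnalyzer grammar is more detailed.

```python
def standard_like(text):
    """Rough model of the 1.4 behaviour reported above, not the real grammar."""
    tokens = []
    for chunk in text.split():
        # Strip dashes from both ends, so a leading '-' never survives.
        chunk = chunk.strip("-")
        if not chunk:
            continue
        parts = [p for p in chunk.split("-") if p]
        if len(parts) > 1 and any(ch.isdigit() for ch in chunk):
            tokens.append("-".join(parts))  # digits present: keep as one token
        else:
            tokens.extend(parts)            # letters only: the dash separates
    return tokens

for sample in ["dash-123-01", "dash-my-string", "dash-anystring123", "-123"]:
    print(sample, standard_like(sample))
```

The model handles dashes only; other punctuation (such as the backslash) is outside its scope.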

Checked with Luke on the string
dash\-123\-01 

and got

dash
123
01

with germanAnalyzer and standardAnalyzer
and

dash

with all the others, except for WhitespaceAnalyzer, of course.


This makes me think that an escaped dash is never a minus, somehow.

Daniel








Re: jaspq: dashed numerical values tokenized differently

2004-11-03 Thread Erik Hatcher
On Nov 3, 2004, at 8:51 AM, Daniel Taurat wrote:
> Now my only question is, why the tokenizing works differently for
> strings with numerical components, or if there is a way to make the
> standardAnalyzer treat those dashed mixed-characters strings similar to
> plain letter-strings.

Give me an example of a string and how you'd like it to be tokenized.  
But first, give the AnalyzerUtils (from my java.net article) a try and 
get a feel for what different analyzers do.

Keep in mind that it can be tricky (see the AnalysisParalysis page on 
the wiki and my java.net article on QueryParser) to make sense out of a 
combination of QueryParser and an Analyzer - so it's best to work with 
them independently to get what you want and then put things together.

Erik


RE: jaspq: dashed numerical values tokenized differently

2004-11-03 Thread Daniel Taurat




> -----Original Message-----
> From: Erik Hatcher [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, 3 November 2004 13:39
> To: Lucene Users List
> Subject: Re: jaspq: dashed numerical values tokenized differently
> 
> 
> On Nov 3, 2004, at 5:03 AM, Daniel Taurat wrote:
> >> Query parser was changed to treat '-' within words as part of the
> >> word.
> >> Before that change a query 'dash-test' was parsed as 'dash AND NOT
> >> test'.
> >> Now QP reads one word 'dash-test' which is analyzed. If the analyzer
> >> splits that to more than one token (standard analyzer does) a phrase
> >> query is created.
> >> The difference you see comes from standard analyzer which tokenizes
> >> dash-test dash-123 to tokens dash, test and dash-123.
> >> Prefix queries aren't analyzed.
> >
> > So you say that dash-123 is a prefix query whereas dash-test is not?
> > I found also (with Luke) that dash-anystring123 is not tokenized as
> > well.
> > What exactly are the criteria for Lucene to decide what a prefix is and
> > what not?
> 
> Anything that ends with an asterisk is parsed as a PrefixQuery, as long
> as it does not have other wildcard characters.  If it has other
> wildcard characters or the asterisk is not at the end, then it is
> parsed as a WildcardQuery.
> 
>   Erik
> 


Okay, got that. 
Now my only question is, why the tokenizing works differently for
strings with numerical components, or if there is a way to make the
standardAnalyzer treat those dashed mixed-characters strings similar to
plain letter-strings.

Daniel




Re: jaspq: dashed numerical values tokenized differently

2004-11-03 Thread Erik Hatcher
On Nov 3, 2004, at 5:03 AM, Daniel Taurat wrote:
>> Query parser was changed to treat '-' within words as part of the
>> word.
>> Before that change a query 'dash-test' was parsed as 'dash AND NOT
>> test'.
>> Now QP reads one word 'dash-test' which is analyzed. If the analyzer
>> splits that to more than one token (standard analyzer does) a phrase
>> query is created.
>> The difference you see comes from standard analyzer which tokenizes
>> dash-test dash-123 to tokens dash, test and dash-123.
>> Prefix queries aren't analyzed.
>
> So you say that dash-123 is a prefix query whereas dash-test is not?
> I found also (with Luke) that dash-anystring123 is not tokenized as
> well.
> What exactly are the criteria for Lucene to decide what a prefix is and
> what not?

Anything that ends with an asterisk is parsed as a PrefixQuery, as long 
as it does not have other wildcard characters.  If it has other 
wildcard characters or the asterisk is not at the end, then it is 
parsed as a WildcardQuery.
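
The rule can be written down as a short Python sketch. The function `classify_term` is a hypothetical illustration of the rule as stated here, not QueryParser code, and it makes two assumptions: the wildcard characters are `*` and `?`, and escaped wildcards are ignored.

```python
def classify_term(term):
    """Apply the stated rule to a single query term (escapes are ignored)."""
    wildcards = sum(term.count(ch) for ch in "*?")
    if wildcards == 0:
        return "analyzed term"   # no wildcard: the term goes through the analyzer
    if term.endswith("*") and wildcards == 1:
        return "PrefixQuery"     # exactly one wildcard, a trailing asterisk
    return "WildcardQuery"       # any other wildcard placement

for term in ["dash-test", "dash*", "dash-t*", "da*h", "da?h*"]:
    print(term, "->", classify_term(term))
```

So dash-123 and dash-test are both ordinary terms; only dash*, dash-* and dash-t* from the original examples are prefix queries.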

Erik


RE: jaspq: dashed numerical values tokenized differently

2004-11-03 Thread Daniel Taurat


-----Original Message-----
From: Morus Walter [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, 2 November 2004 09:21
To: Lucene Users List
Subject: Re: jaspq: dashed numerical values tokenized differently 

>Daniel Taurat writes:
>> Hi,
>> I have just another stupid parser question:
>> There seems to be a special handling of the dash sign "-" different from
>> Lucene 1.2 at least in Lucene 1.4.RC3
>> StandardAnalyzer.
>> 
>> Examples (1.4RC3):
>> 
>> A document containing the string "dash-test" is matched by the following
>> search expressions:
>> dash
>> test
>> dash*
>> dash-test
>> It is _not_ matched by the following search expressions:
>> dash-*
>> dash-t*
>> 
>> If the string after the dash consists of digits, the behavior is
>> different.
>> E.g., a document containing the string "dash-123" is matched by:
>> dash*
>> dash-*
>> dash-123
>> It is not matched by:
>> dash
>> 123
>> 
>> Question:
>> Is this, esp. the different behavior when parsing digits and characters,
>> intentional and how can it be explained?
>> Regards,
>> 
>Query parser was changed to treat '-' within words as part of the word.
>Before that change a query 'dash-test' was parsed as 'dash AND NOT test'.
>Now QP reads one word 'dash-test' which is analyzed. If the analyzer
>splits that to more than one token (standard analyzer does) a phrase
>query is created.
>The difference you see comes from standard analyzer which tokenizes
>dash-test dash-123 to tokens dash, test and dash-123.
>Prefix queries aren't analyzed.



So you say that dash-123 is a prefix query whereas dash-test is not?
I found also (with Luke) that dash-anystring123 is not tokenized as
well.
What exactly are the criteria for Lucene to decide what a prefix is and
what not?

Daniel 





Re: jaspq: dashed numerical values tokenized differently

2004-11-02 Thread Morus Walter
Daniel Taurat writes:
> Hi,
> I have just another stupid parser question:
> There seems to be a special handling of the dash sign "-" different from
> Lucene 1.2 at least in Lucene 1.4.RC3
> StandardAnalyzer.
> 
> Examples (1.4RC3):
> 
> A document containing the string "dash-test" is matched by the following
> search expressions:
> dash
> test
> dash*
> dash-test
> It is _not_ matched by the following search expressions:
> dash-*
> dash-t*
> 
> If the string after the dash consists of digits, the behavior is
> different.
> E.g., a document containing the string "dash-123" is matched by:
> dash*
> dash-*
> dash-123
> It is not matched by:
> dash
> 123
> 
> Question:
> Is this, esp. the different behavior when parsing digits and characters,
> intentional and how can it be explained?
> Regards,
> 
Query parser was changed to treat '-' within words as part of the word.
Before that change a query 'dash-test' was parsed as 'dash AND NOT test'.
Now QP reads one word 'dash-test' which is analyzed. If the analyzer
splits that to more than one token (standard analyzer does) a phrase
query is created.
The difference you see comes from standard analyzer which tokenizes
dash-test dash-123 to tokens dash, test and dash-123.
Prefix queries aren't analyzed.
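
A minimal Python sketch of why the prefix queries behave this way. The index contents are taken from the token lists above; the document ids and the function `prefix_matches` are hypothetical, and the matching is a plain string-prefix test standing in for an unanalyzed prefix query.

```python
# Indexed terms as described above (document ids are made up for illustration).
INDEX = {
    "doc1": ["dash", "test"],   # from the text "dash-test"
    "doc2": ["dash-123"],       # from the text "dash-123"
}

def prefix_matches(query):
    """Prefix queries skip analysis: compare the raw prefix with indexed terms."""
    prefix = query.rstrip("*")  # assumes the query ends with the asterisk
    return sorted(doc for doc, terms in INDEX.items()
                  if any(term.startswith(prefix) for term in terms))

for q in ["dash*", "dash-*", "dash-t*"]:
    print(q, prefix_matches(q))
```

No indexed term of the "dash-test" document starts with "dash-", so dash-* and dash-t* cannot match it, while the single term dash-123 does start with that prefix.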

Morus




Re: jaspq: dashed numerical values tokenized differently

2004-11-01 Thread sergiu gordea
Daniel Taurat wrote:
> Hi,
> I have just another stupid parser question:
> There seems to be a special handling of the dash sign "-" different from
> Lucene 1.2 at least in Lucene 1.4.RC3
> StandardAnalyzer.
 

From the behaviour you describe, I think the dash sign is removed 
from the text by the analyzer.
This is quite correct, because the dash is used to separate two words. 
If it were not eliminated, you would not be able to
get "dash-test" in the results when you search for dash and/or test.

I suggest you use Luke (see the contributors page) to see what 
exactly you have in the index; then you will understand
why search works like that.

Sergiu

> Examples (1.4RC3):
>
> A document containing the string "dash-test" is matched by the following
> search expressions:
> dash
> test
> dash*
> dash-test
> It is _not_ matched by the following search expressions:
> dash-*
> dash-t*
>
> If the string after the dash consists of digits, the behavior is
> different.
> E.g., a document containing the string "dash-123" is matched by:
> dash*
> dash-*
> dash-123
> It is not matched by:
> dash
> 123
>
> Question:
> Is this, esp. the different behavior when parsing digits and characters,
> intentional and how can it be explained?
> Regards,
> Daniel




jaspq: dashed numerical values tokenized differently

2004-11-01 Thread Daniel Taurat
Hi,
I have just another stupid parser question:
There seems to be a special handling of the dash sign "-" different from
Lucene 1.2 at least in Lucene 1.4.RC3
StandardAnalyzer.

Examples (1.4RC3):

A document containing the string "dash-test" is matched by the following
search expressions:
dash
test
dash*
dash-test
It is _not_ matched by the following search expressions:
dash-*
dash-t*

If the string after the dash consists of digits, the behavior is
different.
E.g., a document containing the string "dash-123" is matched by:
dash*
dash-*
dash-123
It is not matched by:
dash
123

Question:
Is this, esp. the different behavior when parsing digits and characters,
intentional and how can it be explained?
Regards,

Daniel




