RE: jaspq: dashed numerical values tokenized differently
Kind regards,

Dr. Daniel Taurat
Senior Consultant
VIP ENTERPRISE 8 | THE POWER OF CONTENT AT WORK
Gauss Interprise AG
Weidestr. 120 a
D-22083 Hamburg
Germany
Phone: +49-40-3250-1508
Mobile: +49-173-2418472
Fax: +49-40-3250-191508
E-Mail: [EMAIL PROTECTED]
Web: http://www.gaussvip.com

> -----Original Message-----
> From: Erik Hatcher [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, 3 November 2004 16:49
> To: Lucene Users List
> Subject: Re: jaspq: dashed numerical values tokenized differently
>
> On Nov 3, 2004, at 10:21 AM, Daniel Taurat wrote:
> > Checked with Luke on the string
> > dash\-123\-01
> >
> > and got
> >
> > dash
> > 123
> > 01
> >
> > with GermanAnalyzer and StandardAnalyzer
> > and
> >
> > dash
> >
> > with all the others, except for WhitespaceAnalyzer, of course.
> >
> > This makes me think that an escaped dash is never a minus, somehow.
>
> No built-in Analyzer considers backslash an escape character - and most
> consider it a delimiter between tokens and throw it away, as you've
> seen. Only QueryParser has the escape character feature.
>
> Erik

Okay, that I understand...
But then, where do the dashes, I mean the minuses (**sigh**), anyway, where do they go?

-123 becomes 123 for some analyzers (German and Standard), is discarded completely by others (Russian, Simple, Stop), and Whitespace does its own thing again (-123).

Ahhahh!! Now I've got it: since -123go becomes go for Russian, Stop and Simple, but 123go for German and Standard, I guess the first group omits digits completely, effectively treating them as separators (I checked that as well), while the second group only drops the leading minus (dash?).
The grouping is caused by inheritance.

Daniel

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
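The grouping Daniel describes can be illustrated with a small model. This is only a sketch in Python, not Lucene code: `letter_tokens` and `whitespace_tokens` are hypothetical helpers. It assumes that the Simple/Stop/Russian group behaves like a tokenizer that keeps only runs of letters, and that WhitespaceAnalyzer splits on whitespace and nothing else. Stop-word removal is not modelled.

```python
def letter_tokens(text):
    """Keep only runs of letters, lowercased; digits, dashes and
    backslashes all act as separators (assumed behaviour)."""
    tokens, current = [], []
    for ch in text:
        if ch.isalpha():
            current.append(ch.lower())
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


def whitespace_tokens(text):
    """Split on whitespace only; everything else stays in the token."""
    return text.split()


# The strings discussed in the message above.
for sample in ["-123", "-123go", "dash\\-123\\-01"]:
    print(sample, letter_tokens(sample), whitespace_tokens(sample))
```

Under these assumptions -123 yields no token at all, -123go yields go, and dash\-123\-01 yields only dash, which is consistent with what Luke showed for that group.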
Re: jaspq: dashed numerical values tokenized differently
On Nov 3, 2004, at 10:21 AM, Daniel Taurat wrote:
> Checked with Luke on the string
> dash\-123\-01
>
> and got
>
> dash
> 123
> 01
>
> with GermanAnalyzer and StandardAnalyzer
> and
>
> dash
>
> with all the others, except for WhitespaceAnalyzer, of course.
>
> This makes me think that an escaped dash is never a minus, somehow.

No built-in Analyzer considers backslash an escape character - and most consider it a delimiter between tokens and throw it away, as you've seen. Only QueryParser has the escape character feature.

Erik
RE: jaspq: dashed numerical values tokenized differently
> Give me an example of a string and how you'd like it to be tokenized.
> But first, give the AnalyzerUtils (from my java.net article) a try and
> get a feel for what different analyzers do.
>
> Keep in mind that it can be tricky (see the AnalysisParalysis page on
> the wiki and my java.net article on QueryParser) to make sense out of a
> combination of QueryParser and an Analyzer - so it's best to work with
> them independently to get what you want and then put things together.

I already used Luke. This is what I found (it even makes sense to me :)))

The string

dash-123-01

was tokenized by the 1.2 StandardAnalyzer as

dash
123
01

and is tokenized in 1.4RC4 as

dash-123-01

by every analyzer other than RussianAnalyzer, SimpleAnalyzer and StopAnalyzer (which just produced dash and omitted all numbers).

On the other hand,

dash-my-string

is tokenized as

dash
my
string

by all of them except WhitespaceAnalyzer, of course.

I guess this is what happens: numerical components turn the meaning of the preceding dash into a minus. With that, the dash becomes part of the token that contains the digits and is no longer a separator. This holds even for mixed terms like 123a-01. So -1andAnyOtherCharacters-evenWithDashes is a non-separable numerical expression for Lucene.

Checked with Luke on the string

dash\-123\-01

and got

dash
123
01

with GermanAnalyzer and StandardAnalyzer
and

dash

with all the others, except for WhitespaceAnalyzer, of course.

This makes me think that an escaped dash is never a minus, somehow.

Daniel
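Daniel's hypothesis for the 1.4 behaviour can be written down as a toy tokenizer. This is a rough sketch in Python that only reproduces the observations reported in this thread; it is not StandardTokenizer's grammar. The function name `standard_like_tokens` and the `JOINERS` set are assumptions made for illustration.

```python
JOINERS = "-_/.,"  # assumed punctuation that may glue parts of a token together


def standard_like_tokens(text):
    """Toy model: a dash-joined chunk stays whole if it contains a digit,
    otherwise it is split at the joiners. Leading and trailing joiners are
    dropped, and any other character (space, backslash) is a delimiter."""
    # Step 1: cut the text at every character that is neither
    # alphanumeric nor one of the joiners.
    chunks, current = [], []
    for ch in text:
        if ch.isalnum() or ch in JOINERS:
            current.append(ch)
        elif current:
            chunks.append("".join(current))
            current = []
    if current:
        chunks.append("".join(current))

    # Step 2: decide per chunk whether the joiners separate or bind.
    tokens = []
    for chunk in chunks:
        chunk = chunk.strip(JOINERS).lower()
        if not chunk:
            continue
        if any(c.isdigit() for c in chunk):
            tokens.append(chunk)  # numeric component: keep as one token
        else:
            for joiner in JOINERS:
                chunk = chunk.replace(joiner, " ")
            tokens.extend(chunk.split())
    return tokens


for sample in ["dash-123-01", "dash-my-string", "123a-01", "dash\\-123\\-01"]:
    print(sample, standard_like_tokens(sample))
```

In this model the backslash in dash\-123\-01 cuts the string into dash, -123 and -01 before the digit rule is applied, so each dash ends up leading a chunk and is stripped. That would explain why the escaped form comes out as dash, 123, 01 while the unescaped form stays whole.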
Re: jaspq: dashed numerical values tokenized differently
On Nov 3, 2004, at 8:51 AM, Daniel Taurat wrote:
> Now my only question is why tokenizing works differently for
> strings with numerical components, and whether there is a way to make the
> StandardAnalyzer treat those dashed mixed-character strings like
> plain letter strings.

Give me an example of a string and how you'd like it to be tokenized. But first, give the AnalyzerUtils (from my java.net article) a try and get a feel for what different analyzers do.

Keep in mind that it can be tricky (see the AnalysisParalysis page on the wiki and my java.net article on QueryParser) to make sense out of a combination of QueryParser and an Analyzer - so it's best to work with them independently to get what you want and then put things together.

Erik
RE: jaspq: dashed numerical values tokenized differently
> -----Original Message-----
> From: Erik Hatcher [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, 3 November 2004 13:39
> To: Lucene Users List
> Subject: Re: jaspq: dashed numerical values tokenized differently
>
> On Nov 3, 2004, at 5:03 AM, Daniel Taurat wrote:
> >> Query parser was changed to treat '-' within words as part of the
> >> word.
> >> Before that change a query 'dash-test' was parsed as 'dash AND NOT
> >> test'.
> >> Now QP reads one word 'dash-test' which is analyzed. If the analyzer
> >> splits that to more than one token (standard analyzer does) a phrase
> >> query is created.
> >> The difference you see comes from standard analyzer which tokenizes
> >> dash-test dash-123 to tokens dash, test and dash-123.
> >> Prefix queries aren't analyzed.
> >
> > So you say that dash-123 is a prefix query whereas dash-test is not?
> > I found also (with Luke) that dash-anystring123 is not tokenized as
> > well.
> > What exactly are the criteria for Lucene to decide what a prefix is and
> > what not?
>
> Anything that ends with an asterisk is parsed as a PrefixQuery, as long
> as it does not have other wildcard characters. If it has other
> wildcard characters or the asterisk is not at the end, then it is
> parsed as a WildcardQuery.
>
> Erik

Okay, got that.
Now my only question is why tokenizing works differently for strings with numerical components, and whether there is a way to make the StandardAnalyzer treat those dashed mixed-character strings like plain letter strings.

Daniel
Re: jaspq: dashed numerical values tokenized differently
On Nov 3, 2004, at 5:03 AM, Daniel Taurat wrote:
>> Query parser was changed to treat '-' within words as part of the
>> word.
>> Before that change a query 'dash-test' was parsed as 'dash AND NOT
>> test'.
>> Now QP reads one word 'dash-test' which is analyzed. If the analyzer
>> splits that to more than one token (standard analyzer does) a phrase
>> query is created.
>> The difference you see comes from standard analyzer which tokenizes
>> dash-test dash-123 to tokens dash, test and dash-123.
>> Prefix queries aren't analyzed.
>
> So you say that dash-123 is a prefix query whereas dash-test is not?
> I found also (with Luke) that dash-anystring123 is not tokenized as
> well.
> What exactly are the criteria for Lucene to decide what a prefix is and
> what not?

Anything that ends with an asterisk is parsed as a PrefixQuery, as long as it does not have other wildcard characters. If it has other wildcard characters or the asterisk is not at the end, then it is parsed as a WildcardQuery.

Erik
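Erik's rule can be restated as a short decision function. This is a minimal sketch in Python, not QueryParser's implementation: `classify_term` is a hypothetical name, the wildcard characters are taken to be `*` and `?`, and escaping, leading wildcards and all other query syntax are ignored.

```python
def classify_term(term):
    """Classify a single query term according to the rule stated above."""
    has_question_mark = "?" in term
    star_count = term.count("*")

    if star_count == 0 and not has_question_mark:
        # No wildcard at all: the term is passed to the analyzer.
        return "analyzed"
    if star_count == 1 and term.endswith("*") and not has_question_mark:
        # Exactly one asterisk, at the end, and no other wildcard.
        return "PrefixQuery"
    # Any other wildcard, or an asterisk that is not at the end.
    return "WildcardQuery"


for term in ["dash-test", "dash-123", "dash*", "dash-t*", "da*sh", "da?h*"]:
    print(term, classify_term(term))
```

By this rule neither dash-123 nor dash-test is a prefix query; only the forms ending in an asterisk, such as dash-* and dash-t*, are.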
RE: jaspq: dashed numerical values tokenized differently
-----Original Message-----
From: Morus Walter [mailto:[EMAIL PROTECTED]
Sent: Tuesday, 2 November 2004 09:21
To: Lucene Users List
Subject: Re: jaspq: dashed numerical values tokenized differently

>Daniel Taurat writes:
>> Hi,
>> I have just another stupid parser question:
>> There seems to be a special handling of the dash sign "-" different from
>> Lucene 1.2 at least in Lucene 1.4.RC3
>> StandardAnalyzer.
>>
>> Examples (1.4RC3):
>>
>> A document containing the string "dash-test" is matched by the following
>> search expressions:
>> dash
>> test
>> dash*
>> dash-test
>> It is _not_ matched by the following search expressions:
>> dash-*
>> dash-t*
>>
>> If the string after the dash consists of digits, the behavior is
>> different.
>> E.g., a document containing the string "dash-123" is matched by:
>> dash*
>> dash-*
>> dash-123
>> It is not matched by:
>> dash
>> 123
>>
>> Question:
>> Is this, esp. the different behavior when parsing digits and characters,
>> intentional and how can it be explained?
>> Regards,
>>
>Query parser was changed to treat '-' within words as part of the word.
>Before that change a query 'dash-test' was parsed as 'dash AND NOT test'.
>Now QP reads one word 'dash-test' which is analyzed. If the analyzer
>splits that to more than one token (standard analyzer does) a phrase
>query is created.
>The difference you see comes from standard analyzer which tokenizes
>dash-test dash-123 to tokens dash, test and dash-123.
>Prefix queries aren't analyzed.

So you say that dash-123 is a prefix query whereas dash-test is not?
I also found (with Luke) that dash-anystring123 is not tokenized either.
What exactly are the criteria by which Lucene decides what is a prefix and what is not?

Daniel
Re: jaspq: dashed numerical values tokenized differently
Daniel Taurat writes:
> Hi,
> I have just another stupid parser question:
> There seems to be a special handling of the dash sign "-" different from
> Lucene 1.2 at least in Lucene 1.4.RC3
> StandardAnalyzer.
>
> Examples (1.4RC3):
>
> A document containing the string "dash-test" is matched by the following
> search expressions:
> dash
> test
> dash*
> dash-test
> It is _not_ matched by the following search expressions:
> dash-*
> dash-t*
>
> If the string after the dash consists of digits, the behavior is
> different.
> E.g., a document containing the string "dash-123" is matched by:
> dash*
> dash-*
> dash-123
> It is not matched by:
> dash
> 123
>
> Question:
> Is this, esp. the different behavior when parsing digits and characters,
> intentional and how can it be explained?
> Regards,
>
Query parser was changed to treat '-' within words as part of the word.
Before that change a query 'dash-test' was parsed as 'dash AND NOT test'.
Now QP reads one word 'dash-test' which is analyzed. If the analyzer splits that to more than one token (standard analyzer does) a phrase query is created.
The difference you see comes from standard analyzer which tokenizes dash-test dash-123 to tokens dash, test and dash-123.
Prefix queries aren't analyzed.

Morus
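Morus's explanation accounts for every row of Daniel's match table, which a small model makes easy to check. The sketch below is Python, not Lucene: `matches` is a hypothetical helper, and the `ANALYZED` table simply hard-codes the tokens Morus reports for the standard analyzer. A query ending in an asterisk is compared against the indexed tokens as a raw, unanalyzed prefix; any other query is analyzed and matched as a phrase.

```python
# Tokens per string, as reported in the message above (assumed, not computed).
ANALYZED = {
    "dash-test": ["dash", "test"],
    "dash-123": ["dash-123"],
    "dash": ["dash"],
    "test": ["test"],
    "123": ["123"],
}


def matches(query, document):
    """Return True if the query would match the document in this model."""
    doc_tokens = ANALYZED[document]
    if query.endswith("*"):
        # Prefix queries are not analyzed: the text before the asterisk
        # must be the start of a single indexed token.
        prefix = query[:-1]
        return any(token.startswith(prefix) for token in doc_tokens)
    # Everything else is analyzed; several tokens form a phrase that must
    # occur as consecutive tokens in the document.
    query_tokens = ANALYZED[query]
    width = len(query_tokens)
    return any(
        doc_tokens[i:i + width] == query_tokens
        for i in range(len(doc_tokens) - width + 1)
    )


for document, queries in [
    ("dash-test", ["dash", "test", "dash*", "dash-test", "dash-*", "dash-t*"]),
    ("dash-123", ["dash*", "dash-*", "dash-123", "dash", "123"]),
]:
    for query in queries:
        print(document, query, matches(query, document))
```

The prefix dash- finds nothing in the first document because no indexed token there contains a dash, while in the second document the single token dash-123 starts with it.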
Re: jaspq: dashed numerical values tokenized differently
Daniel Taurat wrote:
> Hi,
> I have just another stupid parser question:
> There seems to be a special handling of the dash sign "-" different from
> Lucene 1.2 at least in Lucene 1.4.RC3
> StandardAnalyzer.

From the behaviour you describe, I think that the dash sign is removed from the text by the analyzer. This is quite correct, because a dash is used to separate two words. Without its elimination you wouldn't be able to get "dash-test" in the results if you search for:
dash or/and test

I suggest you use LUKE (see the contributors page) to see what exactly you have in the index; then you will understand why the search works like that.

Sergiu

> Examples (1.4RC3):
>
> A document containing the string "dash-test" is matched by the following
> search expressions:
> dash
> test
> dash*
> dash-test
> It is _not_ matched by the following search expressions:
> dash-*
> dash-t*
>
> If the string after the dash consists of digits, the behavior is
> different.
> E.g., a document containing the string "dash-123" is matched by:
> dash*
> dash-*
> dash-123
> It is not matched by:
> dash
> 123
>
> Question:
> Is this, esp. the different behavior when parsing digits and characters,
> intentional and how can it be explained?
> Regards,
> Daniel
jaspq: dashed numerical values tokenized differently
Hi,
I have just another stupid parser question:
There seems to be a special handling of the dash sign "-" different from Lucene 1.2 at least in Lucene 1.4.RC3 StandardAnalyzer.

Examples (1.4RC3):

A document containing the string "dash-test" is matched by the following search expressions:
dash
test
dash*
dash-test
It is _not_ matched by the following search expressions:
dash-*
dash-t*

If the string after the dash consists of digits, the behavior is different.
E.g., a document containing the string "dash-123" is matched by:
dash*
dash-*
dash-123
It is not matched by:
dash
123

Question:
Is this, esp. the different behavior when parsing digits and characters, intentional and how can it be explained?

Regards,
Daniel