One of the token patterns defined in StandardTokenizer.jj is this:
<NUM: (<ALPHANUM> <P> <HAS_DIGIT>
     | <HAS_DIGIT> <P> <ALPHANUM>
     | <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
     | <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
     | <ALPHANUM> <P> <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
     | <HAS_DIGIT> <P> <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
      )
>
So basically, if you have sequences of characters separated by a -
character, a sequence that contains a digit is combined with the sequences
adjacent to it to form a single token. In your example, PP00 contains a
digit, so CA and PROCESS are joined to it, while WS and YYMM are not next to
any sequence that contains a digit. That explains why the WS and YYMM
sequences got separated out. You can alter this behavior with some simple
changes to StandardTokenizer.jj.
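
To see the rule in action without rebuilding the grammar, here is a rough
sketch in plain Java. It is not the Lucene tokenizer: it only imitates the
NUM and ALPHANUM rules with java.util.regex, and it assumes ASCII-only
letters and digits and that P matches one of - _ / . , (check your copy of
StandardTokenizer.jj for the real definitions). The class name
NumPatternSketch and the tokenize method are made up for this illustration.
At each position it tries every rule and keeps the longest match, which is
roughly what a JavaCC token manager does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: imitates only the NUM and ALPHANUM rules, nothing else.
class NumPatternSketch {
    // Simplified stand-ins for the grammar's sub-patterns (ASCII only; an assumption).
    private static final String A = "[A-Za-z0-9]+";                   // ALPHANUM
    private static final String H = "[A-Za-z0-9]*[0-9][A-Za-z0-9]*";  // HAS_DIGIT
    private static final String P = "[-_/.,]";                        // P (assumed character set)

    // The six NUM alternatives in the order quoted above, then plain ALPHANUM.
    private static final Pattern[] RULES = {
        Pattern.compile(A + P + H),
        Pattern.compile(H + P + A),
        Pattern.compile(A + "(?:" + P + H + P + A + ")+"),
        Pattern.compile(H + "(?:" + P + A + P + H + ")+"),
        Pattern.compile(A + P + H + "(?:" + P + A + P + H + ")+"),
        Pattern.compile(H + P + A + "(?:" + P + H + P + A + ")+"),
        Pattern.compile(A)
    };

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int best = 0;
            // Try every rule at the current position and remember the longest match.
            for (Pattern rule : RULES) {
                Matcher m = rule.matcher(text);
                m.region(pos, text.length());
                if (m.lookingAt() && m.end() - pos > best) {
                    best = m.end() - pos;
                }
            }
            if (best == 0) {
                pos++;  // no rule starts here (e.g. a lone '-'), so skip the character
            } else {
                tokens.add(text.substring(pos, pos + best));
                pos += best;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [WS, CA-PP00-PROCESS, YYMM]
        System.out.println(tokenize("WS-CA-PP00-PROCESS-YYMM"));
    }
}
```

WS-CA never matches NUM because neither side has a digit, so WS comes out
alone. CA-PP00-PROCESS matches the third alternative, and it cannot extend
to -YYMM because that would need a digit-bearing sequence after PROCESS.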
----- Original Message -----
From: Iain Young [EMAIL PROTECTED]
To: 'Lucene Users List' [EMAIL PROTECTED]
Sent: Tuesday, December 16, 2003 7:46 AM
Subject: RE: Disabling modifiers?
I think it is a problem with the indexing. I've found another example...
WS-CA-PP00-PROCESS-YYMM
I've looked at the index, and it has been tokenized into 3 words...
WS
CA-PP00-PROCESS
YYMM
Looks as though I might have to use a custom tokenizer as well as an
analyzer then, but any ideas as to why the standard tokenizer would have
split the variable up like this (i.e. why didn't it split the middle bit,
only the word off either end)? The only thing I can think of is that there
are several other variables in the source beginning with WS- or ending with
-YYMM, so could the tokenizer have seen this and be doing something clever
with them?
Thanks,
Iain
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]