Re: Disabling modifiers?

2003-12-16 Thread Karl Penney
One of the token patterns defined by the StandardTokenizer.jj is this:
<NUM: (<ALPHANUM> <P> <HAS_DIGIT>
     | <HAS_DIGIT> <P> <ALPHANUM>
     | <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
     | <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
     | <ALPHANUM> <P> <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
     | <HAS_DIGIT> <P> <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
      )
>

So basically, if you have sequences of characters separated by a -
character, any sequence that contains a digit is combined with the sequences
adjacent to it to form a single token.  In your example only PP00 contains a
digit, so it is joined with CA and PROCESS, which explains why the WS and
YYMM sequences were separated out.  You can alter this behavior with some
simple changes to StandardTokenizer.jj.
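
To make the effect of that rule concrete, here is a rough sketch in plain Java that mimics only the NUM pattern. It does not use Lucene or JavaCC at all. The class and method names (NumRuleSketch, hasDigit, runLength, tokenize) are made up for illustration, and it assumes the only separator is "-" and that every segment is a non-empty run of letters and digits. The real tokenizer has other rules that are ignored here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NumRuleSketch {

    // Rough stand-in for HAS_DIGIT: the segment contains at least one digit.
    static boolean hasDigit(String s) {
        for (int k = 0; k < s.length(); k++) {
            if (Character.isDigit(s.charAt(k))) {
                return true;
            }
        }
        return false;
    }

    // Longest run of segments starting at 'start' in which every segment at an
    // offset of the given parity (0 = even, 1 = odd) contains a digit.
    // Taken together, the six NUM alternatives accept exactly the runs of two
    // or more segments that satisfy this for one of the two parities.
    static int runLength(String[] segs, int start, int parity) {
        int n = 0;
        while (start + n < segs.length) {
            if (n % 2 == parity && !hasDigit(segs[start + n])) {
                break;
            }
            n++;
        }
        return n;
    }

    // Greedy longest match, which is how the generated tokenizer picks a token.
    static List<String> tokenize(String text) {
        String[] segs = text.split("-");
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < segs.length) {
            int n = Math.max(runLength(segs, i, 0), runLength(segs, i, 1));
            if (n < 2) {
                n = 1; // no NUM match here: emit the segment on its own
            }
            tokens.add(String.join("-", Arrays.copyOfRange(segs, i, i + n)));
            i += n;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("WS-CA-PP00-PROCESS-YYMM"));
    }
}
```

Running it on WS-CA-PP00-PROCESS-YYMM gives the same three tokens you saw in the index: WS, CA-PP00-PROCESS and YYMM. A name with no digits in any segment is split at every "-".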

- Original Message -
From: Iain Young [EMAIL PROTECTED]
To: 'Lucene Users List' [EMAIL PROTECTED]
Sent: Tuesday, December 16, 2003 7:46 AM
Subject: RE: Disabling modifiers?


 I think it is a problem with the indexing. I've found another example...

 WS-CA-PP00-PROCESS-YYMM

 I've looked at the index, and it has been tokenized into 3 words...

 WS
 CA-PP00-PROCESS
 YYMM

 Looks as though I might have to use a custom tokenizer as well as an
 analyzer then, but any ideas as to why the standard tokenizer would have
 split the variable up like this (i.e. why didn't it split the middle bit,
 only the word off either end)? The only thing I can think of is that there
 are several other variables in the source beginning with WS- or ending with
 -YYMM, so could the tokenizer have seen this and be doing something clever
 with them?

 Thanks,
 Iain




 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]






How to get TokenStream from Field?

2003-12-16 Thread Karl Penney
Is there any way to get a TokenStream for a given Field of a Document (is that 
information even stored in the index)?  I want to use the startOffset / endOffset 
information for hit highlighting.  Do I have to tokenize the text value for the field 
again to get this information?