Re: standard tokenizer seemingly splitting on dot

Mikhail Khludnev Thu, 04 May 2023 03:21:20 -0700

I raised https://github.com/apache/lucene/issues/12264.
Let's see what the devs say.
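
The behavior looks consistent with the Unicode UAX #29 word-break rules that StandardTokenizer and ICUTokenizer follow: "." is a MidNumLet character, which is kept only between two letters (WB6/WB7) or between two digits (WB11/WB12), so a digit followed by ".tif" breaks. Below is a minimal Python sketch of just those rules, ASCII-only and not Lucene's actual code; the names word_class, no_break and tokenize are hypothetical helpers for illustration.

```python
# Simplified, ASCII-only approximation of a few UAX #29 word-break rules.
# This is an illustration, not the Lucene or ICU implementation.

def word_class(ch):
    """Rough stand-in for the Word_Break property (assumption: ASCII only)."""
    if ch.isascii() and ch.isalpha():
        return "ALetter"
    if ch.isascii() and ch.isdigit():
        return "Numeric"
    if ch == ".":
        return "MidNumLet"   # allowed inside letter runs and inside digit runs
    if ch == ",":
        return "MidNum"      # allowed inside digit runs only
    return "Other"           # everything else, including "@", always breaks


def no_break(cls, i):
    """Return True if the rules forbid a break between positions i-1 and i."""
    def at(k):
        return cls[k] if 0 <= k < len(cls) else None

    before2, before, after, after2 = at(i - 2), at(i - 1), at(i), at(i + 1)
    word = ("ALetter", "Numeric")
    mid_num = ("MidNum", "MidNumLet")

    # WB5, WB8, WB9, WB10: letters and digits stick together.
    if before in word and after in word:
        return True
    # WB6 / WB7: letter (MidNumLet) letter.
    if before == "ALetter" and after == "MidNumLet" and after2 == "ALetter":
        return True
    if before2 == "ALetter" and before == "MidNumLet" and after == "ALetter":
        return True
    # WB11 / WB12: digit (MidNum | MidNumLet) digit.
    if before2 == "Numeric" and before in mid_num and after == "Numeric":
        return True
    if before == "Numeric" and after in mid_num and after2 == "Numeric":
        return True
    return False


def tokenize(text):
    """Split text at the allowed boundaries; drop punctuation-only pieces."""
    cls = [word_class(c) for c in text]
    tokens, start = [], 0
    for i in range(1, len(text) + 1):
        if i < len(text) and no_break(cls, i):
            continue
        piece = text[start:i]
        if any(c.isalnum() for c in piece):
            tokens.append(piece)
        start = i
    return tokens


for sample in ("ab00c.tif", "ab003.tif", "test.com", "test7.com"):
    print(sample, "->", tokenize(sample))
```

With this sketch, "ab00c.tif" and "test.com" stay whole because the dot sits between two letters, while "ab003.tif" and "test7.com" split because the dot sits between a digit and a letter.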


On Wed, May 3, 2023 at 6:13 PM Bill Tantzen <[email protected]>
wrote:

> Shawn,
> No, email addresses are not preserved -- from the docs:
>
>
>    - The "@" character is among the set of token-splitting punctuation, so
>      email addresses are not preserved as single tokens.
>
>
> but the non-split on "test.com" vs the split on "test7.com" is unexpected!
> ~~Bill
>
>
> On Wed, May 3, 2023 at 10:04 AM Shawn Heisey <[email protected]> wrote:
>
> > On 5/2/23 15:30, Bill Tantzen wrote:
> > > This works as I expected:
> > > ab00c.tif -- tokenizes as it should with a value of ab00c.tif
> > >
> > > This doesn't work as I expected
> > > ab003.tif -- tokenizes with a result of ab003 and tif
> >
> > I got the same behavior with ICUTokenizer, which uses ICU4J for Unicode
> > handling.  I am pretty sure ICU4J is IBM's implementation of Unicode.  I
> > think StandardTokenizer is using a different implementation.
> >
> > I'm on Solr 9.3.0-SNAPSHOT ... the ICU analysis components it uses
> > reference icu4j version 70.1, which is dated Oct 28, 2021 on maven
> > central.
> >
> > Two different Unicode implementations are doing exactly the same thing.
> > Is it a bug, or expected behavior?  It does mean filenames are sometimes
> > not being handled in the way you expect.
> >
> > I ran another check ... I had thought that StandardTokenizer preserved
> > email addresses as a single token ... but I am seeing that [email protected]
> > is split into two terms.  It splits [email protected] into three terms.
> >
> > Thanks,
> > Shawn
> >
>
>
> --
> Human wheels spin round and round
> While the clock keeps the pace... -- John Mellencamp
> ________________________________________________________________
> Bill Tantzen    University of Minnesota Libraries
> 612-626-9949 (U of M)    612-325-1777 (cell)
>


-- 
Sincerely yours
Mikhail Khludnev
https://t.me/MUST_SEARCH
A caveat: Cyrillic!
