Hello, Shawn

Thank you for your response.

Yes, I am sure that I need to preserve "-" within words.
What I want to build is not actually search; it is a suggestion feature.
"abc-efg" is a dummy sample of one of our product IDs.
There are several such product IDs: abc-efg, abc-hij, abc-klm, and so
on.
When a user types "abc", I would like to suggest the above candidates, with
their hyphens intact.
I would also like ordinary suggestions to keep working: for example,
when a user types "com", I would like to suggest "coming" as well.
(Probably my first example sentence is not good ...)
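
To make the goal more concrete, here is a rough Python sketch of the
behavior I am after. It is only an illustration, not Solr or Lucene code:
the function names tokenize and suggest are made up for this example, and
the term list is built from my dummy product IDs.

```python
import string


def tokenize(text):
    """Split on whitespace, lowercase, and trim punctuation from the edges.

    Hyphens inside a token (as in "abc-efg") are left untouched.
    """
    tokens = []
    for raw in text.split():
        token = raw.lower().strip(string.punctuation)
        if token:
            tokens.append(token)
    return tokens


def suggest(prefix, terms):
    """Return the distinct terms that start with the typed prefix, sorted."""
    prefix = prefix.lower()
    return sorted({term for term in terms if term.startswith(prefix)})


# Dummy data: two IDs come from a sentence, one is added directly.
terms = tokenize("This is a new abc-efg and abc-hij is coming soon!") + ["abc-klm"]

print(suggest("abc", terms))  # product IDs, hyphens preserved
print(suggest("com", terms))  # ordinary word suggestion
```

In this sketch, typing "abc" brings back all three hyphenated IDs, and
typing "com" still brings back "coming".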

I apologize for the confusion: the "!" token is not important at all.

I will consider WordDelimiterGraphFilter.
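
As I understand your suggestion, the "- => ALPHA" line changes which
characters count as delimiters. Below is a very rough Python emulation of
that idea, just to check my understanding. It is not how Lucene implements
the filter, it ignores the filter's other splitting rules (case changes,
letter/digit boundaries), and the names char_type and split_on_delimiters
are hypothetical.

```python
def char_type(ch, overrides):
    """Classify a character, letting an override table win (like a types file)."""
    if ch in overrides:
        return overrides[ch]
    if ch.isalpha():
        return "ALPHA"
    if ch.isdigit():
        return "DIGIT"
    return "DELIM"


def split_on_delimiters(text, overrides=None):
    """Whitespace-tokenize, then split each token wherever a DELIM character occurs."""
    overrides = overrides or {}
    parts = []
    for token in text.split():
        current = ""
        for ch in token:
            if char_type(ch, overrides) == "DELIM":
                # A delimiter ends the current part and is itself dropped.
                if current:
                    parts.append(current)
                current = ""
            else:
                current += ch
        if current:
            parts.append(current)
    return parts


sentence = "abc-efg is coming soon!"
print(split_on_delimiters(sentence))                # hyphen acts as a delimiter
print(split_on_delimiters(sentence, {"-": "ALPHA"}))  # hyphen treated as a letter
```

With the override in place, "abc-efg" stays in one piece, while "!" is
still dropped, which matches what you described.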

Again, thank you for your response.

Sincerely,
Kaya Ota


On Wed, Aug 26, 2020 at 15:57, Shawn Heisey <apa...@elyograg.org> wrote:

> On 8/26/2020 12:05 AM, Kayak28 wrote:
> > I would like to tokenize the following sentence. I want tokens that
> > keep their hyphens. For example, original text: This is a new
> > abc-edg and xyz-abc is coming soon! desired output tokens:
> > this/is/a/new/abc-edg/and/xyz-abc/is/coming/soon/! Is there any way
> > to avoid dropping hyphens from tokens? I thought HyphenatedWordsFilter
> > had similar functionality, but it gets rid of hyphens.
>
> I doubt that filter is what you need.  It is fully described in Javadocs:
>
>
> https://lucene.apache.org/core/8_6_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/HyphenatedWordsFilter.html
>
> Your requirement is a little odd.  Are you SURE that you want to
> preserve hyphens like that?
>
> I think that you could probably achieve it with a carefully configured
> WordDelimiterGraphFilter.  This filter can be highly customized with its
> "types" parameter.  This parameter refers to a file in the conf
> directory that can change how the filter recognizes certain characters.
> I think that if you used the whitespace tokenizer along with the word
> delimiter filter, and put the following line into the file referenced by
> the "types" parameter, it would do most of what you're after:
>
> - => ALPHA
>
> What that config would do is cause the word delimiter filter to treat
> the hyphen as an alpha character -- so it will not use it as a
> delimiter.  One thing about the way it works -- the exclamation point at
> the end of your sentence would NOT be emitted as a token as you have
> described.  If that is critically important, and I cannot imagine that
> it would be, you're probably going to want to write your own custom
> filter.  That would be very much an expert option.
>
> Thanks,
> Shawn
>
>

-- 

Sincerely,
Kaya
github: https://github.com/28kayak
