Hello, Shawn,

Thank you for your response.
Yes, I am sure that I need to preserve "-" in the words.

What I want to do is not actually search; it is suggestion. "abc-efg" is a dummy sample of our product IDs. There are several product IDs, such as abc-efg, abc-hij, abc-klm, and so on. When a user types "abc", I would like to suggest the above candidates, with hyphens.

I would also like to provide the usual suggestions: for example, when a user types "com", I would like to suggest "coming" as well. (My first example sentence was probably not a good one ...)

I apologize for confusing you, but "!" is not important at all.

I will consider WordDelimiterGraphFilter.

Again, thank you for your response.

Sincerely,
Kaya Ota

On Wed, Aug 26, 2020 at 15:57, Shawn Heisey <apa...@elyograg.org> wrote:

> On 8/26/2020 12:05 AM, Kayak28 wrote:
> > I would like to tokenize the following sentence. I want tokens
> > that retain their hyphens. So, for example,
> >
> > original text: This is a new abc-edg and xyz-abc is coming soon!
> > desired output tokens:
> > this/is/a/new/abc-edg/and/xyz-abc/is/coming/soon/!
> >
> > Is there any way to keep hyphens in the tokens? I thought
> > HyphenatedWordsFilter had similar functionality, but it gets
> > rid of hyphens.
>
> I doubt that filter is what you need. It is fully described in Javadocs:
>
> https://lucene.apache.org/core/8_6_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/HyphenatedWordsFilter.html
>
> Your requirement is a little odd. Are you SURE that you want to
> preserve hyphens like that?
>
> I think that you could probably achieve it with a carefully configured
> WordDelimiterGraphFilter. This filter can be highly customized with its
> "types" parameter. This parameter refers to a file in the conf
> directory that can change how the filter recognizes certain characters.
>
> I think that if you used the whitespace tokenizer along with the word
> delimiter filter, and put the following line into the file referenced by
> the "types" parameter, it would do most of what you're after:
>
> - => ALPHA
>
> What that config would do is cause the word delimiter filter to treat
> the hyphen as an alpha character -- so it will not use it as a
> delimiter. One thing about the way it works -- the exclamation point at
> the end of your sentence would NOT be emitted as a token as you have
> described. If that is critically important, and I cannot imagine that
> it would be, you're probably going to want to write your own custom
> filter. That would be very much an expert option.
>
> Thanks,
> Shawn

--
Sincerely,
Kaya
github: https://github.com/28kayak
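
To make the behavior discussed in this thread concrete, here is a minimal Python sketch of the intended result. It is only an approximation written for illustration, not Solr or Lucene code: the function names tokenize and suggest are hypothetical, and the character handling is an assumption meant to mimic a whitespace tokenizer followed by a word delimiter filter that treats "-" as an alpha character.

```python
import re


def tokenize(text):
    """Split on whitespace, lowercase, and keep hyphens inside tokens."""
    tokens = []
    for raw in text.split():
        # Assumption: drop every character except ASCII letters, digits,
        # and hyphens, then trim hyphens left dangling at either end.
        cleaned = re.sub(r"[^0-9A-Za-z-]", "", raw).strip("-").lower()
        if cleaned:
            tokens.append(cleaned)
    return tokens


def suggest(prefix, tokens):
    """Return the distinct tokens that start with prefix, in first-seen order."""
    prefix = prefix.lower()
    matches = []
    for token in tokens:
        if token.startswith(prefix) and token not in matches:
            matches.append(token)
    return matches


tokens = tokenize("This is a new abc-edg and xyz-abc is coming soon!")
print(tokens)
print(suggest("abc", tokens))
print(suggest("com", tokens))
```

In this sketch the hyphenated IDs survive as single tokens, the trailing "!" is dropped rather than emitted as its own token (matching the caveat in the quoted reply), and prefix lookup covers both the product-ID case and ordinary words.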