Re: Difference between UAX29URLEmailTokenizerFactory and ClassicTokenizerFactory

2017-11-24 Thread Zheng Lin Edwin Yeo
Hi Ahmet, Ok. Thanks for your advice. Regards, Edwin On 25 November 2017 at 10:23, Ahmet Arslan wrote: > > > Hi Zheng, > > UAX29UET recognizes URLs and e-mails. It does not tokenize them. It keeps > them single token. > > StandardTokenizer produce two or more tokens for an

Re: Difference between UAX29URLEmailTokenizerFactory and ClassicTokenizerFactory

2017-11-24 Thread Zheng Lin Edwin Yeo
Hi Rick, For both of the tokenizers, it does not split on the hyphens for email like this: solr-user@lucene.apache.org The entire email address remains intact for both of the tokenizers. Regards, Edwin On 24 November 2017 at 20:19, Rick Leir wrote: > Edwin > There is a

Re: Difference between UAX29URLEmailTokenizerFactory and ClassicTokenizerFactory

2017-11-24 Thread Ahmet Arslan
Hi Zheng, UAX29UET recognizes URLs and e-mails. It does not tokenize them. It keeps them single token. StandardTokenizer produce two or more tokens for an entity. Please try them using the analysis page, use which one suits your requirements. Ahmet On Friday, November 24, 2017, 11:46:57

Re: Difference between UAX29URLEmailTokenizerFactory and ClassicTokenizerFactory

2017-11-24 Thread Rick Leir
Edwin There is a spec for which characters are acceptable in an email name, and another spec for chars in a domain name. I suspect you will have more success with a tokenizer which is specialized for email, but I have not looked at UAX29URLEmailTokenizerFactory. Does ClassicTokenizerFactory

Difference between UAX29URLEmailTokenizerFactory and ClassicTokenizerFactory

2017-11-24 Thread Zheng Lin Edwin Yeo
Hi, I am indexing email addresses into Solr via EML files. Currently, I am using ClassicTokenizerFactory with LowerCaseFilterFactory. However, I also found that we can also use UAX29URLEmailTokenizerFactory with LowerCaseFilterFactory. Does anyone have any recommendation on which Tokenizer is