Re: MultiCollector collect behavior is changed

2018-07-03 Thread Yonghui Zhao
Thanks Adrien, got it. 2018-07-04 13:46 GMT+08:00 Adrien Grand: > This was considered a bug as the need to early-terminate is a per-collector > decision. If you want to do something like that again, you could fork > MultiCollector and propagate CollectionTerminatedExceptions. > > On Wed, Jul 4

Re: MultiCollector collect behavior is changed

2018-07-03 Thread Adrien Grand
This was considered a bug as the need to early-terminate is a per-collector decision. If you want to do something like that again, you could fork MultiCollector and propagate CollectionTerminatedExceptions. On Wed, Jul 4, 2018 at 05:34, Yonghui Zhao wrote: > In lucene 4.10, > If one collector

MultiCollector collect behavior is changed

2018-07-03 Thread Yonghui Zhao
In Lucene 4.10, if one collector throws CollectionTerminatedException, all collectors are terminated. In Lucene 7.2.1, a CollectionTerminatedException terminates only the collector that threw it; the others keep collecting. How can I keep the old behavior?
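
The difference described above can be sketched in plain Java with no Lucene dependency. Everything in this block is a hypothetical stand-in, not Lucene's API: `Terminated` plays the role of CollectionTerminatedException, `DocCollector` the role of a leaf collector, and `collectIsolating` / `collectPropagating` are illustrative drivers for the 7.x-style and 4.10-style behavior. In real code, the reply in this thread suggests forking MultiCollector and propagating the exception.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MultiCollectSketch {
    /** Hypothetical stand-in for Lucene's CollectionTerminatedException. */
    static class Terminated extends RuntimeException {}

    /** Hypothetical stand-in for a leaf collector: receives doc IDs one at a time. */
    interface DocCollector {
        void collect(int doc);
    }

    /** Records doc IDs and asks to stop once it has seen `limit` of them. */
    static class Limited implements DocCollector {
        final int limit;
        final List<Integer> seen = new ArrayList<>();

        Limited(int limit) { this.limit = limit; }

        @Override
        public void collect(int doc) {
            if (seen.size() >= limit) throw new Terminated();
            seen.add(doc);
        }
    }

    /** 7.x-style: a collector that terminates is dropped; the others keep going. */
    static void collectIsolating(int[] docs, List<DocCollector> collectors) {
        List<DocCollector> active = new ArrayList<>(collectors);
        for (int doc : docs) {
            active.removeIf(c -> {
                try { c.collect(doc); return false; }
                catch (Terminated e) { return true; }
            });
            if (active.isEmpty()) return; // nobody left to feed
        }
    }

    /** 4.10-style: the first termination propagates and stops every collector. */
    static void collectPropagating(int[] docs, List<DocCollector> collectors) {
        try {
            for (int doc : docs) {
                for (DocCollector c : collectors) c.collect(doc);
            }
        } catch (Terminated e) {
            // The driver swallows the exception, but collection has stopped for all.
        }
    }

    public static void main(String[] args) {
        int[] docs = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};

        Limited a = new Limited(2), b = new Limited(5);
        collectIsolating(docs, Arrays.<DocCollector>asList(a, b));
        System.out.println("isolating:   " + a.seen.size() + " / " + b.seen.size());

        Limited c = new Limited(2), d = new Limited(5);
        collectPropagating(docs, Arrays.<DocCollector>asList(c, d));
        System.out.println("propagating: " + c.seen.size() + " / " + d.seen.size());
    }
}
```

With the isolating driver the second collector still reaches its own limit; with the propagating driver it stops as soon as the first collector terminates.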

[ANNOUNCE] Apache Lucene 6.6.5 released

2018-07-03 Thread Ishan Chattopadhyaya
03 July 2018, Apache Lucene™ 6.6.5 available The Lucene PMC is pleased to announce the release of Apache Lucene 6.6.5. Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-

Re: WordDelimiterGraphFilter swallows emojis

2018-07-03 Thread Michael Sokolov
Ah I see -- there is \p{Emoji} to start with, which is nice, but also this extended pictographic -- I'll read more, and get back if I have questions. Might be a little while before I dig in to this though. Thanks again On Tue, Jul 3, 2018 at 11:25 AM Robert Muir wrote: > If you customized the ru

Re: WordDelimiterGraphFilter swallows emojis

2018-07-03 Thread Robert Muir
If you customized the rules, maybe have a look at https://issues.apache.org/jira/browse/LUCENE-8366 The rules got simpler and we also updated the customization example used for the factory's test. On Tue, Jul 3, 2018 at 10:46 AM, Michael Sokolov wrote: > Yes that sounds good -- this ConditionalT

Re: WordDelimiterGraphFilter swallows emojis

2018-07-03 Thread Michael Sokolov
Yes that sounds good -- this ConditionalTokenFilter is going to be very helpful. We have overridden the ICUTokenizer's rbbi rules, but I'll poke around and see about incorporating the emoji rules from there. Thanks Robert On Tue, Jul 3, 2018 at 9:28 AM Robert Muir wrote: > > Any thoughts? > > b

Re: WordDelimiterGraphFilter swallows emojis

2018-07-03 Thread Michael Sokolov
Thanks for the pointer On Tue, Jul 3, 2018 at 9:04 AM julien Blaize wrote: > Hello Michael, > > I had previously worked on emoji detection with Lucene. > > I had to extend the Tokenizer class (and not the TokenFilter like > WordDelimiterFilter) to preserve the delimiter attribute. > I also had

Re: WordDelimiterGraphFilter swallows emojis

2018-07-03 Thread Robert Muir
> Any thoughts? Best idea I have would be to tokenize with ICUTokenizer, which will tag emoji sequences as "<EMOJI>" token type, then use ConditionalTokenFilter to send all tokens EXCEPT those with token type of "<EMOJI>" to your WordDelimiterFilter. This way WordDelimiterFilter never sees the emoji at all and
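
The routing idea in this message can be modeled in plain Java without any Lucene classes. This is only a sketch of the control flow, not of ConditionalTokenFilter itself: `Token`, `route` and `splitOnDelimiters` are hypothetical names, the splitter is a crude stand-in for a word-delimiter filter, and the type string "<EMOJI>" is an assumption about what the tokenizer assigns.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

class ConditionalRoutingSketch {
    /** Hypothetical token: its text plus the type assigned by the tokenizer. */
    static final class Token {
        final String text;
        final String type;

        Token(String text, String type) {
            this.text = text;
            this.type = type;
        }
    }

    /** Crude stand-in for a word-delimiter filter: splits on anything that is not a letter or digit. */
    static List<Token> splitOnDelimiters(Token t) {
        List<Token> out = new ArrayList<>();
        for (String part : t.text.split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) out.add(new Token(part, t.type));
        }
        return out;
    }

    /** Sends a token through `filter` only when `condition` holds; otherwise passes it through untouched. */
    static List<Token> route(List<Token> in, Predicate<Token> condition,
                             Function<Token, List<Token>> filter) {
        List<Token> out = new ArrayList<>();
        for (Token t : in) {
            if (condition.test(t)) out.addAll(filter.apply(t));
            else out.add(t);
        }
        return out;
    }

    /** Convenience: the text of each token, in order. */
    static List<String> texts(List<Token> tokens) {
        List<String> out = new ArrayList<>();
        for (Token t : tokens) out.add(t.text);
        return out;
    }

    public static void main(String[] args) {
        List<Token> tokens = Arrays.asList(
            new Token("wi-fi", "<ALPHANUM>"),
            new Token("\uD83D\uDE00", "<EMOJI>")); // U+1F600 as a surrogate pair

        // Every token goes through the splitter: the emoji is treated as a delimiter and lost.
        System.out.println(texts(route(tokens, t -> true,
            ConditionalRoutingSketch::splitOnDelimiters)));

        // Only non-emoji tokens go through the splitter: the emoji survives.
        System.out.println(texts(route(tokens, t -> !"<EMOJI>".equals(t.type),
            ConditionalRoutingSketch::splitOnDelimiters)));
    }
}
```

The condition is evaluated per token, so the delimiter handling for ordinary words is unchanged while emoji tokens bypass it entirely.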

Re: WordDelimiterGraphFilter swallows emojis

2018-07-03 Thread Robert Muir
On Tue, Jul 3, 2018 at 8:00 AM, Michael Sokolov wrote: > WDGF (and WordDelimiterFilter) treat emoji as "SUBWORD_DELIM" characters > like punctuation and thus remove them, but we would like to be able to > search for emoji and use this filter for handling dashes, dots and other > intra-word punctua

Re: WordDelimiterGraphFilter swallows emojis

2018-07-03 Thread julien Blaize
Hello Michael, I had previously worked on emoji detection with Lucene. I had to extend the Tokenizer class (and not the TokenFilter like WordDelimiterFilter) to preserve the delimiter attribute. I also had to keep track of consecutive delimiters in the character stream because Lucene default imp

WordDelimiterGraphFilter swallows emojis

2018-07-03 Thread Michael Sokolov
WDGF (and WordDelimiterFilter) treat emoji as "SUBWORD_DELIM" characters like punctuation and thus remove them, but we would like to be able to search for emoji and use this filter for handling dashes, dots and other intra-word punctuation. These filters identify non-word and non-digit characters