Re: When zero offsets are not bad - a.k.a. multi-token synonyms yet again

2020-08-25 Thread Roman Chyla
Hi Mike, Sorry for the delay; I was away last week. Now that I'm back at it, my plan is to write a test for the WordDelimiterFilter and pinpoint the problem. Cheers, Roman On Thu, Aug 20, 2020 at 11:21 AM Michael McCandless wrote: > > Hi Roman, > > No need for anyone to be falling on

Re: When zero offsets are not bad - a.k.a. multi-token synonyms yet again

2020-08-20 Thread Michael McCandless
Hi Roman, No need for anyone to be falling on swords here! This is really complicated stuff, no worries. And I think we have a compelling plan to move forwards so that we can index multi-token synonyms AND have 100% correct positional queries at search time, thanks to Michael Gibney's cool

Re: When zero offsets are not bad - a.k.a. multi-token synonyms yet again

2020-08-18 Thread Roman Chyla
Hi Mike, I'm sorry, the problem all along has been related to a word-delimiter filter factory. This is embarrassing but I have to admit it publicly and self-flagellate. A word-delimiter filter is used to split tokens, which are then used to find multi-token synonyms (hence the connection). In

Re: When zero offsets are not bad - a.k.a. multi-token synonyms yet again

2020-08-17 Thread Michael McCandless
Hi Roman, Can you share the full exception / stack trace that IndexWriter throws on that one *'d token in your first example? I thought IndexWriter checks 1) startOffset >= last token's startOffset, and 2) endOffset >= startOffset for the current token. But you seem to be hitting an exception
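
The two checks Mike recalls above can be sketched in standalone Java. This is not Lucene's code: `Token`, `OffsetOrderCheck`, and `firstViolation` are hypothetical names, and the sample terms and offsets are invented for illustration. The sketch only encodes the two rules as stated in the message (start offsets must not go backwards, and a token's end offset must not precede its start offset).

```java
import java.util.Arrays;
import java.util.List;

/** Standalone illustration of the two offset rules; hypothetical, not Lucene source. */
final class OffsetOrderCheck {

    /** Hypothetical stand-in for a token's term and offsets. */
    static final class Token {
        final String term;
        final int startOffset;
        final int endOffset;

        Token(String term, int startOffset, int endOffset) {
            this.term = term;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    /**
     * Returns the index of the first token that breaks either rule,
     * or -1 if the whole stream is acceptable.
     */
    static int firstViolation(List<Token> tokens) {
        int lastStartOffset = 0;
        for (int i = 0; i < tokens.size(); i++) {
            Token t = tokens.get(i);
            // Rule 1: startOffset >= previous token's startOffset.
            boolean goesBackwards = t.startOffset < lastStartOffset;
            // Rule 2: endOffset >= startOffset for the current token.
            boolean endsBeforeStart = t.endOffset < t.startOffset;
            if (goesBackwards || endsBeforeStart) {
                return i;
            }
            lastStartOffset = t.startOffset;
        }
        return -1;
    }

    public static void main(String[] args) {
        // Invented example: a synonym spanning three words, emitted first.
        List<Token> synonymFirst = Arrays.asList(
                new Token("hst", 0, 22),
                new Token("hubble", 0, 6),
                new Token("space", 7, 12),
                new Token("telescope", 13, 22));

        // Same tokens, but the synonym is emitted after the words it covers.
        List<Token> synonymLast = Arrays.asList(
                new Token("hubble", 0, 6),
                new Token("space", 7, 12),
                new Token("telescope", 13, 22),
                new Token("hst", 0, 22));

        System.out.println(firstViolation(synonymFirst));
        System.out.println(firstViolation(synonymLast));
    }
}
```

In this sketch the same four tokens pass or fail depending only on emission order, which is the distinction between "zero" and "backwards" offsets that the thread keeps returning to.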

Re: When zero offsets are not bad - a.k.a. multi-token synonyms yet again

2020-08-14 Thread Roman Chyla
Hi Mike, Thanks for the question! And sorry for the delay, I didn't manage to get to it yesterday. I have generated better output, marked with (*) where it currently fails the first time, and also included one extra case to illustrate the PositionLength attribute. assertU(adoc("id", "603",

Re: When zero offsets are not bad - a.k.a. multi-token synonyms yet again

2020-08-12 Thread Michael McCandless
Hi Roman, Sorry for the late reply! I think there remains substantial confusion about multi-token synonyms and IW's enforcement of offsets. It really is worth thoroughly iterating/understanding your examples so we can get to the bottom of this. It looks to me like it is possible to emit tokens whose

Re: When zero offsets are not bad - a.k.a. multi-token synonyms yet again

2020-08-10 Thread Roman Chyla
Oh, thanks! That saves everybody some time. I have commented in there, pleading to be allowed to do something - if that proposal sounds even a little bit reasonable, please consider amplifying the signal. On Mon, Aug 10, 2020 at 4:22 PM David Smiley wrote: > > There already is one:

Re: When zero offsets are not bad - a.k.a. multi-token synonyms yet again

2020-08-10 Thread David Smiley
There already is one: https://issues.apache.org/jira/browse/LUCENE-8776 ~ David Smiley Apache Lucene/Solr Search Developer http://www.linkedin.com/in/davidwsmiley On Mon, Aug 10, 2020 at 1:30 PM Roman Chyla wrote: > I'll have to somehow find a solution for this situation, giving up > offsets

Re: When zero offsets are not bad - a.k.a. multi-token synonyms yet again

2020-08-10 Thread Roman Chyla
I'll have to somehow find a solution for this situation; giving up offsets seems like too big a price to pay. I see that overriding DefaultIndexingChain is not exactly easy -- the only thing I can think of is to just trick the classloader into giving it a different version of the chain (praying

Re: When zero offsets are not bad - a.k.a. multi-token synonyms yet again

2020-08-06 Thread Gus Heck
I've had a nearly identical experience to what Dave describes; I also chafe under this restriction. On Thu, Aug 6, 2020 at 11:07 AM David Smiley wrote: > I sympathize with your pain, Roman. > > It appears we can't really do index-time multi-word synonyms because of > the offset ordering rule.

Re: When zero offsets are not bad - a.k.a. multi-token synonyms yet again

2020-08-06 Thread David Smiley
I sympathize with your pain, Roman. It appears we can't really do index-time multi-word synonyms because of the offset ordering rule. But it's not just synonyms, it's other forms of multi-token expansion. Where I work, I've seen an interesting approach to mixed language text analysis in which a

Re: When zero offsets are not bad - a.k.a. multi-token synonyms yet again

2020-08-06 Thread Roman Chyla
Hi Mike, Yes, they are not zero offsets - I was instinctively avoiding "negative offsets"; but they are indeed backward offsets. Here is the token stream as produced by the analyzer chain indexing "THE HUBBLE constant: a summary of the hubble space telescope program" term=hubble pos=2 type=word

Re: When zero offsets are not bad - a.k.a. multi-token synonyms yet again

2020-08-06 Thread Michael McCandless
Hi Roman, Hmm, this is all very tricky! First off, why do you call this "zero offsets"? Isn't it "backwards offsets" that your analysis chain is trying to produce? Second, in your first example, if you output the tokens in the right order, they would not violate the "offsets do not go

When zero offsets are not bad - a.k.a. multi-token synonyms yet again

2020-08-05 Thread Roman Chyla
Hello devs, I wanted to create an issue but the helpful message in red letters reminded me to ask first. While porting from Lucene 6.x to 7.x, I'm struggling with a change that was introduced in LUCENE-7626 (https://issues.apache.org/jira/browse/LUCENE-7626). It is believed that zero offset tokens