Hi Mike,
Sorry for the delay, I was away last week. Now that I'm back to it, my
plan is to write a test for the WordDelimiterFilter and pinpoint the
problem.
Cheers,
Roman
On Thu, Aug 20, 2020 at 11:21 AM Michael McCandless
wrote:
>
> Hi Roman,
>
> No need for anyone to be falling on
Hi Roman,
No need for anyone to be falling on swords here! This is really
complicated stuff, no worries. And I think we have a compelling plan to
move forwards so that we can index multi-token synonyms AND have 100%
correct positional queries at search time, thanks to Michael Gibney's cool
Hi Mike,
I'm sorry, the problem has all along been related to the
word-delimiter filter factory. This is embarrassing, but I have to
admit it publicly and self-flagellate.
A word-delimiter filter is used to split tokens; these are then used
to find multi-token synonyms (hence the connection). In
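[As a side note for readers of the archive: the connection Roman describes can be sketched with a hypothetical reduction. This is not the actual WordDelimiterFilter source, which is configurable and far more involved; it only illustrates splitting on non-alphanumeric characters and case changes, which produces the sub-tokens that multi-token synonym matching then consumes.]

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the kind of splitting a word-delimiter filter
// performs: break a token on non-alphanumeric characters and on
// lower-to-upper case changes, e.g. "PowerShot" -> ["Power", "Shot"].
public class DelimiterSplit {
    public static List<String> split(String token) {
        List<String> parts = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i < token.length(); i++) {
            char ch = token.charAt(i);
            boolean caseChange = i > 0
                    && Character.isUpperCase(ch)
                    && Character.isLowerCase(token.charAt(i - 1));
            // Flush the buffered part at a delimiter or a case change.
            if ((!Character.isLetterOrDigit(ch) || caseChange) && cur.length() > 0) {
                parts.add(cur.toString());
                cur.setLength(0);
            }
            if (Character.isLetterOrDigit(ch)) {
                cur.append(ch);
            }
        }
        if (cur.length() > 0) {
            parts.add(cur.toString());
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("PowerShot")); // [Power, Shot]
        System.out.println(split("wi-fi"));     // [wi, fi]
    }
}
```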
Hi Roman,
Can you share the full exception / stack trace that IndexWriter throws on
that one *'d token in your first example? I thought IndexWriter checks 1)
startOffset >= last token's startOffset, and 2) endOffset >= startOffset
for the current token.
But you seem to be hitting an exception
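[For readers following along: the two invariants Mike lists can be reduced to a tiny stand-alone checker. This is a sketch, not IndexWriter's actual code; the real enforcement lives inside the indexing chain, but it likewise rejects offending tokens with an IllegalArgumentException.]

```java
// Sketch of the per-field offset invariants described above (not the
// actual IndexWriter code): 1) a token's startOffset must not go
// backwards relative to the previous token's startOffset, and
// 2) endOffset must be >= startOffset for the current token.
public class OffsetCheck {
    private int lastStartOffset = 0;

    public void accept(String term, int startOffset, int endOffset) {
        if (startOffset < lastStartOffset) {
            throw new IllegalArgumentException("token '" + term + "': startOffset "
                    + startOffset + " goes backwards from " + lastStartOffset);
        }
        if (endOffset < startOffset) {
            throw new IllegalArgumentException("token '" + term + "': endOffset "
                    + endOffset + " < startOffset " + startOffset);
        }
        lastStartOffset = startOffset;
    }

    public static void main(String[] args) {
        OffsetCheck check = new OffsetCheck();
        check.accept("the", 0, 3);     // ok
        check.accept("hubble", 4, 10); // ok
        try {
            check.accept("hubble constant", 0, 19); // startOffset goes backwards
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```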
Hi Mike,
Thanks for the question! And sorry for the delay, I didn't manage to
get to it yesterday. I have generated better output, marked with (*)
where it currently fails the first time and also included one extra
case to illustrate the PositionLength attribute.
assertU(adoc("id", "603",
Hi Roman,
Sorry for the late reply!
I think there remains substantial confusion about multi-token synonyms and
IW's enforcement of offsets. It really is worth thoroughly
iterating/understanding your examples so we can get to the bottom of this.
It looks to me like it is possible to emit tokens whose
Oh, thanks! That saves everybody some time. I have commented in there,
pleading to be allowed to do something - if that proposal sounds even a
little bit reasonable, please consider amplifying the signal
On Mon, Aug 10, 2020 at 4:22 PM David Smiley wrote:
>
> There already is one:
There already is one: https://issues.apache.org/jira/browse/LUCENE-8776
~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley
On Mon, Aug 10, 2020 at 1:30 PM Roman Chyla wrote:
> I'll have to somehow find a solution for this situation, giving up
> offsets
I'll have to somehow find a solution for this situation; giving up
offsets seems like too big a price to pay. I see that overriding
DefaultIndexingChain is not exactly easy -- the only thing I can think
of is to trick the classloader into giving it a different version
of the chain (praying
I've had a nearly identical experience to what Dave describes, I also chafe
under this restriction.
On Thu, Aug 6, 2020 at 11:07 AM David Smiley wrote:
> I sympathize with your pain, Roman.
>
> It appears we can't really do index-time multi-word synonyms because of
> the offset ordering rule.
I sympathize with your pain, Roman.
It appears we can't really do index-time multi-word synonyms because of the
offset ordering rule. But it's not just synonyms, it's other forms of
multi-token expansion. Where I work, I've seen an interesting approach to
mixed language text analysis in which a
Hi Mike,
Yes, they are not zero offsets - I was instinctively avoiding
"negative offsets"; but they are indeed backward offsets.
Here is the token stream as produced by the analyzer chain indexing
"THE HUBBLE constant: a summary of the hubble space telescope program"
term=hubble pos=2 type=word
Hi Roman,
Hmm, this is all very tricky!
First off, why do you call this "zero offsets"? Isn't it "backwards
offsets" that your analysis chain is trying to produce?
Second, in your first example, if you output the tokens in the right order,
they would not violate the "offsets do not go
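[To make Mike's point concrete: one way to "output the tokens in the right order" is to buffer a window of tokens and re-emit them sorted by (startOffset, endOffset). The sketch below is an illustration only, not a Lucene TokenFilter, and the offsets assume the "THE HUBBLE constant" example from earlier in the thread.]

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch: a multi-token synonym such as "hubble constant", emitted after
// its component tokens, has a startOffset that goes backwards. Re-emitting
// the buffered tokens sorted by (startOffset, endOffset) avoids the
// violation while keeping the same offsets on every token.
public class ReorderTokens {
    public record Tok(String term, int start, int end) {}

    public static List<Tok> reorder(List<Tok> tokens) {
        List<Tok> sorted = new ArrayList<>(tokens);
        sorted.sort(Comparator.comparingInt(Tok::start).thenComparingInt(Tok::end));
        return sorted;
    }

    public static void main(String[] args) {
        List<Tok> emitted = List.of(
                new Tok("hubble", 4, 10),
                new Tok("constant", 11, 19),
                new Tok("hubble constant", 4, 19)); // emitted late: goes backwards
        // After reordering, startOffsets are non-decreasing.
        System.out.println(reorder(emitted).get(1).term()); // hubble constant
    }
}
```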
Hello devs,
I wanted to create an issue but the helpful message in red letters
reminded me to ask first.
While porting from Lucene 6.x to 7.x I'm struggling with a change that
was introduced in LUCENE-7626
(https://issues.apache.org/jira/browse/LUCENE-7626)
It is believed that zero offset tokens