Re: Issue with Japanese User Dictionary

2022-01-13 Thread Tomoko Uchida
Hi,

> The only thing that seems to differ is that the characters are full-width
> vs half-width, so I was wondering if this is intended behavior or a bug/too
> restrictive

This is intended behavior. The first column in the user dictionary
must be equal, codepoint for codepoint, to the concatenation of the
segments in the second column. No normalization, such as full-width to
half-width normalization, is applied first (any normalization or
tweaking could cause runtime bugs).
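[Editor's note: a minimal sketch of the check described above, simplified from Lucene's actual UserDictionary validation; the four-column entry format (surface, space-separated segmentation, readings, part-of-speech) follows the Kuromoji user dictionary convention.]

```java
public class UserDictCheck {
    // An entry is: surface,segmentation,readings,part-of-speech.
    // The surface (column 1) must equal the segmentation (column 2) with
    // spaces removed, compared codepoint for codepoint -- no width
    // normalization is applied first.
    static boolean isValidEntry(String line) {
        String[] cols = line.split(",");
        String surface = cols[0];
        String concatenated = cols[1].replace(" ", "");
        return surface.equals(concatenated);
    }

    public static void main(String[] args) {
        // Matching codepoints: accepted.
        System.out.println(isValidEntry("関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,名詞"));
        // Half-width segmentation vs full-width surface: rejected, because
        // the codepoints differ even though the text looks nearly the same.
        System.out.println(isValidEntry("レコーダー,ﾚｺｰﾀﾞｰ,レコーダー,名詞"));
    }
}
```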

2022年1月14日(金) 5:45 Marc D'Mello :
>
> Hi Mike,
>
> Thanks for the response! I'm actually not super familiar with
> UserDictionaries, but looking at the code, it seems like a single line in
> the user-provided user dictionary corresponds to a single entry? In that
> case, here is the line (or entry) that has both widths and that I believe
> is causing the problem:
>
> レコーダー,レコーダー,レコーダー,JA名詞
>
> I'm guessing here that the surface is レコーダー and the concatenated segment is
> the first occurrence of レコーダー. I'm not sure what "surface" or "concatenated
> segment" means though, or what it would mean semantically to replace the
> surface with the full-width version or the concatenated segment with the
> half-width version.
>
> Thanks,
> Marc
>
>
> On Thu, Jan 13, 2022 at 7:18 AM Michael Sokolov  wrote:
>
> > Hi Marc, I wonder if there is a workaround for this issue: e.g., could
> > we have entries for both widths? I wonder if there is some interaction
> > with an analysis chain that is doing half-width -> full-width
> > conversion (or vice versa)? I think the UserDictionary has to operate
> > on pre-analyzed tokens ... although maybe *after* char filtering,
> > which presumably could handle width conversions. A bunch of rambling,
> > but maybe the point is - can you share some more information -- what
> > is the full entry in the dictionary that causes the problem?
> >
> > On Wed, Jan 12, 2022 at 7:04 PM Marc D'Mello  wrote:
> > >
> > > Hi,
> > >
> > > I had a question about the Japanese user dictionary. We have a user
> > > dictionary that used to work but after attempting to upgrade Lucene, it
> > > fails with the following error:
> > >
> > > Caused by: java.lang.RuntimeException: Illegal user dictionary entry
> > レコーダー
> > > - the concatenated segmentation (レコーダー) does not match the surface form
> > > (レコーダー)
> > > at
> > >
> > org.apache.lucene.analysis.ja.dict.UserDictionary.<init>(UserDictionary.java:123)
> > >
> > > The specific commit causing this error is here
> > > <
> > https://github.com/apache/lucene/commit/73ba88a50dec64f367caa88d277c26dfd1d8883b#diff-75fd48fadfd3d011e9c34c4310ef66e9009edfbc738fd82deb5661a8edb5c5d9
> > >.
> > > The only thing that seems to differ is that the characters are full-width
> > > vs half-width, so I was wondering if this is intended behavior or a
> > bug/too
> > > restrictive. Any suggestions for fixing this would be greatly
> > appreciated!
> > > Thanks!
> >
> > -
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
> >




Re: Issue with Japanese User Dictionary

2022-01-13 Thread Marc D'Mello
Hi Mike,

Thanks for the response! I'm actually not super familiar with
UserDictionaries, but looking at the code, it seems like a single line in
the user-provided user dictionary corresponds to a single entry? In that
case, here is the line (or entry) that has both widths and that I believe
is causing the problem:

レコーダー,レコーダー,レコーダー,JA名詞

I'm guessing here that the surface is レコーダー and the concatenated segment is
the first occurrence of レコーダー. I'm not sure what "surface" or "concatenated
segment" means though, or what it would mean semantically to replace the
surface with the full-width version or the concatenated segment with the
half-width version.
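[Editor's note: the two widths can be compared with plain JDK tools. NFKC normalization maps half-width katakana to full-width, which makes an otherwise invisible width difference visible; the half-width string below is hypothetical, since the archive may have width-normalized the original entry.]

```java
import java.text.Normalizer;

public class WidthDemo {
    public static void main(String[] args) {
        String half = "ﾚｺｰﾀﾞｰ"; // half-width katakana, 6 codepoints (ﾀ plus a separate voiced mark)
        String full = "レコーダー"; // full-width katakana, 5 codepoints

        // The strings render almost identically but use different codepoints:
        System.out.println(half.equals(full)); // false

        // NFKC folds half-width forms to full-width (and composes ﾀ + ﾞ into ダ):
        String normalized = Normalizer.normalize(half, Normalizer.Form.NFKC);
        System.out.println(normalized.equals(full)); // true
    }
}
```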

Thanks,
Marc


On Thu, Jan 13, 2022 at 7:18 AM Michael Sokolov  wrote:

> Hi Marc, I wonder if there is a workaround for this issue: e.g., could
> we have entries for both widths? I wonder if there is some interaction
> with an analysis chain that is doing half-width -> full-width
> conversion (or vice versa)? I think the UserDictionary has to operate
> on pre-analyzed tokens ... although maybe *after* char filtering,
> which presumably could handle width conversions. A bunch of rambling,
> but maybe the point is - can you share some more information -- what
> is the full entry in the dictionary that causes the problem?
>
> On Wed, Jan 12, 2022 at 7:04 PM Marc D'Mello  wrote:
> >
> > Hi,
> >
> > I had a question about the Japanese user dictionary. We have a user
> > dictionary that used to work but after attempting to upgrade Lucene, it
> > fails with the following error:
> >
> > Caused by: java.lang.RuntimeException: Illegal user dictionary entry
> レコーダー
> > - the concatenated segmentation (レコーダー) does not match the surface form
> > (レコーダー)
> > at
> >
> org.apache.lucene.analysis.ja.dict.UserDictionary.<init>(UserDictionary.java:123)
> >
> > The specific commit causing this error is here
> > <
> https://github.com/apache/lucene/commit/73ba88a50dec64f367caa88d277c26dfd1d8883b#diff-75fd48fadfd3d011e9c34c4310ef66e9009edfbc738fd82deb5661a8edb5c5d9
> >.
> > The only thing that seems to differ is that the characters are full-width
> > vs half-width, so I was wondering if this is intended behavior or a
> bug/too
> > restrictive. Any suggestions for fixing this would be greatly
> appreciated!
> > Thanks!
>


Re: Issue with Japanese User Dictionary

2022-01-13 Thread Michael Sokolov
Hi Marc, I wonder if there is a workaround for this issue: e.g., could
we have entries for both widths? I wonder if there is some interaction
with an analysis chain that is doing half-width -> full-width
conversion (or vice versa)? I think the UserDictionary has to operate
on pre-analyzed tokens ... although maybe *after* char filtering,
which presumably could handle width conversions. A bunch of rambling,
but maybe the point is - can you share some more information -- what
is the full entry in the dictionary that causes the problem?
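[Editor's note: if width conversion is wanted, one place to do it is indeed the char-filter stage, before JapaneseTokenizer consults the user dictionary. A hypothetical Solr field type along these lines; the file name userdict_ja.txt and the exact factory set are assumptions, not taken from this thread.]

```xml
<fieldType name="text_ja" class="solr.TextField">
  <analyzer>
    <!-- NFKC normalization as a char filter: half-width katakana becomes
         full-width BEFORE tokenization, so the user dictionary only ever
         sees full-width text and its entries can be full-width only. -->
    <charFilter class="solr.ICUNormalizer2CharFilterFactory" name="nfkc" mode="compose"/>
    <tokenizer class="solr.JapaneseTokenizerFactory" userDictionary="userdict_ja.txt"/>
  </analyzer>
</fieldType>
```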

On Wed, Jan 12, 2022 at 7:04 PM Marc D'Mello  wrote:
>
> Hi,
>
> I had a question about the Japanese user dictionary. We have a user
> dictionary that used to work but after attempting to upgrade Lucene, it
> fails with the following error:
>
> Caused by: java.lang.RuntimeException: Illegal user dictionary entry レコーダー
> - the concatenated segmentation (レコーダー) does not match the surface form
> (レコーダー)
> at
> org.apache.lucene.analysis.ja.dict.UserDictionary.<init>(UserDictionary.java:123)
>
> The specific commit causing this error is here
> .
> The only thing that seems to differ is that the characters are full-width
> vs half-width, so I was wondering if this is intended behavior or a bug/too
> restrictive. Any suggestions for fixing this would be greatly appreciated!
> Thanks!




Re: Moving from lucene 6.x to 8.x

2022-01-13 Thread Michael Sokolov
I think the "broken offsets" refers to offsets of tokens "going
backwards". Offsets are attributes of tokens that refer back to their
character position in the original indexed text. Going backwards means that
a token with a greater position (in the sequence of tokens, or token
graph) should not have a lesser offset (or maybe offsets must be strictly
increasing, I forget). If you use term vectors and have these
broken offsets, which should not but often do occur with custom
analysis chains, this could be a problem.
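[Editor's note: the invariant described above can be sketched in a few lines. This is a simplified illustration of the check, not Lucene's actual code; the Token record and offsetsOk helper are invented for the example.]

```java
import java.util.List;

public class OffsetCheck {
    // One token's character offsets into the original indexed text.
    record Token(String term, int start, int end) {}

    // "Offsets must not go backwards": as position increases, a token's
    // start offset must never be less than the previous token's start
    // offset, and each token's end must not precede its start.
    static boolean offsetsOk(List<Token> tokens) {
        int lastStart = -1;
        for (Token t : tokens) {
            if (t.start() < lastStart || t.end() < t.start()) return false;
            lastStart = t.start();
        }
        return true;
    }

    public static void main(String[] args) {
        // A well-behaved stream for the text "foo bar":
        List<Token> ok = List.of(new Token("foo", 0, 3), new Token("bar", 4, 7));
        // A custom filter that appends a token pointing back at earlier
        // text produces "backwards" offsets:
        List<Token> broken = List.of(new Token("foo", 0, 3), new Token("bar", 4, 7),
                                     new Token("foobar", 0, 7));
        System.out.println(offsetsOk(ok));     // true
        System.out.println(offsetsOk(broken)); // false
    }
}
```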

On Wed, Jan 12, 2022 at 12:36 AM Rahul Goswami  wrote:
>
> Thanks Vinay for the link to Erick's talk! I hadn't seen it and I must
> admit it did help put a few things into perspective.
>
> I was able to track down the JIRAs (thank you 'git blame')
> surrounding/leading up to this architectural decision and the linked
> patches:
> https://issues.apache.org/jira/browse/LUCENE-7703  (Record the version that
> was used at index creation time)
> https://issues.apache.org/jira/browse/LUCENE-7730  (Better encode length
> normalization in similarities)
> https://issues.apache.org/jira/browse/LUCENE-7837  (Use
> indexCreatedVersionMajor to fail opening too old indices)
>
> From these JIRAs what I was able to piece together is that if not
> reindexed, relevance scoring might act in unpredictable ways. For my use
> case, I can live with that since we provide an explicit sort on one or more
> fields.
>
> In LUCENE-7703, Adrien says "we will reject broken offsets in term vectors
> as of 7.0". So my questions to the community are
> i) What are these offsets, and what feature/s might break with respect to
> these offsets if not reindexed?
> ii) Do the length normalization changes in  LUCENE-7730 affect only
> relevance scores?
>
> I understand I could be playing with fire here, but reindexing is not a
> practical solution for my situation. At least not in the near future until
> I figure out a more seamless way of reindexing with minimal downtime given
> that there are multiple 1TB+ indexes. Would appreciate inputs from the dev
> community on this.
>
> Thanks,
> Rahul
>
> On Sun, Jan 9, 2022 at 2:41 PM Vinay Rajput 
> wrote:
>
> > Hi Rahul,
> >
> > I am not an expert so someone else might provide a better answer. However,
> > I remember
> > @Erick briefly talked about this restriction in one of his talks here:-
> > https://www.youtube.com/watch?v=eaQBH_H3d3g&t=621s (not sure if you have
> > seen it already).
> >
> > As he explains, it earlier looked like the IndexUpgrader tool was doing
> > the job perfectly, but that wasn't always the case. There is no guarantee
> > that after using the IndexUpgrader tool, your 8.x index will keep all of
> > the characteristics of Lucene 8. There can be situations (e.g. incorrect
> > offsets) where you might get an incorrect relevance score that is
> > difficult to trace and debug. So the Lucene developers have now made it
> > explicit that what people were doing earlier was not ideal, and that they
> > should plan to reindex all documents during a major upgrade.
> >
> > Having said that, what you have done can just work without any issue as
> > long as you don't encounter any odd sorting behavior. This may/may not be
> > super critical depending on the business use case and that is where you
> > might need to make a decision.
> >
> > Thanks,
> > Vinay
> >
> > On Sat, Jan 8, 2022 at 10:27 PM Rahul Goswami 
> > wrote:
> >
> > > Hello,
> > > Would appreciate any insights on the issue. Are there any backward
> > > incompatible changes in 8.x index because of which the lucene upgrader is
> > > unable to upgrade any index EVER touched by <= 6.x ? Or is the
> > restriction
> > > more of a safety net at this point for possible future incompatibilities
> > ?
> > >
> > > Thanks,
> > > Rahul
> > >
> > > On Thu, Jan 6, 2022 at 11:46 PM Rahul Goswami 
> > > wrote:
> > >
> > > > Hello,
> > > > I am using Apache Solr 7.7.2 with indexes which were originally created
> > > on
> > > > 4.8 and upgraded ever since. I recently tried upgrading to 8.x using
> > the
> > > > lucene IndexUpgrader tool and the upgrade fails. I know that lucene 8.x
> > > > prevents opening any segment which was touched by <= 6.x at any point
> > in
> > > > the past. I also know the general recommendation is to reindex upon
> > > > migration to another major release, however it is not always feasible.
> > > >
> > > > So I tried to remove the check for LATEST-1 in SegmentInfos.java (
> > > >
> > >
> > https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.11.1/lucene/core/src/java/org/apache/lucene/index/SegmentInfos.java#L321
> > > )
> > > > and also checked for other references to IndexFormatTooOldException.
> > > Turns
> > > > out that removing this check and rebuilding lucene-core lets the
> > upgrade
> > > go
> > > > through fine. I ran a full sequence of index upgrades from 5.x -> 6.x
> > ->
> > > > 7.x ->8.x. which went through fine. Also search/update operations work
> > > > without any issues in 8.x.
> > > >
> > 

Re: Migration from Lucene 5.5 to 8.11.1

2022-01-13 Thread András Péteri
It looks like Sascha runs IndexUpgrader for each major version, i.e. 6.6.6,
7.7.3 and 8.11.1. The file "segments_91" is written by the 7.7.3 run
immediately before the error.
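[Editor's note: for reference, that per-major sequence uses IndexUpgrader's command-line entry point, one hop per major release since the tool only reads indexes from the previous major version. The jar names and index path below are placeholders, not taken from this thread.]

```shell
# Each run rewrites the index in place, so keep a backup first.
# lucene-backward-codecs may also be needed on the classpath for each hop.
java -cp lucene-core-6.6.6.jar org.apache.lucene.index.IndexUpgrader -verbose /path/to/index
java -cp lucene-core-7.7.3.jar org.apache.lucene.index.IndexUpgrader -verbose /path/to/index
java -cp "lucene-core-8.11.1.jar:lucene-backward-codecs-8.11.1.jar" \
    org.apache.lucene.index.IndexUpgrader -verbose /path/to/index
```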

On Wed, Jan 12, 2022 at 3:44 PM Adrien Grand  wrote:

> The log says what the problem is: version 8.11.1 cannot read indices
> created by Lucene 5.5, you will need to reindex your data.
>
> On Wed, Jan 12, 2022 at 3:41 PM  wrote:
> >
> >
>
>
>
> --
> Adrien
>
>
>
-- 
András