subject:"\[jira\] \[Comment Edited\] \(LUCENE\-8933\) JapaneseTokenizer creates Token objects with corrupt offsets"

[jira] [Comment Edited] (LUCENE-8933) JapaneseTokenizer creates Token objects with corrupt offsets

2019-07-27 Thread Tomoko Uchida (JIRA)



[ 
https://issues.apache.org/jira/browse/LUCENE-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16894298#comment-16894298
 ] 

Tomoko Uchida edited comment on LUCENE-8933 at 7/27/19 6:33 AM:


{quote}Should we go further and check that the concatenation of the segments is 
equal to the surface form?
{quote}
+1

I will open two pull requests.
 - For master branch: Add a equality check between surface form and its 
segmentation.
 - For 8x branch: Add a length check to avoid annoying runtime exceptions, in 
other words, "if the (concatenated) segment is longer than its surface form, 
throw exception when loading a user dictionary".


was (Author: tomoko uchida):
{quote}Should we go further and check that the concatenation of the segments is 
equal to the surface form?
{quote}
+1

I will open two pull requests.
 - For master branch: Add a equality check between surface form and its 
segmentation.
 - For 8x branch: Add a length check to avoid annoying runtime exceptions, in 
other words, "if the segmentation is longer than surface form, throw exception 
when loading a user dictionary".

> JapaneseTokenizer creates Token objects with corrupt offsets
> 
>
> Key: LUCENE-8933
> URL: https://issues.apache.org/jira/browse/LUCENE-8933
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Priority: Minor
>
> An Elasticsearch user reported the following stack trace when parsing 
> synonyms. It looks like the only reason why this might occur is if the offset 
> of a {{org.apache.lucene.analysis.ja.Token}} is not within the expected range.
>  
> {noformat}
> Caused by: java.lang.ArrayIndexOutOfBoundsException
> at 
> org.apache.lucene.analysis.tokenattributes.CharTermAttributeImpl.copyBuffer(CharTermAttributeImpl.java:44)
>  ~[lucene-core-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - 
> nknize - 2018-12-07 14:44:20]
> at 
> org.apache.lucene.analysis.ja.JapaneseTokenizer.incrementToken(JapaneseTokenizer.java:486)
>  ~[?:?]
> at 
> org.apache.lucene.analysis.synonym.SynonymMap$Parser.analyze(SynonymMap.java:318)
>  ~[lucene-analyzers-common-7.6.0.jar:7.6.0 
> 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
> at 
> org.elasticsearch.index.analysis.ESSolrSynonymParser.analyze(ESSolrSynonymParser.java:57)
>  ~[elasticsearch-6.6.1.jar:6.6.1]
> at 
> org.apache.lucene.analysis.synonym.SolrSynonymParser.addInternal(SolrSynonymParser.java:114)
>  ~[lucene-analyzers-common-7.6.0.jar:7.6.0 
> 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
> at 
> org.apache.lucene.analysis.synonym.SolrSynonymParser.parse(SolrSynonymParser.java:70)
>  ~[lucene-analyzers-common-7.6.0.jar:7.6.0 
> 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
> at 
> org.elasticsearch.index.analysis.SynonymTokenFilterFactory.buildSynonyms(SynonymTokenFilterFactory.java:154)
>  ~[elasticsearch-6.6.1.jar:6.6.1]
> ... 24 more
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Comment Edited] (LUCENE-8933) JapaneseTokenizer creates Token objects with corrupt offsets

2019-07-24 Thread Tomoko Uchida (JIRA)



[ 
https://issues.apache.org/jira/browse/LUCENE-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16891940#comment-16891940
 ] 

Tomoko Uchida edited comment on LUCENE-8933 at 7/24/19 3:15 PM:


I once encountered similar errors but it was not related to synonyms. I guess 
that there might be surrogate pair (perhaps Emoji) characters?

If whole analysis chain and ideally the text string which cause the error are 
provided, I will look into it. (Otherwise, it seems to be difficult to 
reproduce the error...)

 


was (Author: tomoko uchida):
I once encountered similar errors but it's not related to synonyms. I guess 
that there might be surrogate pair (perhaps Emoji) characters?

If whole analysis chain and ideally the text string which cause the error are 
provided, I will look into it. (Otherwise, it seems to be difficult to 
reproduce the error...)

 

> JapaneseTokenizer creates Token objects with corrupt offsets
> 
>
> Key: LUCENE-8933
> URL: https://issues.apache.org/jira/browse/LUCENE-8933
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Priority: Minor
>
> An Elasticsearch user reported the following stack trace when parsing 
> synonyms. It looks like the only reason why this might occur is if the offset 
> of a {{org.apache.lucene.analysis.ja.Token}} is not within the expected range.
>  
> {noformat}
> Caused by: java.lang.ArrayIndexOutOfBoundsException
> at 
> org.apache.lucene.analysis.tokenattributes.CharTermAttributeImpl.copyBuffer(CharTermAttributeImpl.java:44)
>  ~[lucene-core-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - 
> nknize - 2018-12-07 14:44:20]
> at 
> org.apache.lucene.analysis.ja.JapaneseTokenizer.incrementToken(JapaneseTokenizer.java:486)
>  ~[?:?]
> at 
> org.apache.lucene.analysis.synonym.SynonymMap$Parser.analyze(SynonymMap.java:318)
>  ~[lucene-analyzers-common-7.6.0.jar:7.6.0 
> 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
> at 
> org.elasticsearch.index.analysis.ESSolrSynonymParser.analyze(ESSolrSynonymParser.java:57)
>  ~[elasticsearch-6.6.1.jar:6.6.1]
> at 
> org.apache.lucene.analysis.synonym.SolrSynonymParser.addInternal(SolrSynonymParser.java:114)
>  ~[lucene-analyzers-common-7.6.0.jar:7.6.0 
> 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
> at 
> org.apache.lucene.analysis.synonym.SolrSynonymParser.parse(SolrSynonymParser.java:70)
>  ~[lucene-analyzers-common-7.6.0.jar:7.6.0 
> 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
> at 
> org.elasticsearch.index.analysis.SynonymTokenFilterFactory.buildSynonyms(SynonymTokenFilterFactory.java:154)
>  ~[elasticsearch-6.6.1.jar:6.6.1]
> ... 24 more
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Comment Edited] (LUCENE-8933) JapaneseTokenizer creates Token objects with corrupt offsets

[jira] [Comment Edited] (LUCENE-8933) JapaneseTokenizer creates Token objects with corrupt offsets

2 matches

Site Navigation

Mail list logo

Footer information