[ https://issues.apache.org/jira/browse/LUCENE-7379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16908981#comment-16908981 ]

Chongchen Chen commented on LUCENE-7379:
----------------------------------------

I think the tokenizer behaves as it should. If you want better Chinese 
tokenization, [HanLP|https://github.com/hankcs/hanlp-lucene-plugin] is a better 
choice.
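
For reference, a minimal sketch of wiring HanLP into query parsing. This is 
not a tested recipe: it assumes the plugin's com.hankcs.lucene.HanLPAnalyzer 
class (per the project README), a Lucene 5.x-style QueryParser, and an 
illustrative "content" field:

{code:java}
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

// Assumed from the hanlp-lucene-plugin README.
import com.hankcs.lucene.HanLPAnalyzer;

public class HanLPQueryDemo {
    public static void main(String[] args) throws Exception {
        // A dictionary-based segmenter can keep multi-character words
        // together instead of splitting them into single characters.
        Analyzer analyzer = new HanLPAnalyzer();
        QueryParser qp = new QueryParser("content", analyzer);
        Query query = qp.parse("打不开");
        System.out.println(query.toString("content"));
    }
}
{code}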

> Search word request on Chinese is not working properly
> ------------------------------------------------------
>
>                 Key: LUCENE-7379
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7379
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/queryparser
>    Affects Versions: 5.0
>            Reporter: Alex Simatov
>            Priority: Major
>
> We had been using Lucene 2.3 in our project for years.
> Some time ago we updated to Lucene 5.0.0.
> After that, Chinese analysis stopped working normally (I did not test 
> Japanese or Korean).
> We have the following code to process the search request:
> 1. analyzer = new ClassicAnalyzer();
> 2. logger.Write2Log(queryString);
> 3. QueryParser qp = new QueryParser(fieldName, analyzer);
> 4. Query query = qp.parse(queryString);
> 5. logger.Write2Log(query.toString(fieldName));
> 6. int hits = searcher.search(query, 1).totalHits;
> The analyzer on line 1 can be changed via configuration.
> Line 2 prints what we pass to Lucene.
> Line 5 prints how Lucene modified the query.
> Normally we use the string 打不开~0.7 for 70% or higher similarity, and 打不开 
> to find exactly this word.
> The ~0.7 syntax has been deprecated since version 4.0, but it still works, 
> at least for English.
> What we saw before (on Lucene 2.3):
> Line 2: 打不开~0.7 
> Line 5: 打不开~0.7
> If we provide a matching string for analysis, line 6 returns the correct 
> result. The same holds for 打不开 without the similarity suffix (~0.7).
> What we see now (on Lucene 5.0):
> Line 2: 打不开~0.7 
> Line 5: 打不开~0
> As I understand it, the parser rewrites the deprecated parameter into the 
> newly supported one, which has a slightly different meaning (at least that 
> is how it behaves for English).
> The string being analyzed contains 打不开, yet line 6 reports that nothing 
> was found.
> For the exact search without ~0.7 we get:
> Line 2: 打不开 
> Line 5: 打 不 开
> Lucene inserted spaces, which are interpreted as the OR operator. As a 
> result, line 6 reports the keyword as found even if only the single 
> character 不 appears in the string being analyzed.
> We tested the same scenario with CJKAnalyzer, ClassicAnalyzer, and 
> SmartChineseAnalyzer. The results are the same: none of them behaves like 
> the analyzer in Lucene 2.3.
> Is this a known problem in the product? Could you please explain, or point 
> to documentation describing, how search should work for Chinese in the 
> cases mentioned?
> Thanks
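
To make the reported behaviour concrete, here is a minimal sketch against a 
Lucene 5.x-style API. Nothing here is taken from the reporter's project: it 
uses ClassicAnalyzer as in line 1 of the report (its tokenizer emits one 
token per Han character) and an illustrative "content" field. Quoting the 
term yields a PhraseQuery that requires the characters to appear adjacently, 
which approximates the old exact-word search:

{code:java}
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.ClassicAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

public class CjkQueryDemo {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new ClassicAnalyzer();
        QueryParser qp = new QueryParser("content", analyzer);

        // Unquoted: three single-character terms, OR'ed by default.
        Query loose = qp.parse("打不开");
        System.out.println(loose.toString("content"));   // 打 不 开

        // Quoted: a phrase query that requires adjacent characters.
        Query phrase = qp.parse("\"打不开\"");
        System.out.println(phrase.toString("content"));  // "打 不 开"

        // Or require every term instead of any term.
        qp.setDefaultOperator(QueryParser.Operator.AND);
        Query all = qp.parse("打不开");
        System.out.println(all.toString("content"));     // +打 +不 +开
    }
}
{code}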


