Hi Scott,

Thank you for your response and suggestions.

In response to your questions, here are my answers:

1. I took a look at Jieba. It uses a dictionary and seems to do a good job
on CJK. I suspect this problem may come from those filters (note: I can
understand why you use CJKWidthFilter to convert Japanese, but I don't
understand why you use CJKBigramFilter and EdgeNGramFilter). Have you tried
commenting out those filters, leaving only Jieba and StopFilter, to see if
this problem disappears?
*A) Yes, I have tried commenting out the other filters, leaving only
Jieba and StopFilter. The problem is still there.*
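For reference, the stripped-down field type I used for that test looked roughly like the following (a sketch: the type name text_chinese_debug is just illustrative, while the tokenizer class and stopword path are the ones from my schema quoted below):

```xml
<!-- Minimal chain for isolating the highlighting offset problem:
     tokenizer + stop filter only, no bigram/edge-ngram/stemming filters -->
<fieldType name="text_chinese_debug" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Jieba tokenizer in SEARCH segmentation mode, as in the full schema -->
    <tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory" segMode="SEARCH"/>
    <!-- Same stopword list as the full chain -->
    <filter class="solr.StopFilterFactory"
            words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
  </analyzer>
</fieldType>
```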

2. Does this problem occur only on Chinese search words? Does it also
happen on English search words?
*A) Yes, the same problem occurs on English words. For example, when I
search for "word", it highlights like this: <em> wor</em>d*

3. To use FastVectorHighlighter, you seem to need to enable all 3 term*
parameters in the field declaration? I see only one is enabled. Please
refer to the answer in this Stack Overflow question:
http://stackoverflow.com/questions/25930180/solr-how-to-highlight-the-whole-search-phrase-only
*A) I have tried enabling all 3 term* parameters for
FastVectorHighlighter too, but the same problem persists.*
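For reference, enabling all 3 term* parameters meant a field declaration along these lines (a sketch; termVectors, termPositions and termOffsets are the attributes the Stack Overflow answer refers to):

```xml
<!-- All three term* attributes enabled, as FastVectorHighlighter requires -->
<field name="content" type="text_chinese" indexed="true" stored="true"
       omitNorms="true"
       termVectors="true" termPositions="true" termOffsets="true"/>
```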


Regards,
Edwin


On 22 October 2015 at 16:25, Scott Chu <scott....@udngroup.com> wrote:

> Hi solr-user,
>
> Can't judge the cause on fast glimpse of your definition but some
> suggestions I can give:
>
> 1. I took a look at Jieba. It uses a dictionary and seems to do a good job
> on CJK. I suspect this problem may come from those filters (note: I can
> understand why you use CJKWidthFilter to convert Japanese, but I don't
> understand why you use CJKBigramFilter and EdgeNGramFilter). Have you tried
> commenting out those filters, leaving only Jieba and StopFilter, to see if
> this problem disappears?
>
> 2. Does this problem occur only on Chinese search words? Does it also
> happen on English search words?
>
> 3. To use FastVectorHighlighter, you seem to need to enable all 3 term*
> parameters in the field declaration? I see only one is enabled. Please
> refer to the answer in this Stack Overflow question:
> http://stackoverflow.com/questions/25930180/solr-how-to-highlight-the-whole-search-phrase-only
>
>
> Scott Chu,scott....@udngroup.com
> 2015/10/22
>
> ----- Original Message -----
> *From: *Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> *To: *solr-user <solr-user@lucene.apache.org>
> *Date: *2015-10-20, 12:04:11
> *Subject: *Re: Highlighting content field problem when using
> JiebaTokenizerFactory
>
> Hi Scott,
>
> Here's my schema.xml for content and title, which uses text_chinese. The
> problem only occurs in content, and not in title.
>
> <field name="content" type="text_chinese" indexed="true" stored="true"
>        omitNorms="true" termVectors="true"/>
> <field name="title" type="text_chinese" indexed="true" stored="true"
>        omitNorms="true" termVectors="true"/>
>
> <fieldType name="text_chinese" class="solr.TextField"
>            positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory"
>                segMode="SEARCH"/>
>     <filter class="solr.CJKWidthFilterFactory"/>
>     <filter class="solr.CJKBigramFilterFactory"/>
>     <filter class="solr.StopFilterFactory"
>             words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
>     <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
>             maxGramSize="15"/>
>     <filter class="solr.PorterStemFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory"
>                segMode="SEARCH"/>
>     <filter class="solr.CJKWidthFilterFactory"/>
>     <filter class="solr.CJKBigramFilterFactory"/>
>     <filter class="solr.StopFilterFactory"
>             words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
>     <filter class="solr.PorterStemFilterFactory"/>
>   </analyzer>
> </fieldType>
>
>
> Here's my solrconfig.xml on the highlighting portion:
>
> <requestHandler name="/highlight" class="solr.SearchHandler">
>   <lst name="defaults">
>     <str name="echoParams">explicit</str>
>     <int name="rows">10</int>
>     <str name="wt">json</str>
>     <str name="indent">true</str>
>     <str name="df">text</str>
>     <str name="fl">id, title, content_type, last_modified, url, score</str>
>
>     <str name="hl">on</str>
>     <str name="hl.fl">id, title, content, author, tag</str>
>     <str name="hl.highlightMultiTerm">true</str>
>     <str name="hl.preserveMulti">true</str>
>     <str name="hl.encoder">html</str>
>     <str name="hl.fragsize">200</str>
>     <str name="group">true</str>
>     <str name="group.field">signature</str>
>     <str name="group.main">true</str>
>     <str name="group.cache.percent">100</str>
>   </lst>
> </requestHandler>
>
> <boundaryScanner name="breakIterator"
>                  class="solr.highlight.BreakIteratorBoundaryScanner">
>   <lst name="defaults">
>     <str name="hl.bs.type">WORD</str>
>     <str name="hl.bs.language">en</str>
>     <str name="hl.bs.country">SG</str>
>   </lst>
> </boundaryScanner>
>
>
> Meanwhile, I'll take a look at the articles too.
>
> Thank you.
>
> Regards,
> Edwin
>
>
> On 20 October 2015 at 11:32, Scott Chu <scott....@udngroup.com> wrote:
>
> > Hi Edwin,
> >
> > I didn't use Jieba on Chinese (I use only CJK, very foundamental, I
> > know) so I didn't experience this problem.
> >
> > I'd suggest you post your schema.xml so we can see how you define your
> > content field and the field type it uses?
> >
> > In the mean time, refer to these articles, maybe the answer or workaround
> > can be deducted from them.
> >
> > https://issues.apache.org/jira/browse/SOLR-3390
> >
> > http://qnalist.com/questions/661133/solr-is-highlighting-wrong-words
> >
> > http://qnalist.com/questions/667066/highlighting-marks-wrong-words
> >
> > Good luck!
> >
> >
> >
> >
> > Scott Chu, scott....@udngroup.com
> > 2015/10/20
> >
> > ----- Original Message -----
> > *From: *Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> > *To: *solr-user <solr-user@lucene.apache.org>
> > *Date: *2015-10-13, 17:04:29
> > *Subject: *Highlighting content field problem when using
> > JiebaTokenizerFactory
> >
> > Hi,
> >
> > I'm trying to use the JiebaTokenizerFactory to index Chinese characters
> > in Solr. It works fine with the segmentation when I'm using
> > the Analysis function on the Solr Admin UI.
> >
> > However, when I tried to do the highlighting in Solr, it does not
> > highlight in the correct place. For example, when I search for
> > 自然環境与企業本身, it highlights 認<em>為自然環</em><em>境</em><em>与企</em><em>業本</em>身的
> >
> > Even when I search for an English word like responsibility, it
> > highlights <em> responsibilit</em>y.
> >
> > Basically, the highlighting goes off by 1 character/space consistently.
> >
> > This problem only happens in the content field, and not in any other
> > fields.
> >
> > Does anyone know what could be causing the issue?
> >
> > I'm using jieba-analysis-1.0.0, Solr 5.3.0 and Lucene 5.3.0.
> >
> >
> > Regards,
> > Edwin
> >
> >
> >
> > -----
> > No virus found in this message.
> > Checked by AVG - www.avg.com
> > Version: 2015.0.6140 / Virus database: 4447/10808 - Release date: 10/12/15
> >
> >
>
>
>
>
>
