Edwin, Congrats on getting it to work! Would you please create a Jira issue for this and add the patch? You won't need the inline change comments -- a good description in the ticket itself will work best.
k/r,
Scott

On Sun, Nov 22, 2015 at 10:13 PM, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
> I've tried making some minor modifications to the code in JiebaSegmenter.java, and the highlighting seems to be fine now.
>
> Basically, I created another int called offset2 in the process() method:
>
> int offset2 = 0;
>
> Then I changed offset to offset2 in this part of the process() method:
>
> if (sb.length() > 0)
>     if (mode == SegMode.SEARCH) {
>         for (Word token : sentenceProcess(sb.toString())) {
>             // tokens.add(new SegToken(token, offset, offset += token.length()));
>             tokens.add(new SegToken(token, offset2, offset2 += token.length())); // Change to offset2 by Edwin
>         }
>     } else {
>         for (Word token : sentenceProcess(sb.toString())) {
>             if (token.length() > 2) {
>                 Word gram2;
>                 int j = 0;
>                 for (; j < token.length() - 1; ++j) {
>                     gram2 = token.subSequence(j, j + 2);
>                     if (wordDict.containsWord(gram2.getToken()))
>                         // tokens.add(new SegToken(gram2, offset + j, offset + j + 2));
>                         tokens.add(new SegToken(gram2, offset2 + j, offset2 + j + 2)); // Change to offset2 by Edwin
>                 }
>             }
>             if (token.length() > 3) {
>                 Word gram3;
>                 int j = 0;
>                 for (; j < token.length() - 2; ++j) {
>                     gram3 = token.subSequence(j, j + 3);
>                     if (wordDict.containsWord(gram3.getToken()))
>                         // tokens.add(new SegToken(gram3, offset + j, offset + j + 3));
>                         tokens.add(new SegToken(gram3, offset2 + j, offset2 + j + 3)); // Change to offset2 by Edwin
>                 }
>             }
>             // tokens.add(new SegToken(token, offset, offset += token.length()));
>             tokens.add(new SegToken(token, offset2, offset2 += token.length())); // Change to offset2 by Edwin
>         }
>     }
>
> Not sure if this is just a workaround or can be used as a permanent solution.
>
> Regards,
> Edwin
>
>
> On 28 October 2015 at 15:29, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
>
> > Hi Scott,
> >
> > I have tried editing the SegToken.java file in the jieba-analysis-1.0.0 package, adding +1 to both the startOffset and endOffset values (see code below), and now the <em> tag in the content field is shifted to the correct place. However, this means that the title and other fields, where the <em> tag was originally in the correct place, now throw an "org.apache.lucene.search.highlight.InvalidTokenOffsetsException". I have temporarily used another tokenizer for those fields for now.
> >
> > public SegToken(Word word, int startOffset, int endOffset) {
> >     this.word = word;
> >     this.startOffset = startOffset + 1;
> >     this.endOffset = endOffset + 1;
> > }
> >
> > However, I don't think this can be a permanent solution, so I'm digging further into the code to see what the difference is between the content field and the other fields.
> >
> > I have also found that although JiebaTokenizer works better for Chinese characters, it doesn't work well for English. For example, if I search for "water", JiebaTokenizer cuts it as follows: w|at|er. It can't keep it as a full word, which HMMChineseTokenizer is able to do.
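A quick way to see exactly where the offsets drift is to dump what the analysis chain emits. Below is a minimal, hedged sketch against the stock Lucene TokenStream API (Lucene 5.3): it prints every token with its reported start/end offsets and the substring of the original text those offsets cover, so an off-by-one shift, or the "w|at|er" splitting, shows up immediately. The class name OffsetDump is illustrative, and the Analyzer argument stands in for however the text_chinese chain wrapping JiebaTokenizerFactory is constructed; this is not part of the patch above.

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

public class OffsetDump {

    // Prints each token with its reported offsets and the substring of the
    // original text those offsets point at. If the two columns disagree, or
    // drift by one position, the tokenizer is emitting bad offsets and the
    // highlighter will mark the wrong characters.
    public static void dump(Analyzer analyzer, String field, String text) throws IOException {
        try (TokenStream ts = analyzer.tokenStream(field, text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            OffsetAttribute offset = ts.addAttribute(OffsetAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                int start = offset.startOffset();
                int end = offset.endOffset();
                String covered = (start >= 0 && start <= end && end <= text.length())
                        ? text.substring(start, end)
                        : "<offsets out of range>";
                System.out.printf("%-15s [%d,%d) -> \"%s\"%n", term.toString(), start, end, covered);
            }
            ts.end();
        }
    }
}

Running dump(...) over a short string such as "responsibility", or over a sentence containing 自然環境与企業本身, before and after the offset2 change should show whether startOffset/endOffset line up with the characters the highlighter marks.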
> >
> > Here's my configuration in schema.xml:
> >
> > <fieldType name="text_chinese2" class="solr.TextField" positionIncrementGap="100">
> >     <analyzer type="index">
> >         <tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory" segMode="SEARCH"/>
> >         <filter class="solr.CJKWidthFilterFactory"/>
> >         <filter class="solr.CJKBigramFilterFactory"/>
> >         <filter class="solr.StopFilterFactory" words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
> >         <filter class="solr.PorterStemFilterFactory"/>
> >         <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="15"/>
> >     </analyzer>
> >     <analyzer type="query">
> >         <tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory" segMode="SEARCH"/>
> >         <filter class="solr.CJKWidthFilterFactory"/>
> >         <filter class="solr.CJKBigramFilterFactory"/>
> >         <filter class="solr.StopFilterFactory" words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
> >         <filter class="solr.PorterStemFilterFactory"/>
> >     </analyzer>
> > </fieldType>
> >
> > Does anyone know if JiebaTokenizer is optimised to handle English characters as well?
> >
> > Regards,
> > Edwin
> >
> >
> > On 27 October 2015 at 15:57, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
> >
> >> Hi Scott,
> >>
> >> Thank you for providing the links and references. Will look through them, and let you know if I find any solutions or workarounds.
> >>
> >> Regards,
> >> Edwin
> >>
> >>
> >> On 27 October 2015 at 11:13, Scott Chu <scott....@udngroup.com> wrote:
> >>
> >>>
> >>> Take a look at Michael's two articles; they might help you clarify how highlighting works in Solr:
> >>>
> >>> Changing Bits: Lucene's TokenStreams are actually graphs!
> >>> http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
> >>>
> >>> Also take a look at the 4th paragraph in another of his articles:
> >>>
> >>> Changing Bits: A new Lucene highlighter is born
> >>> http://blog.mikemccandless.com/2012/12/a-new-lucene-highlighter-is-born.html
> >>>
> >>> Currently, I can't figure out the possible cause of your problem unless I get some spare time to test it on my own, which is not available these days (got some projects to close)!
> >>>
> >>> If you find a solution or workaround, please let us know. Good luck again!
> >>>
> >>> Scott Chu, scott....@udngroup.com
> >>> 2015/10/27
> >>>
> >>> ----- Original Message -----
> >>> From: Scott Chu <scott....@udngroup.com>
> >>> To: solr-user <solr-user@lucene.apache.org>
> >>> Date: 2015-10-27, 10:27:45
> >>> Subject: Re: Highlighting content field problem when using JiebaTokenizerFactory
> >>>
> >>> Hi Edwin,
> >>>
> >>> I took a lot of time to see whether there's anything that can help you pin down the cause of your problem. Maybe this might help you a bit:
> >>>
> >>> [SOLR-4722] Highlighter which generates a list of query term position(s) for each item in a list of documents, or returns null if highlighting is disabled. - AS...
> >>> https://issues.apache.org/jira/browse/SOLR-4722
> >>>
> >>> This one is modified from FastVectorHighlighter, so ensure those 3 term* attributes are on.
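For reference, the "3 term* attributes" are the termVectors, termPositions and termOffsets flags on the field itself. A hedged sketch of what the content and title declarations might look like with all three switched on, reusing the field definitions quoted later in this thread (attribute names are stock Solr; adjust to the actual schema):

<field name="content" type="text_chinese" indexed="true" stored="true" omitNorms="true" termVectors="true" termPositions="true" termOffsets="true"/>
<field name="title" type="text_chinese" indexed="true" stored="true" omitNorms="true" termVectors="true" termPositions="true" termOffsets="true"/>

Term vectors are written at index time, so the documents need to be reindexed after changing these flags.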
> >>>
> >>> Scott Chu, scott....@udngroup.com
> >>> 2015/10/27
> >>>
> >>> ----- Original Message -----
> >>> From: Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> >>> To: solr-user <solr-user@lucene.apache.org>
> >>> Date: 2015-10-23, 10:42:32
> >>> Subject: Re: Highlighting content field problem when using JiebaTokenizerFactory
> >>>
> >>> Hi Scott,
> >>>
> >>> Thank you for your response.
> >>>
> >>> 1. You said the problem only happens on the "contents" field, so maybe there's something wrong with the contents of that field. Does it contain anything special, e.g. HTML tags or symbols? I recall SOLR-42 mentions that HTML stripping can cause highlighting problems. Maybe you can try purifying that field so it is close to pure text and see if the highlighting comes out OK.
> >>> A) I checked, and SOLR-42 is about the HTMLStripWhiteSpaceTokenizerFactory, which I'm not using. I believe that tokenizer is already deprecated too. I've tried all kinds of content for rich-text documents, and all of them have the same problem.
> >>>
> >>> 2. Maybe something is incompatible between JiebaTokenizer and the Solr highlighter. You could switch to other tokenizers, e.g. Standard, CJK, SmartChinese (I don't use this since I am dealing with Traditional Chinese, but I see you are dealing with Simplified Chinese), or the 3rd-party MMSeg, and see if the problem goes away. However, when I was googling for similar problems, I saw you asked the same question in August at Huaban/Jieba-analysis, and somebody said he also uses JiebaTokenizer but doesn't have your problem. So I see this as a less likely suspect.
> >>> A) I was thinking about the incompatibility issue too, as I previously thought that JiebaTokenizer was optimised for Solr 4.x, so it might have issues in 5.x. But the person from Huaban/Jieba-analysis said that he doesn't have this problem in Solr 5.1. I also faced the same problem in Solr 5.1, and although I'm using Solr 5.3.0 now, the same problem persists.
> >>>
> >>> I'm looking at the indexing process too, to see if there's any problem there, but I just can't figure out why it only happens with JiebaTokenizer, and only for the content field.
> >>>
> >>> Regards,
> >>> Edwin
> >>>
> >>>
> >>> On 23 October 2015 at 09:41, Scott Chu <scott....@udngroup.com> wrote:
> >>>
> >>> > Hi Edwin,
> >>> >
> >>> > Since you've tested all my suggestions and the problem is still there, I can't think of anything wrong with your configuration. Now I can only suspect two things:
> >>> >
> >>> > 1. You said the problem only happens on the "contents" field, so maybe there's something wrong with the contents of that field. Does it contain anything special, e.g. HTML tags or symbols? I recall SOLR-42 mentions that HTML stripping can cause highlighting problems. Maybe you can try purifying that field so it is close to pure text and see if the highlighting comes out OK.
> >>> >
> >>> > 2. Maybe something is incompatible between JiebaTokenizer and the Solr highlighter. You could switch to other tokenizers, e.g. Standard, CJK, SmartChinese (I don't use this since I am dealing with Traditional Chinese, but I see you are dealing with Simplified Chinese), or the 3rd-party MMSeg, and see if the problem goes away. However, when I was googling for similar problems, I saw you asked the same question in August at Huaban/Jieba-analysis, and somebody said he also uses JiebaTokenizer but doesn't have your problem. So I see this as a less likely suspect.
> >>> >
> >>> > The theory of your problem could be that something in the indexing process produces wrong position info for that field, and when Solr does the highlighting it retrieves the wrong position info and marks the wrong positions of the highlight target terms.
> >>> >
> >>> > Scott Chu, scott....@udngroup.com
> >>> > 2015/10/23
> >>> >
> >>> > ----- Original Message -----
> >>> > From: Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> >>> > To: solr-user <solr-user@lucene.apache.org>
> >>> > Date: 2015-10-22, 22:22:14
> >>> > Subject: Re: Highlighting content field problem when using JiebaTokenizerFactory
> >>> >
> >>> > Hi Scott,
> >>> >
> >>> > Thank you for your response and suggestions.
> >>> >
> >>> > With regard to your questions, here are the answers:
> >>> >
> >>> > 1. I took a look at Jieba. It uses a dictionary and it seems to do a good job on CJK. I suspect this problem may come from those filters (note: I can understand that you may use CJKWidthFilter to convert Japanese, but I don't understand why you use CJKBigramFilter and EdgeNGramFilter). Have you tried commenting out those filters, say leaving only Jieba and StopFilter, and seeing if this problem disappears?
> >>> > A) Yes, I have tried commenting out the other filters, leaving only Jieba and StopFilter. The problem is still there.
> >>> >
> >>> > 2. Does this problem occur only on Chinese search words? Does it happen on English search words?
> >>> > A) Yes, the same problem occurs on English words. For example, when I search for "word", it will highlight in this way: <em> wor</em>d
> >>> >
> >>> > 3. To use FastVectorHighlighter, you seem to have to enable 3 term* parameters in the field declaration? I see only one is enabled. Please refer to the answer in this stackoverflow question:
> >>> > http://stackoverflow.com/questions/25930180/solr-how-to-highlight-the-whole-search-phrase-only
> >>> > A) I have tried enabling all 3 term* attributes for the FastVectorHighlighter too, but the same problem persists as well.
> >>> >
> >>> >
> >>> > Regards,
> >>> > Edwin
> >>> >
> >>> >
> >>> > On 22 October 2015 at 16:25, Scott Chu <scott....@udngroup.com> wrote:
> >>> >
> >>> > > Hi solr-user,
> >>> > >
> >>> > > I can't judge the cause from a quick glimpse of your definition, but here are some suggestions:
> >>> > >
> >>> > > 1. I took a look at Jieba. It uses a dictionary and it seems to do a good job on CJK. I suspect this problem may come from those filters (note: I can understand that you may use CJKWidthFilter to convert Japanese, but I don't understand why you use CJKBigramFilter and EdgeNGramFilter). Have you tried commenting out those filters, say leaving only Jieba and StopFilter, and seeing if this problem disappears?
> >>> > >
> >>> > > 2. Does this problem occur only on Chinese search words? Does it happen on English search words?
> >>> > >
> >>> > > 3. To use FastVectorHighlighter, you seem to have to enable 3 term* parameters in the field declaration? I see only one is enabled. Please refer to the answer in this stackoverflow question:
> >>> > > http://stackoverflow.com/questions/25930180/solr-how-to-highlight-the-whole-search-phrase-only
> >>> > >
> >>> > > Scott Chu, scott....@udngroup.com
> >>> > > 2015/10/22
> >>> > >
> >>> > > ----- Original Message -----
> >>> > > From: Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> >>> > > To: solr-user <solr-user@lucene.apache.org>
> >>> > > Date: 2015-10-20, 12:04:11
> >>> > > Subject: Re: Highlighting content field problem when using JiebaTokenizerFactory
> >>> > >
> >>> > > Hi Scott,
> >>> > >
> >>> > > Here's my schema.xml for content and title, which use text_chinese. The problem only occurs in content, and not in title.
> >>> > >
> >>> > > <field name="content" type="text_chinese" indexed="true" stored="true" omitNorms="true" termVectors="true"/>
> >>> > > <field name="title" type="text_chinese" indexed="true" stored="true" omitNorms="true" termVectors="true"/>
> >>> > >
> >>> > > <fieldType name="text_chinese" class="solr.TextField" positionIncrementGap="100">
> >>> > >     <analyzer type="index">
> >>> > >         <tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory" segMode="SEARCH"/>
> >>> > >         <filter class="solr.CJKWidthFilterFactory"/>
> >>> > >         <filter class="solr.CJKBigramFilterFactory"/>
> >>> > >         <filter class="solr.StopFilterFactory" words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
> >>> > >         <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="15"/>
> >>> > >         <filter class="solr.PorterStemFilterFactory"/>
> >>> > >     </analyzer>
> >>> > >     <analyzer type="query">
> >>> > >         <tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory" segMode="SEARCH"/>
> >>> > >         <filter class="solr.CJKWidthFilterFactory"/>
> >>> > >         <filter class="solr.CJKBigramFilterFactory"/>
> >>> > >         <filter class="solr.StopFilterFactory" words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
> >>> > >         <filter class="solr.PorterStemFilterFactory"/>
> >>> > >     </analyzer>
> >>> > > </fieldType>
> >>> > >
> >>> > > Here's the highlighting portion of my solrconfig.xml:
> >>> > >
> >>> > > <requestHandler name="/highlight" class="solr.SearchHandler">
> >>> > >     <lst name="defaults">
> >>> > >         <str name="echoParams">explicit</str>
> >>> > >         <int name="rows">10</int>
> >>> > >         <str name="wt">json</str>
> >>> > >         <str name="indent">true</str>
> >>> > >         <str name="df">text</str>
> >>> > >         <str name="fl">id, title, content_type, last_modified, url, score</str>
> >>> > >
> >>> > >         <str name="hl">on</str>
> >>> > >         <str name="hl.fl">id, title, content, author, tag</str>
> >>> > >         <str name="hl.highlightMultiTerm">true</str>
> >>> > >         <str name="hl.preserveMulti">true</str>
> >>> > >         <str name="hl.encoder">html</str>
> >>> > >         <str name="hl.fragsize">200</str>
> >>> > >         <str name="group">true</str>
> >>> > >         <str name="group.field">signature</str>
> >>> > >         <str name="group.main">true</str>
> >>> > >         <str name="group.cache.percent">100</str>
> >>> > >     </lst>
> >>> > > </requestHandler>
> >>> > >
> >>> > > <boundaryScanner name="breakIterator" class="solr.highlight.BreakIteratorBoundaryScanner">
> >>> > >     <lst name="defaults">
> >>> > >         <str name="hl.bs.type">WORD</str>
> >>> > >         <str name="hl.bs.language">en</str>
> >>> > >         <str name="hl.bs.country">SG</str>
> >>> > >     </lst>
> >>> > > </boundaryScanner>
> >>> > >
> >>> > > Meanwhile, I'll take a look at the articles too.
> >>> > >
> >>> > > Thank you.
> >>> > >
> >>> > > Regards,
> >>> > > Edwin
> >>> > >
> >>> > >
> >>> > > On 20 October 2015 at 11:32, Scott Chu <scott....@udngroup.com> wrote:
> >>> > >
> >>> > > > Hi Edwin,
> >>> > > >
> >>> > > > I don't use Jieba on Chinese (I use only CJK, very fundamental, I know), so I haven't experienced this problem.
> >>> > > >
> >>> > > > I'd suggest you post your schema.xml so we can see how you define your content field and the field type it uses.
> >>> > > >
> >>> > > > In the meantime, refer to these articles; maybe the answer or a workaround can be deduced from them.
> >>> > > >
> >>> > > > https://issues.apache.org/jira/browse/SOLR-3390
> >>> > > >
> >>> > > > http://qnalist.com/questions/661133/solr-is-highlighting-wrong-words
> >>> > > >
> >>> > > > http://qnalist.com/questions/667066/highlighting-marks-wrong-words
> >>> > > >
> >>> > > > Good luck!
> >>> > > >
> >>> > > > Scott Chu, scott....@udngroup.com
> >>> > > > 2015/10/20
> >>> > > >
> >>> > > > ----- Original Message -----
> >>> > > > From: Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> >>> > > > To: solr-user <solr-user@lucene.apache.org>
> >>> > > > Date: 2015-10-13, 17:04:29
> >>> > > > Subject: Highlighting content field problem when using JiebaTokenizerFactory
> >>> > > >
> >>> > > > Hi,
> >>> > > >
> >>> > > > I'm trying to use the JiebaTokenizerFactory to index Chinese characters in Solr. It works fine with the segmentation when I'm using the Analysis function on the Solr Admin UI.
> >>> > > >
> >>> > > > However, when I tried to do the highlighting in Solr, it is not highlighting in the correct place.
> >>> > > > For example, when I search for 自然環境与企業本身, it highlights 認<em>為自然環</em><em>境</em><em>与企</em><em>業本</em>身的
> >>> > > >
> >>> > > > Even when I search for an English word like responsibility, it highlights <em> responsibilit</em>y.
> >>> > > >
> >>> > > > Basically, the highlighting goes off by 1 character/space consistently.
> >>> > > >
> >>> > > > This problem only happens in the content field, and not in any other fields. Does anyone know what could be causing the issue?
> >>> > > >
> >>> > > > I'm using jieba-analysis-1.0.0, Solr 5.3.0 and Lucene 5.3.0.
> >>> > > >
> >>> > > > Regards,
> >>> > > > Edwin


--
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC | 434.409.2780
http://www.opensourceconnections.com
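Following up on the FastVectorHighlighter suggestion in the thread: once termVectors, termPositions and termOffsets are enabled on the highlighted fields, the /highlight handler quoted above could be switched over to it. A hedged sketch, keeping only the highlighting defaults (hl.useFastVectorHighlighter is the stock Solr parameter; the combination has not been verified against this particular setup):

<requestHandler name="/highlight" class="solr.SearchHandler">
    <lst name="defaults">
        <str name="hl">on</str>
        <str name="hl.fl">id, title, content, author, tag</str>
        <str name="hl.useFastVectorHighlighter">true</str>
        <str name="hl.encoder">html</str>
        <str name="hl.fragsize">200</str>
    </lst>
</requestHandler>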