[jira] [Commented] (LUCENE-9093) Unified highlighter with word separator never gives context to the left
[ https://issues.apache.org/jira/browse/LUCENE-9093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17006704#comment-17006704 ]

ASF subversion and git services commented on LUCENE-9093:

Commit 4c9cc2cefd7f3593c4b4e1e5a087e3d206298989 in lucene-solr's branch refs/heads/gradle-master from Nándor Mátravölgyi
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=4c9cc2c ]

LUCENE-9093: UnifiedHighlighter LengthGoalBreakIterator frag align

Matches in passages should be centered better on average.
Closes #1123

> Unified highlighter with word separator never gives context to the left
> ---
>
> Key: LUCENE-9093
> URL: https://issues.apache.org/jira/browse/LUCENE-9093
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/highlighter
> Reporter: Tim Retout
> Assignee: David Smiley
> Priority: Major
> Fix For: 8.5
>
> Attachments: LUCENE-9093.patch
>
> Time Spent: 8h 10m
> Remaining Estimate: 0h
>
> When using the unified highlighter with hl.bs.type=WORD, I am not able to get
> context to the left of the matches returned; only words to the right of each
> match are shown. I see this behaviour on both Solr 6.4 and Solr 7.1.
> Without context to the left of a match, the highlighted snippets are much
> less useful for understanding where the match appears in a document.
> As an example, using the techproducts data with Solr 7.1, given a search for
> "apple", highlighting the "features" field:
> http://localhost:8983/solr/techproducts/select?hl.fl=features=on=apple=WORD=30=unified
> I see this snippet:
> "Apple Lossless, H.264 video"
> Note that "Apple" is anchored to the left. Compare with the original
> highlighter:
> http://localhost:8983/solr/techproducts/select?hl.fl=features=on=apple=30
> And the match has context either side:
> ", Audible, Apple Lossless, H.264 video"
> (To complicate this, in general I am not sure that the unified highlighter is
> respecting the hl.fragsize parameter, although [SOLR-9935] suggests support
> was added. I included the hl.fragsize param in the unified URL too, but it's
> making no difference unless set to 0.)

--
This message was sent by Atlassian Jira (v8.3.4#803005)

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9093) Unified highlighter with word separator never gives context to the left
[ https://issues.apache.org/jira/browse/LUCENE-9093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17006316#comment-17006316 ]

ASF subversion and git services commented on LUCENE-9093:

Commit 5874b9c7933233712da14c5a5b9bb4f916eb77f8 in lucene-solr's branch refs/heads/branch_8x from Nándor Mátravölgyi
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=5874b9c ]

LUCENE-9093: UnifiedHighlighter LengthGoalBreakIterator frag align

Matches in passages should be centered better on average.
Closes #1123

(cherry picked from commit 4c9cc2cefd7f3593c4b4e1e5a087e3d206298989)
[jira] [Commented] (LUCENE-9093) Unified highlighter with word separator never gives context to the left
[ https://issues.apache.org/jira/browse/LUCENE-9093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17006312#comment-17006312 ]

ASF subversion and git services commented on LUCENE-9093:

Commit 4c9cc2cefd7f3593c4b4e1e5a087e3d206298989 in lucene-solr's branch refs/heads/master from Nándor Mátravölgyi
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=4c9cc2c ]

LUCENE-9093: UnifiedHighlighter LengthGoalBreakIterator frag align

Matches in passages should be centered better on average.
Closes #1123
[jira] [Commented] (LUCENE-9093) Unified highlighter with word separator never gives context to the left
[ https://issues.apache.org/jira/browse/LUCENE-9093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005403#comment-17005403 ]

David Smiley commented on LUCENE-9093:

The PR is concluding and I want to summarize here:

Proposed LUCENE CHANGES.txt: (two places so we draw attention to something)

API Changes:
LUCENE-9093: Not an API change but a change in behavior of the UnifiedHighlighter's LengthGoalBreakIterator that will yield Passages sized a little differently, because the sizing pivot is now the center of the first match and not its left edge.

Improvements:
LUCENE-9093: UnifiedHighlighter's LengthGoalBreakIterator has a new fragmentAlignment option to better center the first match in the passage. Also the sizing point now pivots at the center of the first match term and not its left edge. This yields Passages that won't be identical to the previous behavior. (Nándor Mátravölgyi, David Smiley)

Proposed SOLR CHANGES.txt:

Improvements:
LUCENE-9093: The Unified highlighter has two new passage sizing parameters, hl.fragAlignRatio and hl.fragsizeIsMinimum, with defaults that aim to better center matches in fragments than previously. See the ref guide. Regardless of the settings, the passages may be sized differently than before. (Nándor Mátravölgyi, David Smiley)
[jira] [Commented] (LUCENE-9093) Unified highlighter with word separator never gives context to the left
[ https://issues.apache.org/jira/browse/LUCENE-9093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002839#comment-17002839 ]

David Smiley commented on LUCENE-9093:

In this situation, do it in two phases. Phase 1 is what you have already and would be merged to both branches. Phase 2 changes the default and only merges to master branch. This can all happen under this issue ID.
[jira] [Commented] (LUCENE-9093) Unified highlighter with word separator never gives context to the left
[ https://issues.apache.org/jira/browse/LUCENE-9093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002807#comment-17002807 ]

Lucene/Solr QA commented on LUCENE-9093:

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
|| || || || Prechecks ||
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| || || || master Compile Tests ||
| +1 | compile | 1m 30s | master passed |
|| || || || Patch Compile Tests ||
| +1 | compile | 1m 32s | the patch passed |
| +1 | javac | 1m 32s | the patch passed |
| +1 | Release audit (RAT) | 0m 32s | the patch passed |
| -1 | Check forbidden APIs | 0m 23s | Check forbidden APIs check-forbidden-apis failed |
| -1 | Validate source patterns | 0m 23s | Check forbidden APIs check-forbidden-apis failed |
| -1 | Validate ref guide | 0m 23s | Check forbidden APIs check-forbidden-apis failed |
|| || || || Other Tests ||
| +1 | unit | 0m 40s | highlighter in the patch passed. |
| +1 | unit | 45m 59s | core in the patch passed. |
| +1 | unit | 5m 37s | solrj in the patch passed. |
| | | 56m 20s | |

|| Subsystem || Report/Notes ||
| JIRA Issue | LUCENE-9093 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12989051/LUCENE-9093.patch |
| Optional Tests | compile javac unit ratsources checkforbiddenapis validatesourcepatterns validaterefguide |
| uname | Linux lucene1-us-west 4.15.0-54-generic #58-Ubuntu SMP Mon Jun 24 10:55:24 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | ant |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-LUCENE-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh |
| git revision | master / 72c99e921c4 |
| ant | version: Apache Ant(TM) version 1.10.5 compiled on March 28 2019 |
| Default Java | LTS |
| Check forbidden APIs | https://builds.apache.org/job/PreCommit-LUCENE-Build/245/artifact/out/patch-check-forbidden-apis-root.txt |
| Validate source patterns | https://builds.apache.org/job/PreCommit-LUCENE-Build/245/artifact/out/patch-check-forbidden-apis-root.txt |
| Validate ref guide | https://builds.apache.org/job/PreCommit-LUCENE-Build/245/artifact/out/patch-check-forbidden-apis-root.txt |
| Test Results | https://builds.apache.org/job/PreCommit-LUCENE-Build/245/testReport/ |
| modules | C: lucene lucene/highlighter solr/core solr/solrj solr/solr-ref-guide U: . |
| Console output | https://builds.apache.org/job/PreCommit-LUCENE-Build/245/console |
| Powered by | Apache Yetus 0.7.0 http://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (LUCENE-9093) Unified highlighter with word separator never gives context to the left
[ https://issues.apache.org/jira/browse/LUCENE-9093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002740#comment-17002740 ]

Nándor Mátravölgyi commented on LUCENE-9093:

How should I make pull requests with the different default fragalign [~dsmiley] ?

My guess is that the PR to master should have the default fragalign of 0.5 (and modified docs), while I also make a PR to the 8x and 7x branch with the original patch. This way after the master is accepted and merged to the others, their PR can be accepted and cherry picked on the differences. I'll wait for your input on this.
[jira] [Commented] (LUCENE-9093) Unified highlighter with word separator never gives context to the left
[ https://issues.apache.org/jira/browse/LUCENE-9093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002548#comment-17002548 ]

Nándor Mátravölgyi commented on LUCENE-9093:

I could look into making this a GitHub PR tomorrow... I'll change the default fragalign to 0.5 as well.

It also works in SENTENCE mode, but the results won't be as accurate in some cases. Let me elaborate.

In any mode, the selected BreakIterator (WORD, SEPARATOR, SENTENCE, etc.) decides where a slice can happen. The first slice always contains the match. The LengthGoalBreakIterator decides on which side of the first slice the selected BI should add more slices. The logic is generic and works regardless of the underlying BI. Since the snippet is grown until it reaches fragsize, the size of the last slice added determines how big the overshoot is.

Examples in SENTENCE mode:

Example text: _Hello Susan! I cannot believe the weather is unreal again! The sky is green. I hope Mrs Smith will bring an umbrella for the picnic. Let's not panic._

# If the fragsize is smaller than the first slice (a sentence in this case), no expansion will happen in either direction. Note that fragalign is N/A in this case.
{noformat}
q=sky=0.5=10 makes snippet length of 17
The sky is green.{noformat}
# If the fragsize is bigger than the first slice and the fragalign is 0.5, the slice will be expanded on the left first and then on the right if any space is left.
{noformat}
q=sky=0.5=30 makes snippet length of 63
I cannot believe the weather is unreal again! The sky is green.

q=sky=0.5=80 makes snippet length of 119
I cannot believe the weather is unreal again! The sky is green. I hope Mrs Smith will bring an umbrella for the picnic.

q=sky=0.5=120 makes snippet length of 132
Hello Susan! I cannot believe the weather is unreal again! The sky is green. I hope Mrs Smith will bring an umbrella for the picnic.{noformat}
# If the fragsize is bigger than the first slice and the fragalign is 0, the slice will be expanded on the right only. (the match is anchored to 0/left/begin)
{noformat}
q=sky=0.0=30 makes snippet length of 73
The sky is green. I hope Mrs Smith will bring an umbrella for the picnic.

q=sky=0.0=80 makes snippet length of 90
The sky is green. I hope Mrs Smith will bring an umbrella for the picnic. Let's not panic.{noformat}
# If the fragsize is bigger than the first slice and the fragalign is 1, the slice will be expanded on the left only. (the match is anchored to 1/right/end)
{noformat}
q=sky=1.0=30 makes snippet length of 63
I cannot believe the weather is unreal again! The sky is green.

q=sky=1.0=70 makes snippet length of 76
Hello Susan! I cannot believe the weather is unreal again! The sky is green.{noformat}

In the above examples there are big overshoots of the fragsize: 63 instead of 30 (+110%) and 119 instead of 80 (+49%). These would also occur if the fragalign were 0.1, but the alignment would be even less accurate in cases where the left expansion overshoots:
{noformat}
q=sky=0.1=30 makes snippet length of 63
I cannot believe the weather is unreal again! The sky is green.{noformat}
This is because the order of expansion is strictly left first. I guess this could be improved if so desired.

In summary, to ensure the accuracy of the fragsize & fragalign parameters, they have to be proportional to the approximate size of the slices. Here's how the worst expected overshoot can be calculated:
{noformat}
float WorstOvershootPercent(float fragsize, float avgSliceLength) {
  return ((((fragsize-1)+avgSliceLength) / fragsize)-1)*100;
}

WORD: (words are usually 12-25 characters at most)
WorstOvershootPercent(15, 12)     => 73.34%
WorstOvershootPercent(100, 25)    => 24.00%
WorstOvershootPercent(300, 25)    => 8.00%

SENTENCE: (a sentence can be very long)
WorstOvershootPercent(300, 300)   => 99.66%
WorstOvershootPercent(300, 500)   => 166.34%
WorstOvershootPercent(2000, 300)  => 14.95%
WorstOvershootPercent(2000, 500)  => 24.95%{noformat}
The other highlighters have similar rules for this. The only thing that can easily improve this in some cases is to search for the length closest to the fragsize instead of the minimum. The LengthGoalBreakIterator has a closestTo-mode, but it's not usable because it would require yet another parameter. ([view on github|https://github.com/apache/lucene-solr/blob/1be5b689640fe4d1bf0ae3fd19c5fe93b20a77ef/solr/core/src/java/org/apache/solr/highlight/UnifiedSolrHighlighter.java#L330]) Using that mode could produce an undershoot that is closer to the desired size than the overshoot.
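The worst-case overshoot estimate from the comment above can be written as a small runnable Java sketch. It assumes the worst case is a passage that is one character short of fragsize when one more slice of average length is appended. The class and method names are hypothetical and are not part of Lucene or Solr.

```java
import java.util.Locale;

/** Hypothetical helper for illustration only; not part of Lucene or Solr. */
final class OvershootSketch {
    private OvershootSketch() {}

    /**
     * Worst expected overshoot of fragsize, in percent, assuming the passage is
     * one character short of fragsize when a final slice of avgSliceLength
     * characters is appended.
     */
    static double worstOvershootPercent(double fragsize, double avgSliceLength) {
        if (fragsize <= 0) {
            throw new IllegalArgumentException("fragsize must be positive");
        }
        double worstLength = (fragsize - 1) + avgSliceLength;
        return (worstLength / fragsize - 1) * 100;
    }

    public static void main(String[] args) {
        // Inputs taken from the WORD and SENTENCE rows in the comment above.
        double[][] cases = {{15, 12}, {100, 25}, {300, 25}, {300, 300}, {2000, 500}};
        for (double[] c : cases) {
            System.out.println(String.format(Locale.ROOT,
                "fragsize=%.0f avgSlice=%.0f -> %.2f%%",
                c[0], c[1], worstOvershootPercent(c[0], c[1])));
        }
    }
}
```

The estimate only depends on the ratio between slice length and fragsize, which is why long sentences with a small fragsize produce the largest overshoot.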
[jira] [Commented] (LUCENE-9093) Unified highlighter with word separator never gives context to the left
[ https://issues.apache.org/jira/browse/LUCENE-9093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002506#comment-17002506 ]

David Smiley commented on LUCENE-9093:

Sorry for the delay. Can you please post a PR as it's more conducive to the code review process?

I have a question about this setting. You've declared the benefits of it for a {{hl.bs.type=WORD}} but would this also be helpful for SENTENCE too? I hope so. I think in 9.0 the {{hl.fragalign}} setting should default to {{0.5}} or maybe {{0.25}}
[jira] [Commented] (LUCENE-9093) Unified highlighter with word separator never gives context to the left
[ https://issues.apache.org/jira/browse/LUCENE-9093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16998660#comment-16998660 ]

Nándor Mátravölgyi commented on LUCENE-9093:

I'm back with a patch! [^LUCENE-9093.patch]

This adds a `hl.fragalign` parameter to the Unified Highlighter. I've added a description of how it works to the docs. I've also updated the related tests. I've opted to keep the new feature backward-compatible.

From the new docs:
{noformat}
Fragment alignment can influence where the match in a passage is positioned. This floating point value is used to break the remaining `hl.fragsize` of the passage around the match. The default value of `0.0` means to align the match to the left, this is the backward-compatible setting. A value of `0.5` would mean that equal amount of text should be around the match on both sides, while `1.0` to align it to the right.
Note: there are situations where the requested alignment is not plausible. This depends on the length of the match, the used breakiterator and the text content around the match. Before the introduction of this parameter all passages had left-aligned matches. Changing the `hl.bs.type` to `WORD` and the `hl.fragalign` to `0.5` will make results that closely resemble what the other highlighters produce by default.
{noformat}

I must say that I've changed my mind about the abstraction. A proper one instead of the chained BreakIterators would be much nicer. The LengthGoalBreakIterator already had a few behavioral differences from how a generic BreakIterator works. This change makes it work even less like a BreakIterator. It should be totally fine in its specifically crafted universe. However, a better abstraction/structure would be required if we want style-points as well. The difficulty is that the chaining of the BreakIterators would need a refactor, which has a far greater scope than this issue.
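The way the alignment value divides the remaining `hl.fragsize` around the match, as described in the docs excerpt of the comment above, can be illustrated with a short Java sketch. This is a character-level simplification under stated assumptions: it ignores break positions, and the class and method names are hypothetical rather than the patch's actual code.

```java
/** Hypothetical illustration only; not the LengthGoalBreakIterator implementation. */
final class FragAlignSketch {
    private FragAlignSketch() {}

    /**
     * Splits the fragsize budget left over after the match itself.
     * Returns {leftBudget, rightBudget} in characters.
     * fragAlign 0.0 puts all context on the right (match aligned left),
     * 0.5 splits it evenly, 1.0 puts all context on the left.
     */
    static int[] splitBudget(int fragsize, int matchLength, float fragAlign) {
        if (fragAlign < 0f || fragAlign > 1f) {
            throw new IllegalArgumentException("fragAlign must be within [0, 1]");
        }
        // If the match is already longer than fragsize there is nothing to share.
        int remaining = Math.max(0, fragsize - matchLength);
        int left = Math.round(remaining * fragAlign);
        return new int[] {left, remaining - left};
    }

    public static void main(String[] args) {
        int[] centered = splitBudget(30, 4, 0.5f);
        System.out.println("left=" + centered[0] + " right=" + centered[1]);
    }
}
```

In the real highlighter the budgets would only be targets, since the underlying BreakIterator decides where a passage may actually start and end.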
[jira] [Commented] (LUCENE-9093) Unified highlighter with word separator never gives context to the left
[ https://issues.apache.org/jira/browse/LUCENE-9093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16997726#comment-16997726 ]

David Smiley commented on LUCENE-9093:

bq. I'd think trying to completely avoid the overlaps is preferable.

Ah, I agree. This would probably involve only a very minor change to {{FieldHighlighter#highlightOffsetsEnums}}.

You gave two examples of "worst edge cases" and the only variance of input was whether the fragsize was 50 or 60. I don't see that it matters which looks nicer since the search app developer can't code to every case; he/she will pick 60 vs 50 vs whatever depending on how much space is available for showing snippets. Whichever is chosen, sometimes the match alignment in the snippet will be poor (not centered at all). Anyway, to answer your subjective question of which looks nicest, I suppose the second one looks nicer: the one with fragsize 60, a single snippet, and the second "field" match at the right end.
[jira] [Commented] (LUCENE-9093) Unified highlighter with word separator never gives context to the left
[ https://issues.apache.org/jira/browse/LUCENE-9093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16996736#comment-16996736 ]

Nándor Mátravölgyi commented on LUCENE-9093:
--------------------------------------------

We have the same idea of how the chained BreakIterators could be used to align the match in a more pleasing way. I also agree that some changes to FieldHighlighter will be necessary to handle overlaps. Your suggestion for that is to also highlight the matches that were included in a previous Passage. I'd think trying to completely avoid the overlaps is preferable. That would keep the snippets from being redundant and implicitly solve the issue of needing to highlight some matches more than once.

These are examples of what the least favorable edge cases would look like when we strictly avoid overlaps but want centered match alignment. The search query is "field" and the original text is:
{noformat}
If set to false, or if there is no match in the alternate field either, the alternate field will be shown without highlighting, but could be marked by other processors.{noformat}
If the search has a fragsize around 50, the first "field" word will be aligned properly. The next one will be left-aligned because the preceding text has already been used for a passage.
{noformat}
[
  "in the alternate field either, the alternate",
  "field will be shown without highlighting, but"
]{noformat}
If the search has a fragsize around 60, the first "field" word will be aligned properly. The next one will be right-aligned because it is at the very end of the passage made for the first match.
{noformat}
[
  "match in the alternate field either, the alternate field"
]{noformat}
Now the question is: which of these is closer to what we want to see? I'd say either "worst" edge case would be much better than the constantly left-aligned matches we have currently. Note: these are close to how the other highlighters behave when they have near-boundary matches.

Regarding the question of abstraction: I've not found a reason to think we need to replace the BreakIterators with a new interface. I think the bulk of the FastVectorHighlighter's fragment builder abstraction is about tracking the matches and highlighting the terms with different styles. (Note that I've only looked through it briefly.)

Just for the sake of completeness, I'll tell you that for what I would like to do, a different concept of fragment length and snippet limit would be better. In all honesty, I want an excerpt of the document that shows valuable matches in the context of a few words around them, while the whole highlight is no longer than N characters. Right now I have the configuration fragsize=90 and snippets=3 because I want something that's not longer than 300 chars. If the highlighter could determine which differently sized fragments would yield the best excerpt, that would be the "best". A dense cluster of matches could form a 180-char fragment, while two singular matches would form two 50-char fragments. This could be better than forcing the fragments to be uniform in size.
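To make the two edge cases above easier to reason about, here is a minimal sketch of "center the match, but never overlap the previous passage". It is not the FieldHighlighter logic: the class {{NonOverlappingPassagesSketch}} and its {{passages}} method are hypothetical names, matching is a plain case-sensitive substring search, and passages are cut at raw character offsets rather than at BreakIterator boundaries, so the snippets it produces differ slightly from the word-aligned ones quoted above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch only; this is not how FieldHighlighter is implemented. */
final class NonOverlappingPassagesSketch {

  /** Returns [start, end) character ranges, one per passage, in document order. */
  static List<int[]> passages(String text, String term, int fragSize) {
    List<int[]> result = new ArrayList<>();
    if (term.isEmpty()) {
      return result; // nothing to highlight
    }
    int prevEnd = 0; // end offset of the most recent passage
    for (int m = text.indexOf(term); m >= 0; m = text.indexOf(term, m + term.length())) {
      if (m < prevEnd) {
        continue; // this match is already shown inside the previous passage
      }
      int matchEnd = m + term.length();
      // Aim for equal context on both sides of the match.
      int lead = Math.max(0, (fragSize - term.length()) / 2);
      // Never reuse text that the previous passage has consumed.
      int start = Math.max(prevEnd, m - lead);
      int end = Math.min(text.length(), Math.max(matchEnd, start + fragSize));
      result.add(new int[] {start, end});
      prevEnd = end;
    }
    return result;
  }

  public static void main(String[] args) {
    String text = "If set to false, or if there is no match in the alternate field either, "
        + "the alternate field will be shown without highlighting, "
        + "but could be marked by other processors.";
    for (int fragSize : new int[] {50, 60}) {
      System.out.println("fragsize=" + fragSize);
      for (int[] p : passages(text, "field", fragSize)) {
        System.out.println("  \"" + text.substring(p[0], p[1]) + "\"");
      }
    }
  }
}
```

With the example text, a fragsize of 50 yields two passages with the second match left-aligned, and a fragsize of 60 yields a single passage with the second match at its right end, which mirrors the two cases described above.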
[jira] [Commented] (LUCENE-9093) Unified highlighter with word separator never gives context to the left
[ https://issues.apache.org/jira/browse/LUCENE-9093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16995916#comment-16995916 ]

David Smiley commented on LUCENE-9093:
--------------------------------------

This is a very thoughtful response [~myusername8]. I'm really glad you are willing to contribute :-)

An idea I have thought of before is to try to get more leading context before the first word. Basically, compute half the fragsize as the amount of leading text we'd like (configurable ratio). Then keep looping over sub-BreakIterator calls to preceding() until we reach this target. Strictly speaking, the BreakIterator generically has no concept of a highlighting "match", but these special-purpose BreakIterators are used in the context of the UnifiedHighlighter and know that when preceding() is called, it's at the first match of a passage. WDYT?

Unfortunately, I think it would yield Passages that overlap, and subsequent Passages would not contain the matches of the previous overlapping Passages. :-/ Maybe this could be overcome by FieldHighlighter detecting this and adding the pertinent matches from the most recent Passage.

I'm aware that the use of BreakIterator is limiting, constraining our solution space. And it puts undue extra work on us to implement the JDK-defined abstraction. Perhaps, like the FVH, the UH needs its own abstraction here. CC [~romseygeek]
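The preceding() loop described in the comment above can be sketched roughly as follows. This is only an illustration under stated assumptions: {{LeadingContextSketch}}, {{passageStart}} and the {{leadingRatio}} parameter are hypothetical names, the JDK word BreakIterator stands in for whatever sub-BreakIterator would be wrapped, and nothing here is taken from Lucene's LengthGoalBreakIterator.

```java
import java.text.BreakIterator;
import java.util.Locale;

/** Hypothetical sketch only; this is not Lucene's LengthGoalBreakIterator. */
final class LeadingContextSketch {

  /**
   * Walks backwards over word boundaries from matchStart until at least
   * fragSize * leadingRatio characters of leading context are covered,
   * then skips any whitespace the walk landed on.
   */
  static int passageStart(String text, int matchStart, int fragSize, double leadingRatio) {
    BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
    words.setText(text);
    // Offset we would like the passage to reach back to (never before the text start).
    int target = Math.max(0, matchStart - (int) (fragSize * leadingRatio));
    int start = matchStart;
    while (start > target) {
      int prev = words.preceding(start);
      if (prev == BreakIterator.DONE) {
        return 0; // defensive: no earlier boundary exists
      }
      start = prev;
    }
    // Do not begin the passage on whitespace.
    while (start < matchStart && Character.isWhitespace(text.charAt(start))) {
      start++;
    }
    return start;
  }

  public static void main(String[] args) {
    String text = "one two three apple four";
    int matchStart = text.indexOf("apple");
    int start = passageStart(text, matchStart, 20, 0.5);
    System.out.println(text.substring(start));
  }
}
```

A ratio of 0.5 corresponds to "half the fragsize" of leading text. The sketch looks at one match in isolation, so it does nothing about the overlapping Passages mentioned above.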