[jira] [Commented] (LUCENE-9093) Unified highlighter with word separator never gives context to the left

2020-01-02 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17006704#comment-17006704
 ] 

ASF subversion and git services commented on LUCENE-9093:
-

Commit 4c9cc2cefd7f3593c4b4e1e5a087e3d206298989 in lucene-solr's branch 
refs/heads/gradle-master from Nándor Mátravölgyi
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=4c9cc2c ]

LUCENE-9093: UnifiedHighlighter LengthGoalBreakIterator frag align
 Matches in passages should be centered better on average.
 Closes #1123


> Unified highlighter with word separator never gives context to the left
> ---
>
> Key: LUCENE-9093
> URL: https://issues.apache.org/jira/browse/LUCENE-9093
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/highlighter
>Reporter: Tim Retout
>Assignee: David Smiley
>Priority: Major
> Fix For: 8.5
>
> Attachments: LUCENE-9093.patch
>
>  Time Spent: 8h 10m
>  Remaining Estimate: 0h
>
> When using the unified highlighter with hl.bs.type=WORD, I am not able to get 
> context to the left of the matches returned; only words to the right of each 
> match are shown.  I see this behaviour on both Solr 6.4 and Solr 7.1.
> Without context to the left of a match, the highlighted snippets are much 
> less useful for understanding where the match appears in a document.
> As an example, using the techproducts data with Solr 7.1, given a search for 
> "apple", highlighting the "features" field:
> http://localhost:8983/solr/techproducts/select?hl.fl=features&hl=on&q=apple&hl.bs.type=WORD&hl.fragsize=30&hl.method=unified
> I see this snippet:
> "Apple Lossless, H.264 video"
> Note that "Apple" is anchored to the left.  Compare with the original 
> highlighter:
> http://localhost:8983/solr/techproducts/select?hl.fl=features&hl=on&q=apple&hl.fragsize=30
> And the match has context either side:
> ", Audible, Apple Lossless, H.264 video"
> (To complicate this, in general I am not sure that the unified highlighter is 
> respecting the hl.fragsize parameter, although [SOLR-9935] suggests support 
> was added.  I included the hl.fragsize param in the unified URL too, but it's 
> making no difference unless set to 0.)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9093) Unified highlighter with word separator never gives context to the left

2019-12-31 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17006316#comment-17006316
 ] 

ASF subversion and git services commented on LUCENE-9093:
-

Commit 5874b9c7933233712da14c5a5b9bb4f916eb77f8 in lucene-solr's branch 
refs/heads/branch_8x from Nándor Mátravölgyi
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=5874b9c ]

LUCENE-9093: UnifiedHighlighter LengthGoalBreakIterator frag align
 Matches in passages should be centered better on average.
 Closes #1123

(cherry picked from commit 4c9cc2cefd7f3593c4b4e1e5a087e3d206298989)





[jira] [Commented] (LUCENE-9093) Unified highlighter with word separator never gives context to the left

2019-12-31 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17006312#comment-17006312
 ] 

ASF subversion and git services commented on LUCENE-9093:
-

Commit 4c9cc2cefd7f3593c4b4e1e5a087e3d206298989 in lucene-solr's branch 
refs/heads/master from Nándor Mátravölgyi
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=4c9cc2c ]

LUCENE-9093: UnifiedHighlighter LengthGoalBreakIterator frag align
 Matches in passages should be centered better on average.
 Closes #1123





[jira] [Commented] (LUCENE-9093) Unified highlighter with word separator never gives context to the left

2019-12-30 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005403#comment-17005403
 ] 

David Smiley commented on LUCENE-9093:
--

The PR is concluding and I want to summarize here:

Proposed LUCENE CHANGES.txt:

(two places so we draw attention to something)

API Changes:

LUCENE-9093: Not an API change but a change in behavior of the 
UnifiedHighlighter's LengthGoalBreakIterator that will yield Passages sized a 
little differently because the sizing pivot is now the center of the 
first match and not its left edge.

Improvements:

LUCENE-9093: UnifiedHighlighter's LengthGoalBreakIterator has a new 
fragmentAlignment option to better center the first match in the passage.  
Also the sizing point now pivots at the center of the first match term and not 
its left edge.  This yields Passages that won't be identical to the previous 
behavior. (Nándor Mátravölgyi, David Smiley)

Proposed SOLR CHANGES.txt:

Improvements:

LUCENE-9093: The Unified highlighter has two new passage sizing parameters, 
hl.fragAlignRatio and hl.fragsizeIsMinimum, with defaults that aim to better 
center matches in fragments than previously.  See the ref guide.  Regardless of 
the settings, the passages may be sized differently than before. (Nándor 
Mátravölgyi, David Smiley)
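
To illustrate the pivot change summarized above, here is a minimal sketch in Java. It is not the LengthGoalBreakIterator code: the class PivotSketch, its methods and the example offsets are hypothetical, and it assumes the ideal passage window is simply fragsize characters split around the pivot by the alignment ratio, before any snapping to break boundaries.

```java
// Hypothetical illustration only; not taken from Lucene.
final class PivotSketch {
    /** Old behaviour as described: size the passage from the match's left edge. */
    static int leftEdgePivot(int matchStart, int matchEnd) {
        return matchStart;
    }

    /** New behaviour as described: size the passage from the center of the match. */
    static int centerPivot(int matchStart, int matchEnd) {
        return matchStart + (matchEnd - matchStart) / 2;
    }

    /**
     * Ideal [start, end) window of fragsize characters around the pivot.
     * ratio 0.0 puts the pivot at the window's left end, 1.0 at its right end.
     * Assumption: no clamping to the text and no snapping to word or sentence breaks.
     */
    static int[] window(int pivot, int fragsize, float ratio) {
        int start = pivot - Math.round(fragsize * ratio);
        return new int[] {start, start + fragsize};
    }

    public static void main(String[] args) {
        int matchStart = 100, matchEnd = 105; // a 5-character match such as "Apple"
        int[] oldWin = window(leftEdgePivot(matchStart, matchEnd), 30, 0.5f);
        int[] newWin = window(centerPivot(matchStart, matchEnd), 30, 0.5f);
        System.out.println("left-edge pivot: " + oldWin[0] + ".." + oldWin[1]);
        System.out.println("center pivot:    " + newWin[0] + ".." + newWin[1]);
    }
}
```

With these made-up offsets the window moves from 85..115 to 87..117, that is, by half the match length. This is why passages may be sized slightly differently than before even when no parameter is changed.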




[jira] [Commented] (LUCENE-9093) Unified highlighter with word separator never gives context to the left

2019-12-24 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002839#comment-17002839
 ] 

David Smiley commented on LUCENE-9093:
--

In this situation, do it in two phases.  Phase 1 is what you have already and 
would be merged to both branches.  Phase 2 changes the default and only merges 
to master branch.  This can all happen under this issue ID.




[jira] [Commented] (LUCENE-9093) Unified highlighter with word separator never gives context to the left

2019-12-24 Thread Lucene/Solr QA (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002807#comment-17002807
 ] 

Lucene/Solr QA commented on LUCENE-9093:


| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m  0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 30s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 32s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 32s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Release audit (RAT) {color} | {color:green}  0m 32s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} Check forbidden APIs {color} | {color:red}  0m 23s{color} | {color:red} Check forbidden APIs check-forbidden-apis failed {color} |
| {color:red}-1{color} | {color:red} Validate source patterns {color} | {color:red}  0m 23s{color} | {color:red} Check forbidden APIs check-forbidden-apis failed {color} |
| {color:red}-1{color} | {color:red} Validate ref guide {color} | {color:red}  0m 23s{color} | {color:red} Check forbidden APIs check-forbidden-apis failed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  0m 40s{color} | {color:green} highlighter in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 45m 59s{color} | {color:green} core in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  5m 37s{color} | {color:green} solrj in the patch passed. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 56m 20s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | LUCENE-9093 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12989051/LUCENE-9093.patch |
| Optional Tests |  compile  javac  unit  ratsources  checkforbiddenapis  validatesourcepatterns  validaterefguide  |
| uname | Linux lucene1-us-west 4.15.0-54-generic #58-Ubuntu SMP Mon Jun 24 10:55:24 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | ant |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-LUCENE-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh |
| git revision | master / 72c99e921c4 |
| ant | version: Apache Ant(TM) version 1.10.5 compiled on March 28 2019 |
| Default Java | LTS |
| Check forbidden APIs | https://builds.apache.org/job/PreCommit-LUCENE-Build/245/artifact/out/patch-check-forbidden-apis-root.txt |
| Validate source patterns | https://builds.apache.org/job/PreCommit-LUCENE-Build/245/artifact/out/patch-check-forbidden-apis-root.txt |
| Validate ref guide | https://builds.apache.org/job/PreCommit-LUCENE-Build/245/artifact/out/patch-check-forbidden-apis-root.txt |
| Test Results | https://builds.apache.org/job/PreCommit-LUCENE-Build/245/testReport/ |
| modules | C: lucene lucene/highlighter solr/core solr/solrj solr/solr-ref-guide U: . |
| Console output | https://builds.apache.org/job/PreCommit-LUCENE-Build/245/console |
| Powered by | Apache Yetus 0.7.0   http://yetus.apache.org |

This message was automatically generated.




[jira] [Commented] (LUCENE-9093) Unified highlighter with word separator never gives context to the left

2019-12-24 Thread Nándor Mátravölgyi (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002740#comment-17002740
 ] 

Nándor Mátravölgyi commented on LUCENE-9093:


[~dsmiley], how should I make the pull requests, given the different default 
fragalign?

My guess is that the PR to master should have the default fragalign of 0.5 (and 
modified docs), while I also make PRs to the 8x and 7x branches with the 
original patch. This way, after the master PR is accepted and merged to the 
others, their PRs can be accepted and the differences cherry-picked.

I'll wait for your input on this.




[jira] [Commented] (LUCENE-9093) Unified highlighter with word separator never gives context to the left

2019-12-23 Thread Nándor Mátravölgyi (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002548#comment-17002548
 ] 

Nándor Mátravölgyi commented on LUCENE-9093:


I could look into making this a github PR tomorrow... I'll change the default 
fragalign to 0.5 as well.

It also works in SENTENCE mode, but the results won't be as accurate in some 
cases. Let me elaborate.

In any mode the selected BreakIterator (WORD, SEPARATOR, SENTENCE, etc.) decides 
where a slice can happen. The first slice always contains the 
match. The LengthGoalBreakIterator decides on which side of the first slice 
the selected BI should add more slices. The logic is generic and works 
regardless of the underlying BI. Since the snippet is grown until it 
reaches fragsize, the size of the last slice added determines how big 
the overshoot is. Examples in SENTENCE mode:

Example text: _Hello Susan! I cannot believe the weather is unreal again! The 
sky is green. I hope Mrs Smith will bring an umbrella for the picnic. Let's not 
panic._
 # If the fragsize is smaller than the first slice (sentence in this case), no 
expansion will happen in either direction. Note that fragalign is N/A in this 
case.

{noformat}
q=sky&hl.fragalign=0.5&hl.fragsize=10 makes snippet length of 17
The sky is green.{noformat}

 # If the fragsize is bigger than the first slice and the fragalign is 0.5, the 
slice will be expanded on the left first and then on the right if any space is 
left.

{noformat}
q=sky&hl.fragalign=0.5&hl.fragsize=30 makes snippet length of 63
I cannot believe the weather is unreal again! The sky is green.

q=sky&hl.fragalign=0.5&hl.fragsize=80 makes snippet length of 119
I cannot believe the weather is unreal again! The sky is green. I hope 
Mrs Smith will bring an umbrella for the picnic.

q=sky&hl.fragalign=0.5&hl.fragsize=120 makes snippet length of 132
Hello Susan! I cannot believe the weather is unreal again! The sky is 
green. I hope Mrs Smith will bring an umbrella for the picnic.{noformat}

 # If the fragsize is bigger than the first slice and the fragalign is 0, the 
slice will be expanded on the right only. (the match is anchored to 
0/left/begin)

{noformat}
q=sky&hl.fragalign=0.0&hl.fragsize=30 makes snippet length of 73
The sky is green. I hope Mrs Smith will bring an umbrella for the picnic.

q=sky&hl.fragalign=0.0&hl.fragsize=80 makes snippet length of 90
The sky is green. I hope Mrs Smith will bring an umbrella for the 
picnic. Let's not panic.{noformat}

 # If the fragsize is bigger than the first slice and the fragalign is 1, the 
slice will be expanded on the left only. (the match is anchored to 1/right/end)

{noformat}
q=sky&hl.fragalign=1.0&hl.fragsize=30 makes snippet length of 63
I cannot believe the weather is unreal again! The sky is green.

q=sky&hl.fragalign=1.0&hl.fragsize=70 makes snippet length of 76
Hello Susan! I cannot believe the weather is unreal again! The sky is 
green.{noformat}

In the above examples there are big overshoots of the fragsize: 63 instead of 
30 (+110%) and 119 instead of 80 (+49%). These would also occur if the 
fragalign were 0.1, but the alignment would be even less accurate in cases 
where the left expansion overshoots:
{noformat}
q=sky&hl.fragalign=0.1&hl.fragsize=30 makes snippet length of 63
I cannot believe the weather is unreal again! The sky is green.{noformat}
This is because the order of expansion is strictly left first. I guess this 
could be improved if so desired.
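
The left-first expansion described above can be modelled with a short Java sketch. This is a simplified model and not the code from the patch: the class FragAlignSketch and its method are hypothetical, slices are given as a pre-split list of sentences, the whole first slice (rather than the match inside it) is treated as the pivot, and once the left side runs out of slices the right side may still grow. Under these assumptions it yields the snippet lengths quoted in the examples above.

```java
import java.util.List;

// Simplified, hypothetical model of the left-first expansion; not Lucene code.
final class FragAlignSketch {
    // The example text from this comment, pre-split into sentence slices.
    static final List<String> EXAMPLE = List.of(
        "Hello Susan!",
        "I cannot believe the weather is unreal again!",
        "The sky is green.",
        "I hope Mrs Smith will bring an umbrella for the picnic.",
        "Let's not panic.");

    /** Grows a snippet around slices.get(matchSlice) until it reaches fragsize. */
    static String snippet(List<String> slices, int matchSlice, int fragsize, float fragalign) {
        int lo = matchSlice;
        int hi = matchSlice;
        int total = slices.get(matchSlice).length();
        if (total < fragsize) {
            // Share of the remaining length that should go to the left of the first slice.
            float leftGoal = (fragsize - total) * fragalign;
            int leftAdded = 0;
            // Strictly left first: add whole slices until the left share is covered.
            while (lo > 0 && leftAdded < leftGoal) {
                lo--;
                int added = slices.get(lo).length() + 1; // +1 for the joining space
                leftAdded += added;
                total += added;
            }
            // Then spend whatever is still missing on the right.
            while (hi < slices.size() - 1 && total < fragsize) {
                hi++;
                total += slices.get(hi).length() + 1;
            }
        }
        return String.join(" ", slices.subList(lo, hi + 1));
    }

    public static void main(String[] args) {
        // Slice 2 ("The sky is green.") holds the match for q=sky.
        for (int fragsize : new int[] {10, 30, 80, 120}) {
            String s = snippet(EXAMPLE, 2, fragsize, 0.5f);
            System.out.println(fragsize + " -> " + s.length() + ": " + s);
        }
    }
}
```

Because whole slices are added, a single left step can already exceed fragsize, which is the 63-instead-of-30 case: the left goal is only a few characters, yet the smallest unit that can be added is a 45-character sentence.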

In summary, to ensure the accuracy of fragsize & fragalign parameters, they 
have to be proportional to the approximate size of the slices. Here's how the 
worst expected overshoot can be calculated:
{noformat}
float WorstOvershootPercent(float fragsize, float avgSliceLength) {
  return ((((fragsize-1)+avgSliceLength) / fragsize)-1)*100;
}

WORD: (words are usually 12-25 characters at most)
WorstOvershootPercent(15, 12)    =>  73.34%
WorstOvershootPercent(100, 25)   =>  24.00%
WorstOvershootPercent(300, 25)   =>   8.00%

SENTENCE: (a sentence can be very long)
WorstOvershootPercent(300, 300)  =>  99.66%
WorstOvershootPercent(300, 500)  => 166.34%
WorstOvershootPercent(2000, 300) =>  14.95%
WorstOvershootPercent(2000, 500) =>  24.95%{noformat}
The other highlighters have similar rules for this. The only thing that can 
easily improve this in some cases is to search for the length closest to the 
fragsize instead of the minimum. The LengthGoalBreakIterator has a 
closestTo-mode, but it's not usable because it would require yet another 
parameter. ([view on 
github|https://github.com/apache/lucene-solr/blob/1be5b689640fe4d1bf0ae3fd19c5fe93b20a77ef/solr/core/src/java/org/apache/solr/highlight/UnifiedSolrHighlighter.java#L330])

Using that mode could make an undershoot that is closer to the desired size 
than the overshoot.
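
The difference between the two modes can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not the LengthGoalBreakIterator implementation: the class LengthGoalModes and its methods are hypothetical, and the candidate passage ends are passed in as a sorted array of lengths instead of coming from a BreakIterator.

```java
// Hypothetical illustration of "minimum" versus "closest" length goals; not Lucene code.
final class LengthGoalModes {
    /** Minimum mode: the first candidate length that reaches the goal, so it can only overshoot. */
    static int minimumEnd(int[] sortedEnds, int goal) {
        for (int end : sortedEnds) {
            if (end >= goal) {
                return end;
            }
        }
        return sortedEnds[sortedEnds.length - 1]; // text too short: take everything
    }

    /** Closest mode: the candidate length nearest to the goal; ties go to the shorter one. */
    static int closestEnd(int[] sortedEnds, int goal) {
        int best = sortedEnds[0];
        for (int end : sortedEnds) {
            if (Math.abs(end - goal) < Math.abs(best - goal)) {
                best = end;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Snippet lengths available when growing to the right of "The sky is green."
        int[] ends = {17, 73, 90};
        System.out.println("goal 30: minimum=" + minimumEnd(ends, 30) + " closest=" + closestEnd(ends, 30));
        System.out.println("goal 80: minimum=" + minimumEnd(ends, 80) + " closest=" + closestEnd(ends, 80));
    }
}
```

For a goal of 80 the minimum mode returns 90 while the closest mode returns 73, an undershoot of 7 instead of an overshoot of 10. For a goal of 30 the closest mode stays at 17, which shows the trade-off: the snippet can end up well below the requested size.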


[jira] [Commented] (LUCENE-9093) Unified highlighter with word separator never gives context to the left

2019-12-23 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002506#comment-17002506
 ] 

David Smiley commented on LUCENE-9093:
--

Sorry for the delay.  Can you please post a PR as it's more conducive to the 
code review process?

I have a question about this setting.  You've described its benefits for 
{{hl.bs.type=WORD}}, but would it also be helpful for SENTENCE?  I hope so.

I think in 9.0 the {{hl.fragalign}} setting should default to {{0.5}} or maybe 
{{0.25}}




[jira] [Commented] (LUCENE-9093) Unified highlighter with word separator never gives context to the left

2019-12-17 Thread Nándor Mátravölgyi (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16998660#comment-16998660
 ] 

Nándor Mátravölgyi commented on LUCENE-9093:


I'm back with a patch! [^LUCENE-9093.patch]

This adds a `hl.fragalign` parameter to the Unified Highlighter. I've added a 
description about it in the docs on how it works. I've also updated the related 
tests. I've opted to keep the new feature backward-compatible. From the new 
docs:
{noformat}
Fragment alignment can influence where the match in a passage is positioned. 
This floating point value is used to break the remaining `hl.fragsize` of the 
passage around the match. The default value of `0.0` means to align the match 
to the left, this is the backward-compatible setting. A value of `0.5` would 
mean that equal amount of text should be around the match on both sides, while 
`1.0` to align it to the right. Note: there are situations where the requested 
alignment is not plausible. This depends on the length of the match, the used 
breakiterator and the text content around the match.

Before the introduction of this parameter all passages had left-aligned 
matches. Changing the `hl.bs.type` to `WORD` and the `hl.fragalign` to `0.5` 
will make results that closely resemble what the other highlighters produce by 
default.
{noformat}
I must say that I've changed my mind about the abstraction. A proper one 
instead of the chained BreakIterators would be much nicer. The 
LengthGoalBreakIterator already had a few behavioral differences from how a 
generic BreakIterator works. This change makes it work even less like a 
BreakIterator. It should be totally fine in its specifically crafted universe. 
However, a better abstraction/structure would be required if we want 
style points as well. The difficulty is that the chaining of the BreakIterators 
would need a refactor, which has a far greater scope than this issue.




[jira] [Commented] (LUCENE-9093) Unified highlighter with word separator never gives context to the left

2019-12-16 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16997726#comment-16997726
 ] 

David Smiley commented on LUCENE-9093:
--

bq. I'd think trying to completely avoid the overlaps is preferable. 

Ah, I agree.  This would probably involve only a very minor change to 
{{FieldHighlighter#highlightOffsetsEnums}}.

You gave two examples of "worst edge cases" and the only variance of input was 
whether the fragsize was 50 or 60.  I don't see that it matters which looks 
nicer since the search app developer can't code to every case; he/she will pick 
60 vs 50 vs whatever depending on how much space is available for showing 
snippets.  Whichever is chosen, sometimes the match alignment in the snippet 
will be poor (not centered at all).  Anyway, to answer your subjective question 
of which looks nicest, I suppose the second one looks nicer -- the one with 60, 
one snippet, and the second field match at the right end.





[jira] [Commented] (LUCENE-9093) Unified highlighter with word separator never gives context to the left

2019-12-15 Thread Nándor Mátravölgyi (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16996736#comment-16996736
 ] 

Nándor Mátravölgyi commented on LUCENE-9093:


We have the same idea of how chained BreakIterators could be used to align the 
match in a more pleasing way. I also agree that some changes to 
FieldHighlighter will be necessary to handle overlaps. Your suggestion for that 
is to also highlight the matches that were included in a previous Passage. I'd 
think trying to completely avoid the overlaps is preferable. That would keep 
the snippets from being redundant and implicitly solve the issue of needing to 
highlight some matches more than once.

These are examples of what the least favorable edge cases would look like when 
we strictly avoid overlaps, but want to have centered match alignment. The 
search query is "field" and the original text is:

 
{noformat}
If set to false, or if there is no match in the alternate field either, the 
alternate field will be shown without highlighting, but could be marked by 
other processors.{noformat}
If the search has fragsize around 50 the first "field" word will be aligned 
properly. The next one will be left-aligned because the preceding text has 
already been used for a passage.

 

 
{noformat}
[
  "in the alternate field either, the alternate",
  "field will be shown without highlighting, but"
]{noformat}
 

If the search has fragsize around 60 the first "field" word will be aligned 
properly. The next one will be right-aligned because it is at the very end of 
the passage made for the first match.
{noformat}
[
 "match in the alternate field either, the alternate field"
]{noformat}
Now the question is: which of these is closer to what we want to see? I'd say 
either "worst" edge case would be much better than the constantly left-aligned 
matches we have currently. Note: these are close to how the other highlighters 
behave when they have near-boundary matches.
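
The two edge cases above can be reproduced with a minimal sketch. It is 
character-based rather than BreakIterator-based, and the class and method names 
are hypothetical (this is not FieldHighlighter code): each passage tries to 
center its first match, never starts before the end of the previous passage, 
and a match that already falls inside the previous passage does not open a new 
one.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not part of Lucene. */
class NonOverlappingPassagesSketch {

  /** Returns [start, end) offsets of passages that never overlap each other. */
  static List<int[]> passages(String text, String term, int fragsize) {
    if (term.isEmpty()) {
      throw new IllegalArgumentException("term must not be empty");
    }
    List<int[]> out = new ArrayList<>();
    int prevEnd = 0;
    // Leading context that would center the match inside the fragment.
    int lead = Math.max(0, (fragsize - term.length()) / 2);
    for (int m = text.indexOf(term); m >= 0; m = text.indexOf(term, m + term.length())) {
      int matchEnd = m + term.length();
      if (m < prevEnd) {
        // The match starts inside the previous passage: keep it there,
        // stretching that passage if the match would otherwise be cut.
        if (matchEnd > prevEnd) {
          out.get(out.size() - 1)[1] = matchEnd;
          prevEnd = matchEnd;
        }
        continue;
      }
      // Never reach back into text that an earlier passage already used.
      int start = Math.max(prevEnd, m - lead);
      int end = Math.min(text.length(), Math.max(matchEnd, start + fragsize));
      out.add(new int[] {start, end});
      prevEnd = end;
    }
    return out;
  }

  public static void main(String[] args) {
    String text = "If set to false, or if there is no match in the alternate field "
        + "either, the alternate field will be shown without highlighting, but "
        + "could be marked by other processors.";
    for (int fragsize : new int[] {50, 60}) {
      System.out.println("fragsize=" + fragsize);
      for (int[] p : passages(text, "field", fragsize)) {
        System.out.println("  \"" + text.substring(p[0], p[1]) + "\"");
      }
    }
  }
}
```

On the example text this should give two passages for fragsize 50, with the 
second one starting exactly at the second "field", and a single passage for 
fragsize 60 that ends with the second "field". The cut points differ slightly 
from the snippets above because no word boundaries are applied.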

Regarding the question of abstraction: I've not found a reason to think we need 
to replace the BreakIterators with a new interface. I think the bulk of the 
FastVectorHighlighter's fragment builder abstraction is about tracking the 
matches and highlighting the terms with different styles. (Note: I've only 
looked through it briefly.)

Just for the sake of completeness, I'll tell you that for what I would like to 
do, a different concept of fragment length and snippet limit would be better. 
In all honesty I want an excerpt of the document that shows valuable matches in 
the context of a few words around them, while the whole highlight is no longer 
than N characters. Right now I have the configuration of fragsize=90 and 
snippets=3 because I want something that's not longer than 300 chars. If the 
highlighter could determine what differently sized fragments would yield the 
best excerpt, that would be the "best". A dense cluster of matches could form a 
180-char fragment while two singular matches would form two 50-char fragments. 
This could be better than forcing the fragments to be uniform in size.
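
To make the idea of differently sized fragments under one total limit more 
concrete, here is a rough sketch. Everything in it is an assumption for 
illustration (the class name, the fixed per-match context, and the greedy 
document-order selection); no highlighter works this way today. Matches whose 
context windows touch are merged into one larger fragment, and fragments are 
accepted while the total length stays within the budget.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not part of Lucene. */
class BudgetedFragmentsSketch {

  /**
   * @param matchStarts match start offsets, assumed sorted ascending
   * @param termLen     length of each match (assumed equal for simplicity)
   * @param context     characters of context wanted on each side of a match
   * @param textLen     length of the document text
   * @param budget      maximum total length of all returned fragments
   * @return [start, end) offsets of the accepted fragments
   */
  static List<int[]> fragments(int[] matchStarts, int termLen, int context,
                               int textLen, int budget) {
    // Step 1: merge matches whose context windows overlap or touch.
    List<int[]> clusters = new ArrayList<>();
    for (int m : matchStarts) {
      int start = Math.max(0, m - context);
      int end = Math.min(textLen, m + termLen + context);
      int[] last = clusters.isEmpty() ? null : clusters.get(clusters.size() - 1);
      if (last != null && start <= last[1]) {
        last[1] = Math.max(last[1], end); // dense cluster grows into one fragment
      } else {
        clusters.add(new int[] {start, end});
      }
    }
    // Step 2: accept clusters in document order while the budget allows.
    List<int[]> out = new ArrayList<>();
    int used = 0;
    for (int[] c : clusters) {
      int len = c[1] - c[0];
      if (used + len <= budget) {
        out.add(c);
        used += len;
      }
    }
    return out;
  }

  public static void main(String[] args) {
    int[] matches = {100, 120, 150, 400, 700};
    for (int[] f : fragments(matches, 5, 20, 1000, 300)) {
      System.out.println(f[0] + ".." + f[1] + " (" + (f[1] - f[0]) + " chars)");
    }
  }
}
```

A real implementation would presumably rank clusters by score instead of taking 
them in document order, and would snap the cut points to BreakIterator 
boundaries.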


[jira] [Commented] (LUCENE-9093) Unified highlighter with word separator never gives context to the left

2019-12-13 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16995916#comment-16995916
 ] 

David Smiley commented on LUCENE-9093:
--

This is a very thoughtful response [~myusername8].  I'm really glad you are 
willing to contribute :-)

An idea I have thought of before is to try to get more leading context before 
the first word.  Basically compute half the fragsize as the amount of leading 
text we'd like (configurable ratio).  Then keep looping over sub-BreakIterator 
calls to preceding() until we reach this target.  Strictly speaking, the 
BreakIterator generically has no concept of a highlighting "match" but these 
special-purpose BreakIterators are used in the context of the 
UnifiedHighlighter and know that when preceding() is called, it's at the first 
match of a passage.  WDYT?  Unfortunately I think it would yield Passages that 
overlap, and that subsequent Passages would not contain the matches of the 
previous overlapping passages. :-/. Maybe this could be overcome by 
FieldHighlighter detecting this and adding the pertinent matches from the most 
recent Passage. 
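
A minimal sketch of the preceding() loop described above, using only 
java.text.BreakIterator. The class name, method name, and the {{floor}} 
parameter (the end of the previous passage, used to avoid overlap) are 
assumptions made for illustration and are not UnifiedHighlighter API:

```java
import java.text.BreakIterator;
import java.util.Locale;

/** Hypothetical illustration only; not part of Lucene. */
class LeadingContextSketch {

  /**
   * Walks backwards from matchStart over the boundaries of the given
   * BreakIterator until roughly fragsize * ratio characters of leading text
   * are collected. The result never drops below floor.
   */
  static int passageStart(BreakIterator bi, int matchStart, int fragsize,
                          double ratio, int floor) {
    int target = Math.max(floor, matchStart - (int) (fragsize * ratio));
    int start = matchStart;
    while (start > target) {
      int prev = bi.preceding(start);
      if (prev == BreakIterator.DONE || prev < target) {
        break; // one more step back would exceed the leading-context goal
      }
      start = prev;
    }
    return start;
  }

  public static void main(String[] args) {
    String text = "If set to false, or if there is no match in the alternate field "
        + "either, the alternate field will be shown without highlighting, but "
        + "could be marked by other processors.";
    BreakIterator bi = BreakIterator.getWordInstance(Locale.ROOT);
    bi.setText(text);
    int match = text.indexOf("field");
    int start = passageStart(bi, match, 50, 0.5, 0);
    System.out.println("leading context: \"" + text.substring(start, match) + "\"");
  }
}
```

Note that the JDK word instance reports boundaries on both sides of whitespace, 
so the chosen start may sit on a space; a real implementation would likely skip 
such boundaries or trim the result.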

I'm aware that the use of BreakIterator is limiting, constraining our solution 
space.  And it puts undue extra work on us to implement the JDK-defined 
abstraction.  Perhaps like the FVH, the UH needs its own abstraction here.  CC 
[~romseygeek]
