[jira] [Commented] (LUCENE-7438) UnifiedHighlighter

2016-10-31 Thread Timothy M. Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15623451#comment-15623451
 ] 

Timothy M. Rodriguez commented on LUCENE-7438:
--

Not yet, we have an initial general implementation, but it's lacking tests.  
(We have a customized extension internally that does have tests.)  I've created 
a new ticket https://issues.apache.org/jira/browse/SOLR-9708 with a PR 
containing the initial impl so folks can follow or help the work towards 
finishing it up.  Thanks for asking though, hopefully this gets the ball 
rolling faster.

> UnifiedHighlighter
> --
>
> Key: LUCENE-7438
> URL: https://issues.apache.org/jira/browse/LUCENE-7438
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/highlighter
>Affects Versions: 6.2
>Reporter: Timothy M. Rodriguez
>Assignee: David Smiley
> Fix For: 6.3
>
> Attachments: LUCENE-7438.patch, LUCENE_7438_UH_benchmark.patch, 
> LUCENE_7438_UH_benchmark.patch, LUCENE_7438_UH_small_changes.patch
>
>
> The UnifiedHighlighter is an evolution of the PostingsHighlighter that is 
> able to highlight using offsets in either postings, term vectors, or from 
> analysis (a TokenStream). Lucene’s existing highlighters are mostly 
> demarcated along offset source lines, whereas here it is unified -- hence 
> this proposed name. In this highlighter, the offset source strategy is 
> separated from the core highlighting functionalty. The UnifiedHighlighter 
> further improves on the PostingsHighlighter’s design by supporting accurate 
> phrase highlighting using an approach similar to the standard highlighter’s 
> WeightedSpanTermExtractor. The next major improvement is a hybrid offset 
> source strategythat utilizes postings and “light” term vectors (i.e. just the 
> terms) for highlighting multi-term queries (wildcards) without resorting to 
> analysis. Phrase highlighting and wildcard highlighting can both be disabled 
> if you’d rather highlight a little faster albeit not as accurately reflecting 
> the query.
> We’ve benchmarked an earlier version of this highlighter comparing it to the 
> other highlighters and the results were exciting! It’s tempting to share 
> those results but it’s definitely due for another benchmark, so we’ll work on 
> that. Performance was the main motivator for creating the UnifiedHighlighter, 
> as the standard Highlighter (the only one meeting Bloomberg Law’s accuracy 
> requirements) wasn’t fast enough, even with term vectors along with several 
> improvements we contributed back, and even after we forked it to highlight in 
> multiple threads.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7438) UnifiedHighlighter

2016-10-31 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15622882#comment-15622882
 ] 

Yonik Seeley commented on LUCENE-7438:
--

Is this usable from Solr yet, or should a new issue be opened for that?
A quick grep suggests there's nothing yet:
{code}
find solr -type f | xargs grep -i UnifiedHighlighter
{code}

> UnifiedHighlighter
> --
>
> Key: LUCENE-7438
> URL: https://issues.apache.org/jira/browse/LUCENE-7438
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/highlighter
>Affects Versions: 6.2
>Reporter: Timothy M. Rodriguez
>Assignee: David Smiley
> Fix For: 6.3
>
> Attachments: LUCENE-7438.patch, LUCENE_7438_UH_benchmark.patch, 
> LUCENE_7438_UH_benchmark.patch, LUCENE_7438_UH_small_changes.patch
>
>
> The UnifiedHighlighter is an evolution of the PostingsHighlighter that is 
> able to highlight using offsets in either postings, term vectors, or from 
> analysis (a TokenStream). Lucene’s existing highlighters are mostly 
> demarcated along offset source lines, whereas here it is unified -- hence 
> this proposed name. In this highlighter, the offset source strategy is 
> separated from the core highlighting functionalty. The UnifiedHighlighter 
> further improves on the PostingsHighlighter’s design by supporting accurate 
> phrase highlighting using an approach similar to the standard highlighter’s 
> WeightedSpanTermExtractor. The next major improvement is a hybrid offset 
> source strategythat utilizes postings and “light” term vectors (i.e. just the 
> terms) for highlighting multi-term queries (wildcards) without resorting to 
> analysis. Phrase highlighting and wildcard highlighting can both be disabled 
> if you’d rather highlight a little faster albeit not as accurately reflecting 
> the query.
> We’ve benchmarked an earlier version of this highlighter comparing it to the 
> other highlighters and the results were exciting! It’s tempting to share 
> those results but it’s definitely due for another benchmark, so we’ll work on 
> that. Performance was the main motivator for creating the UnifiedHighlighter, 
> as the standard Highlighter (the only one meeting Bloomberg Law’s accuracy 
> requirements) wasn’t fast enough, even with term vectors along with several 
> improvements we contributed back, and even after we forked it to highlight in 
> multiple threads.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7438) UnifiedHighlighter

2016-10-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15616233#comment-15616233
 ] 

ASF GitHub Bot commented on LUCENE-7438:


Github user Timothy055 closed the pull request at:

https://github.com/apache/lucene-solr/pull/79


> UnifiedHighlighter
> --
>
> Key: LUCENE-7438
> URL: https://issues.apache.org/jira/browse/LUCENE-7438
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/highlighter
>Affects Versions: 6.2
>Reporter: Timothy M. Rodriguez
>Assignee: David Smiley
> Fix For: 6.3
>
> Attachments: LUCENE-7438.patch, LUCENE_7438_UH_benchmark.patch, 
> LUCENE_7438_UH_benchmark.patch, LUCENE_7438_UH_small_changes.patch
>
>
> The UnifiedHighlighter is an evolution of the PostingsHighlighter that is 
> able to highlight using offsets in either postings, term vectors, or from 
> analysis (a TokenStream). Lucene’s existing highlighters are mostly 
> demarcated along offset source lines, whereas here it is unified -- hence 
> this proposed name. In this highlighter, the offset source strategy is 
> separated from the core highlighting functionalty. The UnifiedHighlighter 
> further improves on the PostingsHighlighter’s design by supporting accurate 
> phrase highlighting using an approach similar to the standard highlighter’s 
> WeightedSpanTermExtractor. The next major improvement is a hybrid offset 
> source strategythat utilizes postings and “light” term vectors (i.e. just the 
> terms) for highlighting multi-term queries (wildcards) without resorting to 
> analysis. Phrase highlighting and wildcard highlighting can both be disabled 
> if you’d rather highlight a little faster albeit not as accurately reflecting 
> the query.
> We’ve benchmarked an earlier version of this highlighter comparing it to the 
> other highlighters and the results were exciting! It’s tempting to share 
> those results but it’s definitely due for another benchmark, so we’ll work on 
> that. Performance was the main motivator for creating the UnifiedHighlighter, 
> as the standard Highlighter (the only one meeting Bloomberg Law’s accuracy 
> requirements) wasn’t fast enough, even with term vectors along with several 
> improvements we contributed back, and even after we forked it to highlight in 
> multiple threads.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7438) UnifiedHighlighter

2016-10-08 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558079#comment-15558079
 ] 

David Smiley commented on LUCENE-7438:
--

Thanks for fixing the build Mike!  My bad.

> UnifiedHighlighter
> --
>
> Key: LUCENE-7438
> URL: https://issues.apache.org/jira/browse/LUCENE-7438
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/highlighter
>Affects Versions: 6.2
>Reporter: Timothy M. Rodriguez
>Assignee: David Smiley
> Fix For: 6.3
>
> Attachments: LUCENE-7438.patch, LUCENE_7438_UH_benchmark.patch, 
> LUCENE_7438_UH_benchmark.patch, LUCENE_7438_UH_small_changes.patch
>
>
> The UnifiedHighlighter is an evolution of the PostingsHighlighter that is 
> able to highlight using offsets in either postings, term vectors, or from 
> analysis (a TokenStream). Lucene’s existing highlighters are mostly 
> demarcated along offset source lines, whereas here it is unified -- hence 
> this proposed name. In this highlighter, the offset source strategy is 
> separated from the core highlighting functionalty. The UnifiedHighlighter 
> further improves on the PostingsHighlighter’s design by supporting accurate 
> phrase highlighting using an approach similar to the standard highlighter’s 
> WeightedSpanTermExtractor. The next major improvement is a hybrid offset 
> source strategythat utilizes postings and “light” term vectors (i.e. just the 
> terms) for highlighting multi-term queries (wildcards) without resorting to 
> analysis. Phrase highlighting and wildcard highlighting can both be disabled 
> if you’d rather highlight a little faster albeit not as accurately reflecting 
> the query.
> We’ve benchmarked an earlier version of this highlighter comparing it to the 
> other highlighters and the results were exciting! It’s tempting to share 
> those results but it’s definitely due for another benchmark, so we’ll work on 
> that. Performance was the main motivator for creating the UnifiedHighlighter, 
> as the standard Highlighter (the only one meeting Bloomberg Law’s accuracy 
> requirements) wasn’t fast enough, even with term vectors along with several 
> improvements we contributed back, and even after we forked it to highlight in 
> multiple threads.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7438) UnifiedHighlighter

2016-10-08 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557694#comment-15557694
 ] 

ASF subversion and git services commented on LUCENE-7438:
-

Commit f247acfab17a5363166eb601cb26243e74ca108c in lucene-solr's branch 
refs/heads/branch_6x from Mike McCandless
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=f247acf ]

LUCENE-7438: fix broken build


> UnifiedHighlighter
> --
>
> Key: LUCENE-7438
> URL: https://issues.apache.org/jira/browse/LUCENE-7438
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/highlighter
>Affects Versions: 6.2
>Reporter: Timothy M. Rodriguez
>Assignee: David Smiley
> Fix For: 6.3
>
> Attachments: LUCENE-7438.patch, LUCENE_7438_UH_benchmark.patch, 
> LUCENE_7438_UH_benchmark.patch, LUCENE_7438_UH_small_changes.patch
>
>
> The UnifiedHighlighter is an evolution of the PostingsHighlighter that is 
> able to highlight using offsets in either postings, term vectors, or from 
> analysis (a TokenStream). Lucene’s existing highlighters are mostly 
> demarcated along offset source lines, whereas here it is unified -- hence 
> this proposed name. In this highlighter, the offset source strategy is 
> separated from the core highlighting functionalty. The UnifiedHighlighter 
> further improves on the PostingsHighlighter’s design by supporting accurate 
> phrase highlighting using an approach similar to the standard highlighter’s 
> WeightedSpanTermExtractor. The next major improvement is a hybrid offset 
> source strategythat utilizes postings and “light” term vectors (i.e. just the 
> terms) for highlighting multi-term queries (wildcards) without resorting to 
> analysis. Phrase highlighting and wildcard highlighting can both be disabled 
> if you’d rather highlight a little faster albeit not as accurately reflecting 
> the query.
> We’ve benchmarked an earlier version of this highlighter comparing it to the 
> other highlighters and the results were exciting! It’s tempting to share 
> those results but it’s definitely due for another benchmark, so we’ll work on 
> that. Performance was the main motivator for creating the UnifiedHighlighter, 
> as the standard Highlighter (the only one meeting Bloomberg Law’s accuracy 
> requirements) wasn’t fast enough, even with term vectors along with several 
> improvements we contributed back, and even after we forked it to highlight in 
> multiple threads.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7438) UnifiedHighlighter

2016-10-08 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557693#comment-15557693
 ] 

ASF subversion and git services commented on LUCENE-7438:
-

Commit 0414570348e5b8c8c3da8b1ad491b1b418a3756a in lucene-solr's branch 
refs/heads/master from Mike McCandless
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=0414570 ]

LUCENE-7438: fix broken build


> UnifiedHighlighter
> --
>
> Key: LUCENE-7438
> URL: https://issues.apache.org/jira/browse/LUCENE-7438
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/highlighter
>Affects Versions: 6.2
>Reporter: Timothy M. Rodriguez
>Assignee: David Smiley
> Fix For: 6.3
>
> Attachments: LUCENE-7438.patch, LUCENE_7438_UH_benchmark.patch, 
> LUCENE_7438_UH_benchmark.patch, LUCENE_7438_UH_small_changes.patch
>
>
> The UnifiedHighlighter is an evolution of the PostingsHighlighter that is 
> able to highlight using offsets in either postings, term vectors, or from 
> analysis (a TokenStream). Lucene’s existing highlighters are mostly 
> demarcated along offset source lines, whereas here it is unified -- hence 
> this proposed name. In this highlighter, the offset source strategy is 
> separated from the core highlighting functionalty. The UnifiedHighlighter 
> further improves on the PostingsHighlighter’s design by supporting accurate 
> phrase highlighting using an approach similar to the standard highlighter’s 
> WeightedSpanTermExtractor. The next major improvement is a hybrid offset 
> source strategythat utilizes postings and “light” term vectors (i.e. just the 
> terms) for highlighting multi-term queries (wildcards) without resorting to 
> analysis. Phrase highlighting and wildcard highlighting can both be disabled 
> if you’d rather highlight a little faster albeit not as accurately reflecting 
> the query.
> We’ve benchmarked an earlier version of this highlighter comparing it to the 
> other highlighters and the results were exciting! It’s tempting to share 
> those results but it’s definitely due for another benchmark, so we’ll work on 
> that. Performance was the main motivator for creating the UnifiedHighlighter, 
> as the standard Highlighter (the only one meeting Bloomberg Law’s accuracy 
> requirements) wasn’t fast enough, even with term vectors along with several 
> improvements we contributed back, and even after we forked it to highlight in 
> multiple threads.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7438) UnifiedHighlighter

2016-10-07 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1146#comment-1146
 ] 

ASF subversion and git services commented on LUCENE-7438:
-

Commit 3497a2902c198b8092b4b0352650e58543b296b5 in lucene-solr's branch 
refs/heads/branch_6x from [~dsmiley]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=3497a29 ]

LUCENE-7438: Renovate benchmark module's support for highlighting

(cherry picked from commit 5ef60af)


> UnifiedHighlighter
> --
>
> Key: LUCENE-7438
> URL: https://issues.apache.org/jira/browse/LUCENE-7438
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/highlighter
>Affects Versions: 6.2
>Reporter: Timothy M. Rodriguez
>Assignee: David Smiley
> Fix For: 6.3
>
> Attachments: LUCENE-7438.patch, LUCENE_7438_UH_benchmark.patch, 
> LUCENE_7438_UH_benchmark.patch, LUCENE_7438_UH_small_changes.patch
>
>
> The UnifiedHighlighter is an evolution of the PostingsHighlighter that is 
> able to highlight using offsets in either postings, term vectors, or from 
> analysis (a TokenStream). Lucene’s existing highlighters are mostly 
> demarcated along offset source lines, whereas here it is unified -- hence 
> this proposed name. In this highlighter, the offset source strategy is 
> separated from the core highlighting functionalty. The UnifiedHighlighter 
> further improves on the PostingsHighlighter’s design by supporting accurate 
> phrase highlighting using an approach similar to the standard highlighter’s 
> WeightedSpanTermExtractor. The next major improvement is a hybrid offset 
> source strategythat utilizes postings and “light” term vectors (i.e. just the 
> terms) for highlighting multi-term queries (wildcards) without resorting to 
> analysis. Phrase highlighting and wildcard highlighting can both be disabled 
> if you’d rather highlight a little faster albeit not as accurately reflecting 
> the query.
> We’ve benchmarked an earlier version of this highlighter comparing it to the 
> other highlighters and the results were exciting! It’s tempting to share 
> those results but it’s definitely due for another benchmark, so we’ll work on 
> that. Performance was the main motivator for creating the UnifiedHighlighter, 
> as the standard Highlighter (the only one meeting Bloomberg Law’s accuracy 
> requirements) wasn’t fast enough, even with term vectors along with several 
> improvements we contributed back, and even after we forked it to highlight in 
> multiple threads.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7438) UnifiedHighlighter

2016-10-07 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1142#comment-1142
 ] 

ASF subversion and git services commented on LUCENE-7438:
-

Commit 5ef60af9c18624a317d5f3e8e331b2bb83c569db in lucene-solr's branch 
refs/heads/master from [~dsmiley]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5ef60af ]

LUCENE-7438: Renovate benchmark module's support for highlighting


> UnifiedHighlighter
> --
>
> Key: LUCENE-7438
> URL: https://issues.apache.org/jira/browse/LUCENE-7438
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/highlighter
>Affects Versions: 6.2
>Reporter: Timothy M. Rodriguez
>Assignee: David Smiley
> Fix For: 6.3
>
> Attachments: LUCENE-7438.patch, LUCENE_7438_UH_benchmark.patch, 
> LUCENE_7438_UH_benchmark.patch, LUCENE_7438_UH_small_changes.patch
>
>
> The UnifiedHighlighter is an evolution of the PostingsHighlighter that is 
> able to highlight using offsets in either postings, term vectors, or from 
> analysis (a TokenStream). Lucene’s existing highlighters are mostly 
> demarcated along offset source lines, whereas here it is unified -- hence 
> this proposed name. In this highlighter, the offset source strategy is 
> separated from the core highlighting functionalty. The UnifiedHighlighter 
> further improves on the PostingsHighlighter’s design by supporting accurate 
> phrase highlighting using an approach similar to the standard highlighter’s 
> WeightedSpanTermExtractor. The next major improvement is a hybrid offset 
> source strategythat utilizes postings and “light” term vectors (i.e. just the 
> terms) for highlighting multi-term queries (wildcards) without resorting to 
> analysis. Phrase highlighting and wildcard highlighting can both be disabled 
> if you’d rather highlight a little faster albeit not as accurately reflecting 
> the query.
> We’ve benchmarked an earlier version of this highlighter comparing it to the 
> other highlighters and the results were exciting! It’s tempting to share 
> those results but it’s definitely due for another benchmark, so we’ll work on 
> that. Performance was the main motivator for creating the UnifiedHighlighter, 
> as the standard Highlighter (the only one meeting Bloomberg Law’s accuracy 
> requirements) wasn’t fast enough, even with term vectors along with several 
> improvements we contributed back, and even after we forked it to highlight in 
> multiple threads.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7438) UnifiedHighlighter

2016-10-04 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15547461#comment-15547461
 ] 

ASF subversion and git services commented on LUCENE-7438:
-

Commit 4b6794368df373df1f68ccf27f7556914efeb95e in lucene-solr's branch 
refs/heads/branch_6x from [~dsmiley]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=4b67943 ]

LUCENE-7438: New UnifiedHighlighter

(cherry picked from commit 722e827)


> UnifiedHighlighter
> --
>
> Key: LUCENE-7438
> URL: https://issues.apache.org/jira/browse/LUCENE-7438
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/highlighter
>Affects Versions: 6.2
>Reporter: Timothy M. Rodriguez
>Assignee: David Smiley
> Attachments: LUCENE-7438.patch, LUCENE_7438_UH_benchmark.patch, 
> LUCENE_7438_UH_small_changes.patch
>
>
> The UnifiedHighlighter is an evolution of the PostingsHighlighter that is 
> able to highlight using offsets in either postings, term vectors, or from 
> analysis (a TokenStream). Lucene’s existing highlighters are mostly 
> demarcated along offset source lines, whereas here it is unified -- hence 
> this proposed name. In this highlighter, the offset source strategy is 
> separated from the core highlighting functionalty. The UnifiedHighlighter 
> further improves on the PostingsHighlighter’s design by supporting accurate 
> phrase highlighting using an approach similar to the standard highlighter’s 
> WeightedSpanTermExtractor. The next major improvement is a hybrid offset 
> source strategythat utilizes postings and “light” term vectors (i.e. just the 
> terms) for highlighting multi-term queries (wildcards) without resorting to 
> analysis. Phrase highlighting and wildcard highlighting can both be disabled 
> if you’d rather highlight a little faster albeit not as accurately reflecting 
> the query.
> We’ve benchmarked an earlier version of this highlighter comparing it to the 
> other highlighters and the results were exciting! It’s tempting to share 
> those results but it’s definitely due for another benchmark, so we’ll work on 
> that. Performance was the main motivator for creating the UnifiedHighlighter, 
> as the standard Highlighter (the only one meeting Bloomberg Law’s accuracy 
> requirements) wasn’t fast enough, even with term vectors along with several 
> improvements we contributed back, and even after we forked it to highlight in 
> multiple threads.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7438) UnifiedHighlighter

2016-10-04 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546521#comment-15546521
 ] 

ASF subversion and git services commented on LUCENE-7438:
-

Commit 722e82712435ecf46c9868137d885484152f749b in lucene-solr's branch 
refs/heads/master from [~dsmiley]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=722e827 ]

LUCENE-7438: New UnifiedHighlighter


> UnifiedHighlighter
> --
>
> Key: LUCENE-7438
> URL: https://issues.apache.org/jira/browse/LUCENE-7438
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/highlighter
>Affects Versions: 6.2
>Reporter: Timothy M. Rodriguez
>Assignee: David Smiley
> Attachments: LUCENE-7438.patch, LUCENE_7438_UH_benchmark.patch, 
> LUCENE_7438_UH_small_changes.patch
>
>
> The UnifiedHighlighter is an evolution of the PostingsHighlighter that is 
> able to highlight using offsets in either postings, term vectors, or from 
> analysis (a TokenStream). Lucene’s existing highlighters are mostly 
> demarcated along offset source lines, whereas here it is unified -- hence 
> this proposed name. In this highlighter, the offset source strategy is 
> separated from the core highlighting functionalty. The UnifiedHighlighter 
> further improves on the PostingsHighlighter’s design by supporting accurate 
> phrase highlighting using an approach similar to the standard highlighter’s 
> WeightedSpanTermExtractor. The next major improvement is a hybrid offset 
> source strategythat utilizes postings and “light” term vectors (i.e. just the 
> terms) for highlighting multi-term queries (wildcards) without resorting to 
> analysis. Phrase highlighting and wildcard highlighting can both be disabled 
> if you’d rather highlight a little faster albeit not as accurately reflecting 
> the query.
> We’ve benchmarked an earlier version of this highlighter comparing it to the 
> other highlighters and the results were exciting! It’s tempting to share 
> those results but it’s definitely due for another benchmark, so we’ll work on 
> that. Performance was the main motivator for creating the UnifiedHighlighter, 
> as the standard Highlighter (the only one meeting Bloomberg Law’s accuracy 
> requirements) wasn’t fast enough, even with term vectors along with several 
> improvements we contributed back, and even after we forked it to highlight in 
> multiple threads.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7438) UnifiedHighlighter

2016-09-30 Thread Timothy M. Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15537117#comment-15537117
 ] 

Timothy M. Rodriguez commented on LUCENE-7438:
--

After further consideration, it seems best to leave some of the classes common 
between Postings and the Unified highlighters separate.  If we were to use the 
same classes they'd ideally move to a common sub-package that both could share 
and this would introduce unneeded change and hurt potential compatibility for 
any users of those classes.  Keeping them separate also allows for a possible 
improvement to the method highlightFieldsAsObjects which internally creates a 
Map that is promptly thrown away again in the highlight methods.  I briefly 
investigated changing this to return the internal Object[][] array and avoid 
the extra Map allocation, but this creates some awkwardness since the 
Object[][] array sorts the input fields before filling the arrays, which would 
make the API somewhat of a trap for callers.  This undesired behavior is likely 
why the map is being created.  One way to fix this is to generify 
PassageFormatter over it's output type which would allow for a 
PassageFormatter in the case of the DefaultPassageFormatter.  However, 
changing this is a rather involved change that could ultimately result in the 
UnifiedHighlighter itself having a generic type and it was not clear that 
muddying the waters with that right now was a good idea.  However, keeping 
these classes separate will allow for an attempt at that in the future.

In the meantime, I've also pushed a commit to reduce the visibility of the 
MultiTermHighlighting to package protected.  As it stands, I think this patch 
is ready.

> UnifiedHighlighter
> --
>
> Key: LUCENE-7438
> URL: https://issues.apache.org/jira/browse/LUCENE-7438
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/highlighter
>Affects Versions: 6.2
>Reporter: Timothy M. Rodriguez
>Assignee: David Smiley
> Attachments: LUCENE_7438_UH_benchmark.patch
>
>
> The UnifiedHighlighter is an evolution of the PostingsHighlighter that is 
> able to highlight using offsets in either postings, term vectors, or from 
> analysis (a TokenStream). Lucene’s existing highlighters are mostly 
> demarcated along offset source lines, whereas here it is unified -- hence 
> this proposed name. In this highlighter, the offset source strategy is 
> separated from the core highlighting functionalty. The UnifiedHighlighter 
> further improves on the PostingsHighlighter’s design by supporting accurate 
> phrase highlighting using an approach similar to the standard highlighter’s 
> WeightedSpanTermExtractor. The next major improvement is a hybrid offset 
> source strategythat utilizes postings and “light” term vectors (i.e. just the 
> terms) for highlighting multi-term queries (wildcards) without resorting to 
> analysis. Phrase highlighting and wildcard highlighting can both be disabled 
> if you’d rather highlight a little faster albeit not as accurately reflecting 
> the query.
> We’ve benchmarked an earlier version of this highlighter comparing it to the 
> other highlighters and the results were exciting! It’s tempting to share 
> those results but it’s definitely due for another benchmark, so we’ll work on 
> that. Performance was the main motivator for creating the UnifiedHighlighter, 
> as the standard Highlighter (the only one meeting Bloomberg Law’s accuracy 
> requirements) wasn’t fast enough, even with term vectors along with several 
> improvements we contributed back, and even after we forked it to highlight in 
> multiple threads.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7438) UnifiedHighlighter

2016-09-27 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15526474#comment-15526474
 ] 

David Smiley commented on LUCENE-7438:
--

BTW I reviewed the W.E.H. and posted a comparison: 
https://github.com/wikimedia/search-highlighter/issues/19

> UnifiedHighlighter
> --
>
> Key: LUCENE-7438
> URL: https://issues.apache.org/jira/browse/LUCENE-7438
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/highlighter
>Affects Versions: 6.2
>Reporter: Timothy M. Rodriguez
>Assignee: David Smiley
> Attachments: LUCENE_7438_UH_benchmark.patch
>
>
> The UnifiedHighlighter is an evolution of the PostingsHighlighter that is 
> able to highlight using offsets in either postings, term vectors, or from 
> analysis (a TokenStream). Lucene’s existing highlighters are mostly 
> demarcated along offset source lines, whereas here it is unified -- hence 
> this proposed name. In this highlighter, the offset source strategy is 
> separated from the core highlighting functionalty. The UnifiedHighlighter 
> further improves on the PostingsHighlighter’s design by supporting accurate 
> phrase highlighting using an approach similar to the standard highlighter’s 
> WeightedSpanTermExtractor. The next major improvement is a hybrid offset 
> source strategythat utilizes postings and “light” term vectors (i.e. just the 
> terms) for highlighting multi-term queries (wildcards) without resorting to 
> analysis. Phrase highlighting and wildcard highlighting can both be disabled 
> if you’d rather highlight a little faster albeit not as accurately reflecting 
> the query.
> We’ve benchmarked an earlier version of this highlighter comparing it to the 
> other highlighters and the results were exciting! It’s tempting to share 
> those results but it’s definitely due for another benchmark, so we’ll work on 
> that. Performance was the main motivator for creating the UnifiedHighlighter, 
> as the standard Highlighter (the only one meeting Bloomberg Law’s accuracy 
> requirements) wasn’t fast enough, even with term vectors along with several 
> improvements we contributed back, and even after we forked it to highlight in 
> multiple threads.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7438) UnifiedHighlighter

2016-09-26 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15524120#comment-15524120
 ] 

David Smiley commented on LUCENE-7438:
--

Thanks for your input Ryan!  I have not heard of Wikimedia's "experimental 
highlighter"; maybe I'll go poke around there to see what it's doing.

RE snippet delineation: This is customizable in the PH & UH & FVH via supplying 
a java.text.BreakIterator.

RE snippets with multiple keywords: Yes!  I've had that thought as well and 
some months ago I filed a TODO in my wish list of highlighter features to add a 
"coordination factor" to the UH algorithm (which at the moment is identical to 
the PH).  A simple coord factor would help a lot, but even better would be one 
that also considered term diversity across the multiple passages you ask for 
rather than scoring each separately.  To do this, it might internally track 
it's the leading candidate passage per term, in addition to the PriorityQueue 
it has now.  This would be very low overhead.

> UnifiedHighlighter
> --
>
> Key: LUCENE-7438
> URL: https://issues.apache.org/jira/browse/LUCENE-7438
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/highlighter
>Affects Versions: 6.2
>Reporter: Timothy M. Rodriguez
>Assignee: David Smiley
> Attachments: LUCENE_7438_UH_benchmark.patch
>
>
> The UnifiedHighlighter is an evolution of the PostingsHighlighter that is 
> able to highlight using offsets in either postings, term vectors, or from 
> analysis (a TokenStream). Lucene’s existing highlighters are mostly 
> demarcated along offset source lines, whereas here it is unified -- hence 
> this proposed name. In this highlighter, the offset source strategy is 
> separated from the core highlighting functionalty. The UnifiedHighlighter 
> further improves on the PostingsHighlighter’s design by supporting accurate 
> phrase highlighting using an approach similar to the standard highlighter’s 
> WeightedSpanTermExtractor. The next major improvement is a hybrid offset 
> source strategythat utilizes postings and “light” term vectors (i.e. just the 
> terms) for highlighting multi-term queries (wildcards) without resorting to 
> analysis. Phrase highlighting and wildcard highlighting can both be disabled 
> if you’d rather highlight a little faster albeit not as accurately reflecting 
> the query.
> We’ve benchmarked an earlier version of this highlighter comparing it to the 
> other highlighters and the results were exciting! It’s tempting to share 
> those results but it’s definitely due for another benchmark, so we’ll work on 
> that. Performance was the main motivator for creating the UnifiedHighlighter, 
> as the standard Highlighter (the only one meeting Bloomberg Law’s accuracy 
> requirements) wasn’t fast enough, even with term vectors along with several 
> improvements we contributed back, and even after we forked it to highlight in 
> multiple threads.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7438) UnifiedHighlighter

2016-09-26 Thread Ryan Pedela (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15524080#comment-15524080
 ] 

Ryan Pedela commented on LUCENE-7438:
-

I am very happy to see this. I use Elasticsearch, and I currently use the 
[experimental highlighter 
plugin|https://github.com/wikimedia/search-highlighter] for three reasons.

1. It uses either term vectors or postings to increase performance.
2. It has fragment and sentence modes.
3. The sentence mode produces significantly better highlights than the postings 
highlighter in my experience.

I would prefer to use an official highlighter and happy to see that the 
UnifiedHighlighter will take care of #1 and #2. Now I would like to talk about 
#3.

I don't know the specifics of the algorithm, but the experimental highlighter 
appears to take proximity and a keyword's document position into account. One 
example from memory, I had a medical research paper about warfarin and the 
highlight returned by the postings highlighter for the search "warfarin" came 
from the references. However the experimental highlighter returned a highlight 
near the beginning of the paper and it was a pretty good summary of the paper.

There is also room for improvement for both the experimental and postings 
highlighters. They both appear to use the same sentence fragmenter which does 
not do a good job with abbreviations and decimal points. Would something like 
Stanford CoreNLP help?

Also I would like a highlighter that tries to get as many keywords as possible 
into the highlight, at least as a config option. That is hard if only returning 
a single sentence or fragment. However I often want three fragments and I would 
like the union of the three fragments to contain all the keywords or as many as 
possible. For example, I am working on a search engine for SEC filings and a 
user searched "BPL hedgings" during a user test. BPL is the stock ticker for 
Buckeye Partners, and stock tickers are pretty unique within the SEC filings. 
The experimental highlighter returned three fragments with "BPL" but no 
"hedgings" (fast vector highlighter produced similar fragments). The user was 
very confused because they didn't see the word "hedgings" in the highlight and 
thought that keyword wasn't found even though it was. To fix this, I retrieve 
the top 100 fragments and post-process them to find the best 3 fragments which 
contain the most keywords collectively. The post processing is quite naive 
since it does not understand proximity, stemming, etc. I would prefer if Lucene 
or ES did it because it can be much smarter.

> UnifiedHighlighter
> --
>
> Key: LUCENE-7438
> URL: https://issues.apache.org/jira/browse/LUCENE-7438
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/highlighter
>Affects Versions: 6.2
>Reporter: Timothy M. Rodriguez
>Assignee: David Smiley
> Attachments: LUCENE_7438_UH_benchmark.patch
>
>
> The UnifiedHighlighter is an evolution of the PostingsHighlighter that is 
> able to highlight using offsets in either postings, term vectors, or from 
> analysis (a TokenStream). Lucene’s existing highlighters are mostly 
> demarcated along offset source lines, whereas here it is unified -- hence 
> this proposed name. In this highlighter, the offset source strategy is 
> separated from the core highlighting functionalty. The UnifiedHighlighter 
> further improves on the PostingsHighlighter’s design by supporting accurate 
> phrase highlighting using an approach similar to the standard highlighter’s 
> WeightedSpanTermExtractor. The next major improvement is a hybrid offset 
> source strategythat utilizes postings and “light” term vectors (i.e. just the 
> terms) for highlighting multi-term queries (wildcards) without resorting to 
> analysis. Phrase highlighting and wildcard highlighting can both be disabled 
> if you’d rather highlight a little faster albeit not as accurately reflecting 
> the query.
> We’ve benchmarked an earlier version of this highlighter comparing it to the 
> other highlighters and the results were exciting! It’s tempting to share 
> those results but it’s definitely due for another benchmark, so we’ll work on 
> that. Performance was the main motivator for creating the UnifiedHighlighter, 
> as the standard Highlighter (the only one meeting Bloomberg Law’s accuracy 
> requirements) wasn’t fast enough, even with term vectors along with several 
> improvements we contributed back, and even after we forked it to highlight in 
> multiple threads.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7438) UnifiedHighlighter

2016-09-21 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15509898#comment-15509898
 ] 

David Smiley commented on LUCENE-7438:
--

BTW just to re-inforce the wide precision in the numbers, note that UH_P & 
UH_PV should theoretically highlight term queries with identical code (PV gets 
internally optimized to P when there are no wildcards) yet it differed 0.61 to 
0.52 (17%).

Quoting myself earlier:
bq. For the surface classes users use: Passage, PassageScorer, 
PassageFormatter, DefaultPassageFormatter. – I don't think it good to have 
users use parts of another highlighter (postingshighlight), which is weird for 
users. I propose copying these with a leading 'U', i.e. UPassage etc. That said 
if others think that's a worse trade-off, it's no big deal to me. Once 
o.a.l.s.ph.Passage's constructor is public, it's possible to do that.

Reconsidering... I think it's fine to use these classes from the 
PostingsHighlighter... after all, these classes are only actually seen by users 
when they want to customize them.  A user can simply call public methods on UH 
and reference zero other classes (getting back strings based on default impls 
of those classes).  If there are no objections to this path, Tim/I can update 
the PR along with the other changes listed in the same comment I just 
referenced.

BTW this highlighter has been in production for about 6 months.  It rooted out 
a couple bugs.

> UnifiedHighlighter
> --
>
> Key: LUCENE-7438
> URL: https://issues.apache.org/jira/browse/LUCENE-7438
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/highlighter
>Affects Versions: 6.2
>Reporter: Timothy M. Rodriguez
>Assignee: David Smiley
> Attachments: LUCENE_7438_UH_benchmark.patch
>
>
> The UnifiedHighlighter is an evolution of the PostingsHighlighter that is 
> able to highlight using offsets in either postings, term vectors, or from 
> analysis (a TokenStream). Lucene’s existing highlighters are mostly 
> demarcated along offset source lines, whereas here it is unified -- hence 
> this proposed name. In this highlighter, the offset source strategy is 
> separated from the core highlighting functionalty. The UnifiedHighlighter 
> further improves on the PostingsHighlighter’s design by supporting accurate 
> phrase highlighting using an approach similar to the standard highlighter’s 
> WeightedSpanTermExtractor. The next major improvement is a hybrid offset 
> source strategythat utilizes postings and “light” term vectors (i.e. just the 
> terms) for highlighting multi-term queries (wildcards) without resorting to 
> analysis. Phrase highlighting and wildcard highlighting can both be disabled 
> if you’d rather highlight a little faster albeit not as accurately reflecting 
> the query.
> We’ve benchmarked an earlier version of this highlighter comparing it to the 
> other highlighters and the results were exciting! It’s tempting to share 
> those results but it’s definitely due for another benchmark, so we’ll work on 
> that. Performance was the main motivator for creating the UnifiedHighlighter, 
> as the standard Highlighter (the only one meeting Bloomberg Law’s accuracy 
> requirements) wasn’t fast enough, even with term vectors along with several 
> improvements we contributed back, and even after we forked it to highlight in 
> multiple threads.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7438) UnifiedHighlighter

2016-09-08 Thread Timothy M. Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474731#comment-15474731
 ] 

Timothy M. Rodriguez commented on LUCENE-7438:
--

Actually, I think passage relevancy might be something we'd look into in more 
details down the line.  Definitely, some of the things in LUCENE-4909 could be 
useful. :)  I see merit in keeping things separate to allow for flexibility.

> UnifiedHighlighter
> --
>
> Key: LUCENE-7438
> URL: https://issues.apache.org/jira/browse/LUCENE-7438
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/highlighter
>Affects Versions: 6.2
>Reporter: Timothy M. Rodriguez
>Assignee: David Smiley
>
> The UnifiedHighlighter is an evolution of the PostingsHighlighter that is 
> able to highlight using offsets in either postings, term vectors, or from 
> analysis (a TokenStream). Lucene’s existing highlighters are mostly 
> demarcated along offset source lines, whereas here it is unified -- hence 
> this proposed name. In this highlighter, the offset source strategy is 
> separated from the core highlighting functionalty. The UnifiedHighlighter 
> further improves on the PostingsHighlighter’s design by supporting accurate 
> phrase highlighting using an approach similar to the standard highlighter’s 
> WeightedSpanTermExtractor. The next major improvement is a hybrid offset 
> source strategythat utilizes postings and “light” term vectors (i.e. just the 
> terms) for highlighting multi-term queries (wildcards) without resorting to 
> analysis. Phrase highlighting and wildcard highlighting can both be disabled 
> if you’d rather highlight a little faster albeit not as accurately reflecting 
> the query.
> We’ve benchmarked an earlier version of this highlighter comparing it to the 
> other highlighters and the results were exciting! It’s tempting to share 
> those results but it’s definitely due for another benchmark, so we’ll work on 
> that. Performance was the main motivator for creating the UnifiedHighlighter, 
> as the standard Highlighter (the only one meeting Bloomberg Law’s accuracy 
> requirements) wasn’t fast enough, even with term vectors along with several 
> improvements we contributed back, and even after we forked it to highlight in 
> multiple threads.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7438) UnifiedHighlighter

2016-09-08 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474118#comment-15474118
 ] 

David Smiley commented on LUCENE-7438:
--

Thanks for chiming in Rob.

For the present while there are feature gaps (see "Missing features" above), I 
don't think we can suggest that there be only one highlighter.  I admit I see 
that as potential eventuality that I think is desirable, but it's a moot 
discussion right now.  That being said, the UH, being based on the PH, does 
everything it does and more.  It scores/ranks and formats using the same code.  
The very kernel of the highlighter that produces the Passages[] (now in 
FieldHighlighter.highlightOffsetsEnums) is essentially the same.  Still, I 
don't think we should do any removing of highlighters at this time.  
Eventually, we can ask ourselves, what is highlighter XYZ giving us over the 
UnifiedHighlighter?  And then we can see if we (and other users) think it's 
worth keeping it.

RE PostingsHighlighter perf trade-offs:

Yeah I know it's possible to craft an extreme case that would exercise the 
PostingsEnum reuse --  loads of terms in the query and an optimized index.  
Once we have some benchmarking, we can see how much of a hit was lost by not 
re-using.  That feature was retained in the UH for many months until just 
recently when it underwent a large refactor to simplify things.  Other than 
this, I don't believe there are any tricks in the PH that we removed in the UH.

RE ranking/scoring "needs":

I'm not aware that the UH might have different passing scoring "needs" than the 
PH.  The PH's algorithm seems really nice to me; I didn't put any thought into 
this aspect.  But yeah maybe there might be improvements for phrase/span 
queries in particular.  By the way, PhraseHelper, simply filters out certain 
occurrences of certain terms.  Perhaps the frequency of the span might be used 
in scoring?  But to know that, you must iterate them, and then you lose lazy 
iteration.  Perhaps someone wanting to trade-off performance for possibly 
better passage relevance would make this trade-off?  We/BLAW have no plans to 
do that.  If someone comes along with such a requirement, I hope we can 
accommodate that interesting direction.

RE moving / renaming / visibility

If you have specific suggestions (e.g. w.r.t. MultiTermHighlighting) on how 
they might be renamed and re-shuffled to different packages than I'd love to 
hear your thoughts on that.  Some things of the UH are expressly public because 
we/BLAW are using those endpoints but we/BLAW don't use MultiTermHighlighting 
at this time.  But I could imagine some custom wildcard query coming into 
existence and it would be a PITA if we couldn't help MTH understand some new 
query.  Similar for WSTE.

BTW there is a visibility test expressly for ensuring certain things are public.

> UnifiedHighlighter
> --
>
> Key: LUCENE-7438
> URL: https://issues.apache.org/jira/browse/LUCENE-7438
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/highlighter
>Affects Versions: 6.2
>Reporter: Timothy M. Rodriguez
>Assignee: David Smiley
>
> The UnifiedHighlighter is an evolution of the PostingsHighlighter that is 
> able to highlight using offsets in either postings, term vectors, or from 
> analysis (a TokenStream). Lucene’s existing highlighters are mostly 
> demarcated along offset source lines, whereas here it is unified -- hence 
> this proposed name. In this highlighter, the offset source strategy is 
> separated from the core highlighting functionalty. The UnifiedHighlighter 
> further improves on the PostingsHighlighter’s design by supporting accurate 
> phrase highlighting using an approach similar to the standard highlighter’s 
> WeightedSpanTermExtractor. The next major improvement is a hybrid offset 
> source strategythat utilizes postings and “light” term vectors (i.e. just the 
> terms) for highlighting multi-term queries (wildcards) without resorting to 
> analysis. Phrase highlighting and wildcard highlighting can both be disabled 
> if you’d rather highlight a little faster albeit not as accurately reflecting 
> the query.
> We’ve benchmarked an earlier version of this highlighter comparing it to the 
> other highlighters and the results were exciting! It’s tempting to share 
> those results but it’s definitely due for another benchmark, so we’ll work on 
> that. Performance was the main motivator for creating the UnifiedHighlighter, 
> as the standard Highlighter (the only one meeting Bloomberg Law’s accuracy 
> requirements) wasn’t fast enough, even with term vectors along with several 
> improvements we contributed back, and even after we forked it to highlight in 
> multiple threads.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (LUCENE-7438) UnifiedHighlighter

2016-09-08 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15473933#comment-15473933
 ] 

Robert Muir commented on LUCENE-7438:
-

I think there are room for plenty of new highlighters in lucene, so its great 
to have another one. I do get the feeling from some issues etc, that some feel 
there "can be only one", but I don't see any good reasons for that. On the 
other hand, like codecs and other things in lucene, we should explore different 
approaches that give the user more choices (like this new highlighter here). 

I think this is especially important because of how "personal" highlighting is 
to the app, and the fact that performance/relevance is tricky stuff here 
depending on how the app works! For example about the reuse note: this 
highlighter discards reuse of some internal lucene structures, but under some 
circumstances (e.g. certain query structures/Directory impl/doc sizes/top-N 
sizes/stopwords or lack thereof) this could indeed matter a lot. For PH it does 
this simply because it tries to maximize perf everywhere (possibly to the 
extreme: perhaps it really is the wrong tradeoff, but that was a "different" 
direction to explore). There are lots of ways these things can perform or be 
very slow, and a lot of it is hard to generalize across all use-cases!

As far as the duplication of classes, I'd be a little careful before 
refactoring too much of it, because of that very reason. Maybe UH needs to 
ultimately go in different directions than PH and we should just let it do that.

For example ranking: PH disregards query structure and tries to use a 
bag-of-words approach with something similar to traditional ranking for that, 
the idea is that hopefully that stuff works well on a small scale too.

But UH might need something else: if it attempts to use more query structure 
than bag-of-words, then UH might need to do something else. I haven't looked to 
see how things like IDF are computed there, that's just an example.

And maybe the right direction for PH, given what it tries to do, is to do 
something like LUCENE-4909... that sorta sits out there because we don't have a 
good way of measuring quality? 

I also do worry a bit about making internal-only classes like 
MultitermHighlighting public to the user, I think this has a real heavy API 
cost. Maybe for that one in particular its the right way to go, given how 
hacky/hairy it is especially. Maybe it can be renamed to something better to 
limit the confusion :) This problem isn't really unique to highlighters though, 
its something that should be addressed better in general, e.g. with 
internal-only packages that are hidden or something like that.

Anyway, these are just some general thoughts. Glad to see we will have more 
choices.

> UnifiedHighlighter
> --
>
> Key: LUCENE-7438
> URL: https://issues.apache.org/jira/browse/LUCENE-7438
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/highlighter
>Affects Versions: 6.2
>Reporter: Timothy M. Rodriguez
>Assignee: David Smiley
>
> The UnifiedHighlighter is an evolution of the PostingsHighlighter that is 
> able to highlight using offsets in either postings, term vectors, or from 
> analysis (a TokenStream). Lucene’s existing highlighters are mostly 
> demarcated along offset source lines, whereas here it is unified -- hence 
> this proposed name. In this highlighter, the offset source strategy is 
> separated from the core highlighting functionalty. The UnifiedHighlighter 
> further improves on the PostingsHighlighter’s design by supporting accurate 
> phrase highlighting using an approach similar to the standard highlighter’s 
> WeightedSpanTermExtractor. The next major improvement is a hybrid offset 
> source strategythat utilizes postings and “light” term vectors (i.e. just the 
> terms) for highlighting multi-term queries (wildcards) without resorting to 
> analysis. Phrase highlighting and wildcard highlighting can both be disabled 
> if you’d rather highlight a little faster albeit not as accurately reflecting 
> the query.
> We’ve benchmarked an earlier version of this highlighter comparing it to the 
> other highlighters and the results were exciting! It’s tempting to share 
> those results but it’s definitely due for another benchmark, so we’ll work on 
> that. Performance was the main motivator for creating the UnifiedHighlighter, 
> as the standard Highlighter (the only one meeting Bloomberg Law’s accuracy 
> requirements) wasn’t fast enough, even with term vectors along with several 
> improvements we contributed back, and even after we forked it to highlight in 
> multiple threads.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: 

[jira] [Commented] (LUCENE-7438) UnifiedHighlighter

2016-09-08 Thread Timothy M. Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15473896#comment-15473896
 ] 

Timothy M. Rodriguez commented on LUCENE-7438:
--

I'm not a fan of forking classes with Uxyz naming scheme.  I think it'd be 
better to make the existing classes re-usable or keep the current naming 
scheme.  That being said, if we make the existing classes re-usable, it might 
be better to plan on moving them into some common package later on so it's 
clearer that they are re-used.

> UnifiedHighlighter
> --
>
> Key: LUCENE-7438
> URL: https://issues.apache.org/jira/browse/LUCENE-7438
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/highlighter
>Affects Versions: 6.2
>Reporter: Timothy M. Rodriguez
>Assignee: David Smiley
>
> The UnifiedHighlighter is an evolution of the PostingsHighlighter that is 
> able to highlight using offsets in either postings, term vectors, or from 
> analysis (a TokenStream). Lucene’s existing highlighters are mostly 
> demarcated along offset source lines, whereas here it is unified -- hence 
> this proposed name. In this highlighter, the offset source strategy is 
> separated from the core highlighting functionalty. The UnifiedHighlighter 
> further improves on the PostingsHighlighter’s design by supporting accurate 
> phrase highlighting using an approach similar to the standard highlighter’s 
> WeightedSpanTermExtractor. The next major improvement is a hybrid offset 
> source strategythat utilizes postings and “light” term vectors (i.e. just the 
> terms) for highlighting multi-term queries (wildcards) without resorting to 
> analysis. Phrase highlighting and wildcard highlighting can both be disabled 
> if you’d rather highlight a little faster albeit not as accurately reflecting 
> the query.
> We’ve benchmarked an earlier version of this highlighter comparing it to the 
> other highlighters and the results were exciting! It’s tempting to share 
> those results but it’s definitely due for another benchmark, so we’ll work on 
> that. Performance was the main motivator for creating the UnifiedHighlighter, 
> as the standard Highlighter (the only one meeting Bloomberg Law’s accuracy 
> requirements) wasn’t fast enough, even with term vectors along with several 
> improvements we contributed back, and even after we forked it to highlight in 
> multiple threads.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7438) UnifiedHighlighter

2016-09-08 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15473810#comment-15473810
 ] 

David Smiley commented on LUCENE-7438:
--

I think we can avoid some duplication confusion as follows:

For the internal classes that user's don't normally use:

* *MultiTermHighlighting*: transfer most of the changes I did in 
MultiTermHighlighting to the copy in the {{postingshighlight}} package -- 
particularly to anything that already existed there. Then make that public and 
lucene.internal so it can be accessed.  That is very low-impact on the PH. For 
the couple methods added -- {{uninvertAndFilterTerms}} and 
{{makeStringMatchAutomata}} I think we can add these to FieldOffsetStrategy and 
AnalysisOffsetStrategy respectively.  And add comments mentioning it would 
logically go in MTH but since that's in a different highlighter, we don't.
* *TokenStreamFromTermVector*: I think we can replace the one in the 
{{highlighter}} with this one, as the sparseness ratio is configurable in the 
constructor.

For the surface classes users use: Passage, PassageScorer, PassageFormatter, 
DefaultPassageFormatter.  -- I don't think it good to have users use parts of 
another highlighter ({{postingshighlight}}), which is weird for users.  I 
propose copying these with a leading 'U', i.e. {{UPassage}} etc.  That said if 
others think that's a worse trade-off, it's no big deal to me.  Once 
{{o.a.l.s.ph.Passage}}'s constructor is public, it's possible to do that.

RE benchmarks... not sure when we'll have those ready but I would hope by the 
end of this month.  I figure using our benchmark module on wikipedia is a fine 
way to go.  I've used that to benchmark enhancements to the standard 
highlighter before.

Thoughts (esp. from other committers)?  [~rcmuir], I figure you'll have some 
valuable feedback as you did most (all?) of the herculean work on the 
PostingsHighlighter which was an ideal starting point for this UH.   I know 
some folks are on vacation or at another conference right now who I know want 
to provide feedback so I'm in no hurry to commit anything.

> UnifiedHighlighter
> --
>
> Key: LUCENE-7438
> URL: https://issues.apache.org/jira/browse/LUCENE-7438
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/highlighter
>Affects Versions: 6.2
>Reporter: Timothy M. Rodriguez
>Assignee: David Smiley
>
> The UnifiedHighlighter is an evolution of the PostingsHighlighter that is 
> able to highlight using offsets in either postings, term vectors, or from 
> analysis (a TokenStream). Lucene’s existing highlighters are mostly 
> demarcated along offset source lines, whereas here it is unified -- hence 
> this proposed name. In this highlighter, the offset source strategy is 
> separated from the core highlighting functionalty. The UnifiedHighlighter 
> further improves on the PostingsHighlighter’s design by supporting accurate 
> phrase highlighting using an approach similar to the standard highlighter’s 
> WeightedSpanTermExtractor. The next major improvement is a hybrid offset 
> source strategythat utilizes postings and “light” term vectors (i.e. just the 
> terms) for highlighting multi-term queries (wildcards) without resorting to 
> analysis. Phrase highlighting and wildcard highlighting can both be disabled 
> if you’d rather highlight a little faster albeit not as accurately reflecting 
> the query.
> We’ve benchmarked an earlier version of this highlighter comparing it to the 
> other highlighters and the results were exciting! It’s tempting to share 
> those results but it’s definitely due for another benchmark, so we’ll work on 
> that. Performance was the main motivator for creating the UnifiedHighlighter, 
> as the standard Highlighter (the only one meeting Bloomberg Law’s accuracy 
> requirements) wasn’t fast enough, even with term vectors along with several 
> improvements we contributed back, and even after we forked it to highlight in 
> multiple threads.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7438) UnifiedHighlighter

2016-09-07 Thread Timothy M. Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15470904#comment-15470904
 ] 

Timothy M. Rodriguez commented on LUCENE-7438:
--

Pull request here: https://github.com/apache/lucene-solr/pull/79

I'd also like to specially acknowledge [~dsmiley] who has worked with us 
closely.  He did a very significant share of the work represented here.

> UnifiedHighlighter
> --
>
> Key: LUCENE-7438
> URL: https://issues.apache.org/jira/browse/LUCENE-7438
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/highlighter
>Affects Versions: 6.2
>Reporter: Timothy M. Rodriguez
>Assignee: David Smiley
>
> The UnifiedHighlighter is an evolution of the PostingsHighlighter that is 
> able to highlight using offsets in either postings, term vectors, or from 
> analysis (a TokenStream). Lucene’s existing highlighters are mostly 
> demarcated along offset source lines, whereas here it is unified -- hence 
> this proposed name. In this highlighter, the offset source strategy is 
> separated from the core highlighting functionalty. The UnifiedHighlighter 
> further improves on the PostingsHighlighter’s design by supporting accurate 
> phrase highlighting using an approach similar to the standard highlighter’s 
> WeightedSpanTermExtractor. The next major improvement is a hybrid offset 
> source strategythat utilizes postings and “light” term vectors (i.e. just the 
> terms) for highlighting multi-term queries (wildcards) without resorting to 
> analysis. Phrase highlighting and wildcard highlighting can both be disabled 
> if you’d rather highlight a little faster albeit not as accurately reflecting 
> the query.
> We’ve benchmarked an earlier version of this highlighter comparing it to the 
> other highlighters and the results were exciting! It’s tempting to share 
> those results but it’s definitely due for another benchmark, so we’ll work on 
> that. Performance was the main motivator for creating the UnifiedHighlighter, 
> as the standard Highlighter (the only one meeting Bloomberg Law’s accuracy 
> requirements) wasn’t fast enough, even with term vectors along with several 
> improvements we contributed back, and even after we forked it to highlight in 
> multiple threads.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7438) UnifiedHighlighter

2016-09-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15470875#comment-15470875
 ] 

ASF GitHub Bot commented on LUCENE-7438:


GitHub user Timothy055 opened a pull request:

https://github.com/apache/lucene-solr/pull/79

LUCENE-7438 UnifiedHighlighter

Initial pull request for 
[LUCENE-7438](https://issues.apache.org/jira/browse/LUCENE-7438)

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/Timothy055/lucene-solr master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/lucene-solr/pull/79.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #79


commit 02e932c4a6146363680b88f4947a693c6697c955
Author: Timothy Rodriguez 
Date:   2016-09-01T19:23:50Z

Initial fork of PostingsHighlighter for UnifiedHighlighter

commit 9d88411b3985a98851384d78d681431dba710e89
Author: Timothy Rodriguez 
Date:   2016-09-01T23:17:06Z

Initial commit of the UnifiedHighlighter for OSS contribution

commit e45e39bc4b07ea33e4423b264c2fefb9aa08777a
Author: David Smiley 
Date:   2016-09-02T12:45:49Z

Fix misc issues; "ant test" now works. (#1)

commit 046a28ef31acf4cea7d2554b827e6a714e3d
Author: Timothy Rodriguez 
Date:   2016-09-02T20:58:31Z

Minor refactoring of the AnalysisFieldHighlighter

commit ccd1a2280abd4b48cfef8122696e5d9cfd12920f
Author: David Smiley 
Date:   2016-09-03T12:55:20Z

AbstractFieldHighlighter: order methods more sensibly; renamed a couple.

commit d4714a04a3e41d5e95bbe942b275c32ed69b9c2e
Author: David Smiley 
Date:   2016-09-04T01:03:29Z

Improve javadocs and @lucene.external/internal labeling & scope.
"ant precommit" now passes.

commit e0659f18a59bf2893076da6d7643ff30f2fa5a52
Author: David Smiley 
Date:   2016-09-04T01:25:55Z

Analysis: remove dubious filter() method

commit ccd7ce707bff2c06da89b31853cca9aecea72008
Author: David Smiley 
Date:   2016-09-04T01:44:01Z

getStrictPhraseHelper -> rm "Strict", getHighlightAccuracy -> getFlags, and 
only call filterExtractedTerms once.

commit ffc2a22c700b8abcbf87673d5d05bb3659d177c9
Author: David Smiley 
Date:   2016-09-04T15:21:08Z

UnifiedHighlighter round 2 (#2)

* AbstractFieldHighlighter: order methods more sensibly; renamed a couple.

* Improve javadocs and @lucene.external/internal labeling & scope.
"ant precommit" now passes.

* Analysis: remove dubious filter() method

* getStrictPhraseHelper -> rm "Strict", getHighlightAccuracy -> getFlags, 
and only call filterExtractedTerms once.

commit 5f95e05595db462d3ab5bffc68c2c92f70875072
Author: David Smiley 
Date:   2016-09-04T16:12:33Z

Refactor: FieldOffsetStrategy

commit 86fb6265fbbdb955ead6d4baf944bf708175715e
Author: David Smiley 
Date:   2016-09-04T16:21:32Z

stop passing maxPassages into highlightFieldForDoc()

commit f6fd80544eae9fab953b94b1e9346c0883f956eb
Author: David Smiley 
Date:   2016-09-04T16:12:33Z

Refactor: FieldOffsetStrategy

commit b335a673c2ce45904890c1e9af7cbfda2bd27b0f
Author: David Smiley 
Date:   2016-09-04T16:21:32Z

stop passing maxPassages into highlightFieldForDoc()

commit 478db9437b92214cbf459f82ba2e3a67c966a150
Author: David Smiley 
Date:   2016-09-04T18:29:44Z

Rename subclasses of FieldOffsetStrategy.

commit dbf4280755c11420a5032445cd618fadb7444b61
Author: David Smiley 
Date:   2016-09-04T18:31:34Z

Re-order and harmonize params on methods called by UH.getFieldHighlighter()

commit f0340e27e61dcda2e11992f08ec07a72fad6c24c
Author: David Smiley 
Date:   2016-09-04T18:53:51Z

FieldHighlighter: harmonize field/param order. And don't apply 
maxNoHighlightPasses twice.

commit 817f63c1d48fd523c13b9c40a2ae9b8a4047209a
Author: Timothy Rodriguez 
Date:   2016-09-06T20:43:20Z

Merge of renaming changes

commit 0f644a4f53c1ed4d41d562848f6fe51a87442a75
Author: Timothy Rodriguez 
Date:   2016-09-06T20:54:13Z

add visibility tests

commit 9171f49e117085e7d086267bb73836831ff07f8e
Author: Timothy Rodriguez 
Date:   2016-09-07T14:26:59Z

ADd additional extensibility test




> UnifiedHighlighter
> --
>
> Key: LUCENE-7438
> URL: https://issues.apache.org/jira/browse/LUCENE-7438
> Project: Lucene - Core
>  Issue Type: Improvement
>  

[jira] [Commented] (LUCENE-7438) UnifiedHighlighter

2016-09-07 Thread Timothy M. Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15470848#comment-15470848
 ] 

Timothy M. Rodriguez commented on LUCENE-7438:
--

Some additional information:

h2. Missing features & possible future improvements:
Despite the offset source flexibility and accuracy options of this highlighter, 
it continues to be the case that some highlighters have unique features.  The 
following features are in the standard Highlighter (and possibly 
FastVectorHighlighter) but are not in the UnifiedHighlighter (and thus not 
PostingsHighlighter either since UH is derived from PH):
* Being able to disable “requireFieldMatch” to thus highlight a query 
insensitive to whatever fields are mentioned in the query.
* Using boosts in the query to weight passages.
* Regex pased passage delineation. Though I’m unsure if anyone cares given the 
existing BreakIterator options available.
Aside from addressing the feature gaps listed above, there are a couple known 
things that would be nice to add:
* The phrase highlighting (implemented by PhraseHelper) could be made more 
accurate, and probably faster too, by using techniques in Alan’s Luwak system 
that uses the Lucene SpanCollector API introduced in Lucene 5.3. It wasn’t done 
this way to begin with because this highlighter was developed originally for 
Lucene 4.10.
* Wildcard queries usually use TokenStreamFromTermVector, which uninverts the 
terms out of a Terms index.  Instead, we now think it would be better to create 
a bunch of PostingsEnum for each matching term. This would bring about some 
simplifications and efficiencies, and can lead to better passage relevancy. A 
bonus would be aggregating terms matching the same automata into a merged 
PostingsEnum that has a freq() based on the sum of the underlying matching 
terms.

h2. Changes from the PostingsHighlighter 
* The UH is more stateful
** Holds the IndexSearcher instead of asking most methods to pass it through.
** Options now have simple setters, and the per-field getters return these. 
This means the common case of a setting being non-specific to a field doesn’t 
require subclassing.
* Multi-valued field handling is improved to ensure that a passage will never 
span across values, plus it honors the positionIncrementGap for an analyzed 
offset source. See MultiValueTokenStream and SplittingBreakIterator.
* The PH caches all content to be highlighted for all docs and then highlights 
it all.  The UH has a limit on this which led to a batching approach.  But if 
all fields use an Analyzer or if more than one use term vectors, then instead 
highlighting happens one doc at a time since the up-front content caching is 
not helpful.
* No longer tries to re-use PostingsEnums (or TermsEnum or LeafReader) from one 
doc to the next. This really simplified some code; it didn’t seem worth it.
* MultiTermHighlighting’s fake PostingsEnum was made Closeable and we close it 
to guard against ramifications of exceptions being thrown during highlighting 
(e.g. a BreakIterator bug or TokenStream bug). Nasty to debug!
* (from standard Highlighter) TokenStreamFromTermVector: optimizations to 
uninvert filtered (thus sparse) Terms.

h2. Non-Core Dependencies
* MemoryIndex: For Analyzer based highlighting when phrases need to be 
highlighted accurately.
* Standard Highlighter things:
** TokenStreamFromTermVector: For most multi-term queries. The UH actually has 
its own derived copy that has been optimized to handle filtered (thus sparse) 
Terms. With further work, we could switch to a different approach and remove it 
(as indicated earlier).  For as long as it stays, it’s also possible to replace 
the existing one with this if we want to do that.
** WeightedSpanTermExtractor: For highlighting phrases accurately to re-use 
it’s SpanQuery conversion and rewrite detecting abilities.  Perhaps these parts 
of WSTE could move to general SpanQuery utilities.
** TermVectorLeafReader: When highlighting offsets from term vectors.
* PostingHighlighter things:
** Technically, Nothing however it has multiple copies of some things that have 
not been modified: Passage, PassageScorer, PassageFormatter, 
DefaultPassageFormatter.
** Note: Utility BreakIterators are of use to the PH, UH, and even the FVH: 
WholeBreakIterator, CustomSeparatorBreakIterator.  Maybe they should move to a 
utils package that isn’t in any of these highlighters?


> UnifiedHighlighter
> --
>
> Key: LUCENE-7438
> URL: https://issues.apache.org/jira/browse/LUCENE-7438
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/highlighter
>Affects Versions: 6.2
>Reporter: Timothy M. Rodriguez
>
> The UnifiedHighlighter is an evolution of the PostingsHighlighter that is 
> able to highlight using offsets in either postings, term vectors, or from 
>