[jira] [Updated] (LUCENE-7620) UnifiedHighlighter: add target character width BreakIterator wrapper

2017-01-06 Thread David Smiley (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Smiley updated LUCENE-7620:
-
Attachment: LUCENE_7620_UH_LengthGoalBreakIterator.patch

Here's an updated patch.  The changes are mostly to the tests, to clarify what 
is being tested.  I also did the rename to {{createClosestToLength}}.

> UnifiedHighlighter: add target character width BreakIterator wrapper
> 
>
> Key: LUCENE-7620
> URL: https://issues.apache.org/jira/browse/LUCENE-7620
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/highlighter
> Reporter: David Smiley
> Assignee: David Smiley
> Fix For: 6.4
>
> Attachments: LUCENE_7620_UH_LengthGoalBreakIterator.patch, 
> LUCENE_7620_UH_LengthGoalBreakIterator.patch, 
> LUCENE_7620_UH_LengthGoalBreakIterator.patch
>
>
> The original Highlighter includes a {{SimpleFragmenter}} that delineates 
> fragments (aka Passages) by a character width.  The default is 100 characters.
> It would be great to support something similar for the UnifiedHighlighter.  
> It's useful in its own right and of course it helps users transition to the 
> UH.  I'd like to do it as a wrapper to another BreakIterator -- perhaps a 
> sentence one.  In this way you get back Passages that are a number of 
> sentences so they will look nice instead of breaking mid-way through a 
> sentence.  And you get some control by specifying a target number of 
> characters.  This BreakIterator wouldn't be a general purpose 
> java.text.BreakIterator since it would assume it's called in a manner exactly 
> as the UnifiedHighlighter uses it.  It would probably be compatible with the 
> PostingsHighlighter too.
> I don't propose doing this by default; besides, it's easy enough to pick your 
> BreakIterator config.
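
The idea in the description above can be sketched with nothing but the JDK's 
{{java.text.BreakIterator}}.  This is only an illustration under stated 
assumptions, not the attached patch: the class {{SentencePassages}} and its 
{{split}} method are hypothetical names, and the real wrapper plugs into the 
UnifiedHighlighter rather than returning strings.  It groups whole sentences 
into passages of at least a given number of characters.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration; not the LengthGoalBreakIterator from the patch.
class SentencePassages {

    /**
     * Splits text into passages made of whole sentences. Every passage except
     * possibly the last is at least minLength characters long.
     */
    static List<String> split(String text, int minLength) {
        if (minLength < 1) {
            throw new IllegalArgumentException("minLength must be >= 1");
        }
        BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ROOT);
        sentences.setText(text);

        List<String> passages = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int goal = start + minLength;
            int end;
            if (goal >= text.length()) {
                // Not enough text left to reach the goal: take the remainder.
                end = text.length();
            } else {
                // following(n) returns the first boundary strictly after n, so
                // asking for goal - 1 also accepts a boundary exactly at goal.
                end = sentences.following(goal - 1);
                if (end == BreakIterator.DONE) {
                    end = text.length();
                }
            }
            passages.add(text.substring(start, end));
            start = end;
        }
        return passages;
    }
}
```

Calling {{SentencePassages.split(text, 100)}} would give passages of roughly 
the default {{SimpleFragmenter}} size, except that each one ends on a sentence 
boundary instead of mid-sentence.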



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-7620) UnifiedHighlighter: add target character width BreakIterator wrapper

2017-01-06 Thread David Smiley (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Smiley updated LUCENE-7620:
-
Fix Version/s: 6.4




[jira] [Updated] (LUCENE-7620) UnifiedHighlighter: add target character width BreakIterator wrapper

2017-01-06 Thread David Smiley (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Smiley updated LUCENE-7620:
-
Attachment: LUCENE_7620_UH_LengthGoalBreakIterator.patch

Here's an updated patch.  I added assertions rather than exceptions: if by 
chance this circumstance happens in production, it's really okay to return 
a possibly wrong break, and so a passage that isn't quite the ideal size, 
rather than throw an exception.
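
To illustrate that trade-off, here is a minimal sketch.  The comment above does 
not say which circumstance is being asserted, so the condition below (a request 
for a break before the iterator's current position) is purely an assumption, 
and {{LenientBreaks}} and {{followingFromCurrent}} are hypothetical names rather 
than anything from the patch.

```java
import java.text.BreakIterator;

// Hypothetical illustration of "assert, don't throw"; not code from the patch.
class LenientBreaks {

    /**
     * Returns the first boundary after offset. Callers are expected to move
     * forward only; a backwards request is flagged by an assertion when
     * assertions are enabled (java -ea, as test runners usually do) and is
     * otherwise tolerated.
     */
    static int followingFromCurrent(BreakIterator base, int offset) {
        int current = base.current();
        // Assumed expectation for this sketch: offset is never before current.
        assert offset >= current
            : "unexpected backwards call: " + offset + " < " + current;
        // With assertions disabled (the JVM default), we simply delegate. The
        // break may not be ideal, but the caller still gets a usable one.
        return base.following(offset);
    }
}
```

Because the JVM disables {{assert}} by default, a violated expectation fails 
loudly in tests but only degrades the passage size in production.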

It now has 2 modes of operation, with 2 corresponding factory methods to make 
clear which is which: {{createMinLength(...)}} and {{createTargetLength(...)}}.  
The minLength mode might be useful because it's faster than the target mode.  
I think it's more useful than a MaxLength mode (which could still be added in 
the future) because a too-long passage can possibly be trimmed by the client, 
but the reverse is not true -- you can't lengthen a passage that is too short 
once it reaches the client talking to a search server.
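
The difference between the two modes can be sketched as follows.  This is an 
assumption-laden illustration on top of {{java.text.BreakIterator}}, not the 
patch itself: {{LengthGoalModes}}, {{minLengthEnd}} and {{closestEnd}} are 
hypothetical names, and the tie-breaking and empty-passage rules are my own 
choices.  It does show one plausible reason why a target mode costs more: it 
has to consult the underlying iterator a second time.

```java
import java.text.BreakIterator;

// Hypothetical illustration of the two modes; not code from the patch.
class LengthGoalModes {

    /** Min-length mode: the first boundary at or after start + lengthGoal. */
    static int minLengthEnd(BreakIterator base, int textLength, int start, int lengthGoal) {
        if (lengthGoal < 1) {
            throw new IllegalArgumentException("lengthGoal must be >= 1");
        }
        int goal = start + lengthGoal;
        if (goal >= textLength) {
            return textLength;
        }
        // One call to the underlying iterator; goal - 1 so that a boundary
        // exactly at goal is accepted.
        int after = base.following(goal - 1);
        return after == BreakIterator.DONE ? textLength : after;
    }

    /** Target mode: whichever boundary around start + lengthGoal is nearer. */
    static int closestEnd(BreakIterator base, int textLength, int start, int lengthGoal) {
        int after = minLengthEnd(base, textLength, start, lengthGoal);
        int goal = start + lengthGoal;
        if (goal >= textLength) {
            return after;
        }
        // Second call: the last boundary strictly before the goal.
        int before = base.preceding(goal);
        if (before == BreakIterator.DONE || before <= start) {
            return after; // never produce an empty passage
        }
        // Ties go to the later boundary (an arbitrary choice for this sketch).
        return (goal - before) < (after - goal) ? before : after;
    }
}
```

With a sentence iterator, {{minLengthEnd}} never returns a passage shorter than 
the goal unless the text runs out, while {{closestEnd}} may stop one sentence 
earlier when that lands nearer the goal.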

I did some benchmarking too, which, in addition to measuring the overhead, 
helped ensure that it doesn't throw exceptions (at least for the test queries 
& test data).  That never happened, though; I squashed bugs in the test and 
chose sizes to tease out the edge conditions.  In so doing I found a minor bug 
in CustomSeparatorBreakIterator, but I'll leave that for another time.  
Benchmarking showed that the minLength mode is noticeably faster than the 
targetLength mode, maybe 10% overall.  I also observed (as I already knew) 
that a "cheap" underlying BreakIterator like CustomSeparatorBreakIterator is 
~20% faster than a JDK sentence one.

I'll commit it this weekend, or possibly tonight if you review it positively 
in time.




[jira] [Updated] (LUCENE-7620) UnifiedHighlighter: add target character width BreakIterator wrapper

2017-01-05 Thread David Smiley (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Smiley updated LUCENE-7620:
-
Attachment: LUCENE_7620_UH_LengthGoalBreakIterator.patch

Here's a patch.  I'm calling it {{LengthGoalBreakIterator}}.  In time, we might 
add some tweaks, such as a "slop" akin to the one in the LuceneRegexFragmenter 
(in Solr).

[~jim.ferenczi] I thought you might want to take a peek.  I figure this can get 
into 6.4; I'll commit it this weekend.
