[jira] [Updated] (LUCENE-7620) UnifiedHighlighter: add target character width BreakIterator wrapper
[ https://issues.apache.org/jira/browse/LUCENE-7620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Smiley updated LUCENE-7620:

    Attachment: LUCENE_7620_UH_LengthGoalBreakIterator.patch

Here's an update to the patch mostly related to testing to clarify what's being tested. And I did the {{createClosestToLength}} rename.

> UnifiedHighlighter: add target character width BreakIterator wrapper
>
>                 Key: LUCENE-7620
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7620
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/highlighter
>            Reporter: David Smiley
>            Assignee: David Smiley
>             Fix For: 6.4
>
>         Attachments: LUCENE_7620_UH_LengthGoalBreakIterator.patch,
> LUCENE_7620_UH_LengthGoalBreakIterator.patch,
> LUCENE_7620_UH_LengthGoalBreakIterator.patch
>
>
> The original Highlighter includes a {{SimpleFragmenter}} that delineates
> fragments (aka Passages) by a character width. The default is 100 characters.
> It would be great to support something similar for the UnifiedHighlighter.
> It's useful in its own right and of course it helps users transition to the
> UH. I'd like to do it as a wrapper to another BreakIterator -- perhaps a
> sentence one. In this way you get back Passages that are a number of
> sentences so they will look nice instead of breaking mid-way through a
> sentence. And you get some control by specifying a target number of
> characters. This BreakIterator wouldn't be a general purpose
> java.text.BreakIterator since it would assume it's called in a manner exactly
> as the UnifiedHighlighter uses it. It would probably be compatible with the
> PostingsHighlighter too.
> I don't propose doing this by default; besides, it's easy enough to pick your
> BreakIterator config.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
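For readers following along, here is a rough sketch of what a "closest to length" choice could look like using only {{java.text.BreakIterator}}. It is not the code in the patch: the class and method names ({{ClosestBreakSketch}}, {{closestBreak}}) are hypothetical, and the tie-breaking rule (prefer the earlier break) is an assumption of this sketch.

```java
import java.text.BreakIterator;

/** Illustrative only; not the LengthGoalBreakIterator from the patch. */
final class ClosestBreakSketch {

    /**
     * Returns the boundary of baseIter closest to start + targetLength.
     * The result is always greater than start and at most text.length().
     */
    public static int closestBreak(String text, BreakIterator baseIter,
                                   int start, int targetLength) {
        if (targetLength < 1) {
            throw new IllegalArgumentException("targetLength must be >= 1");
        }
        if (start < 0 || start >= text.length()) {
            throw new IllegalArgumentException("start out of range: " + start);
        }
        baseIter.setText(text);

        // The ideal end of the passage, clamped to the end of the text.
        int goal = Math.min(start + targetLength, text.length());
        if (baseIter.isBoundary(goal)) {
            return goal; // the goal itself is a legal break
        }

        // Look at the nearest base boundary on each side of the goal.
        int after = baseIter.following(goal);
        int before = baseIter.preceding(goal);
        if (after == BreakIterator.DONE) {
            after = text.length();
        }

        // A break at or before start would produce an empty passage.
        if (before == BreakIterator.DONE || before <= start) {
            return after;
        }

        // Pick whichever side is nearer; ties go to the earlier break
        // (an arbitrary choice made for this sketch).
        return (goal - before <= after - goal) ? before : after;
    }
}
```

With a word iterator over "aaaa bbbb cccc" (boundaries at 0, 4, 5, 9, 10, 14), a target of 6 from offset 0 lands on 5, while a target of 8 lands on 9.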
[jira] [Updated] (LUCENE-7620) UnifiedHighlighter: add target character width BreakIterator wrapper
[ https://issues.apache.org/jira/browse/LUCENE-7620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Smiley updated LUCENE-7620:

    Fix Version/s: 6.4
[jira] [Updated] (LUCENE-7620) UnifiedHighlighter: add target character width BreakIterator wrapper
[ https://issues.apache.org/jira/browse/LUCENE-7620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Smiley updated LUCENE-7620:

    Attachment: LUCENE_7620_UH_LengthGoalBreakIterator.patch

Here's an updated patch. I added assertions rather than exceptions because if by chance one of these circumstances occurs in production, it's really okay to return a possibly wrong break and end up with a passage that isn't quite the ideal size, rather than throw an exception.

It now has 2 modes of operation, with 2 corresponding factory methods to make clear which is which: {{createMinLength(...)}} and {{createTargetLength(...)}}. The minLength mode might be useful because it's faster than target. I think it's more useful than a MaxLength mode (which could still be added in the future) because a too-long passage can be trimmed by the client, but the reverse is not true -- you can't lengthen a passage that is too short once it reaches the client talking to a search server.

I did some benchmarking too, which in addition to measuring the overhead also helped ensure it doesn't throw exceptions (at least for the test queries & test data). That never happened though; I squashed bugs in the test and chose sizes to tease out the edge conditions. In so doing I found a minor bug with CustomSeparatorBreakIterator, but I'll leave that for another time.

Benchmarking showed that minLength is noticeably faster than targetLength, maybe 10% overall. Also (something I already knew), a "cheap" underlying BreakIterator like CustomSeparatorBreakIterator is ~20% faster than a JDK sentence one.

I'll commit it this weekend, or possibly tonight if you review it positively in time.
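To make the min-length mode concrete, here is a minimal sketch using only {{java.text.BreakIterator}}. It is not the code in the patch: the names ({{MinLengthPassages}}, {{split}}) are hypothetical, and it splits a whole string eagerly instead of acting as a BreakIterator called the way the UnifiedHighlighter calls one. It does show why this mode can be cheap: each passage needs only one forward lookup on the underlying iterator.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

/** Illustrative only; not the LengthGoalBreakIterator from the patch. */
final class MinLengthPassages {

    /**
     * Splits text into passages that each end on a boundary of baseIter and
     * are at least minLength characters long (the last one may be shorter).
     */
    public static List<String> split(String text, BreakIterator baseIter, int minLength) {
        if (minLength < 1) {
            throw new IllegalArgumentException("minLength must be >= 1");
        }
        baseIter.setText(text);
        List<String> passages = new ArrayList<>();
        int start = baseIter.first();
        while (start < text.length()) {
            // Skip minLength characters ahead, clamped to the end of the text.
            int goal = Math.min(start + minLength, text.length());
            // Take the first base boundary at or after the goal.
            int end = baseIter.isBoundary(goal) ? goal : baseIter.following(goal);
            if (end == BreakIterator.DONE) {
                end = text.length();
            }
            passages.add(text.substring(start, end));
            start = end;
        }
        return passages;
    }
}
```

For example, with a word iterator over "aaaa bbbb cccc" and a minimum of 6, this yields "aaaa bbbb" and " cccc". Passing {{BreakIterator.getSentenceInstance()}} instead would give passages made of whole sentences.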
[jira] [Updated] (LUCENE-7620) UnifiedHighlighter: add target character width BreakIterator wrapper
[ https://issues.apache.org/jira/browse/LUCENE-7620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Smiley updated LUCENE-7620:

    Attachment: LUCENE_7620_UH_LengthGoalBreakIterator.patch

Here's a patch. I'm calling it {{LengthGoalBreakIterator}}. In time, perhaps we might add some tweaks like a "slop" akin to the LuceneRegexFragmenter (in Solr).

[~jim.ferenczi] I thought you might want to take a peek. I figure this can get into 6.4; I'll commit it this weekend.