Re: unified highlighter performance in solr 8.5.1

2020-07-05 Thread David Smiley
Here's my PR, which includes some edits to the ref guide docs where I tried
to clarify these settings a little too.
https://github.com/apache/lucene-solr/pull/1651
~ David


On Sat, Jul 4, 2020 at 8:44 AM Nándor Mátravölgyi 
wrote:

> I guess that's fair. Let's have hl.fragsizeIsMinimum=true as default.
>
> On 7/4/20, David Smiley  wrote:
> > I doubt that WORD mode is impacted much by hl.fragsizeIsMinimum in terms of
> > quality of the highlight since there are vastly more breaks to pick from.
> > I think that setting is more useful in SENTENCE mode if you can stand the
> > perf hit.  If you agree, then why not just let this one default to "true"?
> >
> > We agree on better documenting the perf trade-off.
> >
> > Thanks again for working on these settings, BTW.
> >
> > ~ David
> >
> >
> > On Fri, Jul 3, 2020 at 1:25 PM Nándor Mátravölgyi <nandor.ma...@gmail.com>
> > wrote:
> >
> >> Since the issue seems to be affecting the highlighter differently
> >> based on which mode it is using, having different defaults for the
> >> modes could be explored.
> >>
> >> WORD may have the new defaults as it has little effect on performance
> >> and it creates nicer highlights.
> >> SENTENCE should have the defaults that produce reasonable performance.
> >> The docs could document this while also mentioning that the UH's
> >> performance is highly dependent on the underlying Java String/Text?
> >> Iterator.
> >>
> >> One can argue that having different defaults based on mode is
> >> confusing. In this case I think the defaults should be changed to have
> >> the SENTENCE mode perform better. Maybe the options for nice
> >> highlights with WORD mode could be put into the docs in this case as
> >> some form of an example.
> >>
> >> As long as I can use the UH with nicely aligned snippets in WORD mode
> >> I'm fine with any defaults. I explicitly set them in the config and in
> >> the queries most of the time anyways.
> >>
> >
>


Re: unified highlighter performance in solr 8.5.1

2020-07-04 Thread Nándor Mátravölgyi
I guess that's fair. Let's have hl.fragsizeIsMinimum=true as default.

On 7/4/20, David Smiley  wrote:
> I doubt that WORD mode is impacted much by hl.fragsizeIsMinimum in terms of
> quality of the highlight since there are vastly more breaks to pick from.
> I think that setting is more useful in SENTENCE mode if you can stand the
> perf hit.  If you agree, then why not just let this one default to "true"?
>
> We agree on better documenting the perf trade-off.
>
> Thanks again for working on these settings, BTW.
>
> ~ David
>
>
> On Fri, Jul 3, 2020 at 1:25 PM Nándor Mátravölgyi 
> wrote:
>
>> Since the issue seems to be affecting the highlighter differently
>> based on which mode it is using, having different defaults for the
>> modes could be explored.
>>
>> WORD may have the new defaults as it has little effect on performance
>> and it creates nicer highlights.
>> SENTENCE should have the defaults that produce reasonable performance.
>> The docs could document this while also mentioning that the UH's
>> performance is highly dependent on the underlying Java String/Text?
>> Iterator.
>>
>> One can argue that having different defaults based on mode is
>> confusing. In this case I think the defaults should be changed to have
>> the SENTENCE mode perform better. Maybe the options for nice
>> highlights with WORD mode could be put into the docs in this case as
>> some form of an example.
>>
>> As long as I can use the UH with nicely aligned snippets in WORD mode
>> I'm fine with any defaults. I explicitly set them in the config and in
>> the queries most of the time anyways.
>>
>


Re: unified highlighter performance in solr 8.5.1

2020-07-03 Thread David Smiley
I doubt that WORD mode is impacted much by hl.fragsizeIsMinimum in terms of
quality of the highlight since there are vastly more breaks to pick from.
I think that setting is more useful in SENTENCE mode if you can stand the
perf hit.  If you agree, then why not just let this one default to "true"?

We agree on better documenting the perf trade-off.

Thanks again for working on these settings, BTW.

~ David


On Fri, Jul 3, 2020 at 1:25 PM Nándor Mátravölgyi 
wrote:

> Since the issue seems to be affecting the highlighter differently
> based on which mode it is using, having different defaults for the
> modes could be explored.
>
> WORD may have the new defaults as it has little effect on performance
> and it creates nicer highlights.
> SENTENCE should have the defaults that produce reasonable performance.
> The docs could document this while also mentioning that the UH's
> performance is highly dependent on the underlying Java String/Text?
> Iterator.
>
> One can argue that having different defaults based on mode is
> confusing. In this case I think the defaults should be changed to have
> the SENTENCE mode perform better. Maybe the options for nice
> highlights with WORD mode could be put into the docs in this case as
> some form of an example.
>
> As long as I can use the UH with nicely aligned snippets in WORD mode
> I'm fine with any defaults. I explicitly set them in the config and in
> the queries most of the time anyways.
>


Re: unified highlighter performance in solr 8.5.1

2020-07-03 Thread Nándor Mátravölgyi
Since the issue seems to be affecting the highlighter differently
based on which mode it is using, having different defaults for the
modes could be explored.

WORD may have the new defaults as it has little effect on performance
and it creates nicer highlights.
SENTENCE should have the defaults that produce reasonable performance.
The docs could document this while also mentioning that the UH's
performance is highly dependent on the underlying Java String/Text?
Iterator.

One can argue that having different defaults based on mode is
confusing. In this case I think the defaults should be changed to have
the SENTENCE mode perform better. Maybe the options for nice
highlights with WORD mode could be put into the docs in this case as
some form of an example.

As long as I can use the UH with nicely aligned snippets in WORD mode
I'm fine with any defaults. I explicitly set them in the config and in
the queries most of the time anyways.


Re: unified highlighter performance in solr 8.5.1

2020-07-03 Thread David Smiley
I think we should flip the default of hl.fragsizeIsMinimum to be 'true',
thus have the behavior close to what preceded 8.5.
(a) it was very recently (<= 8.4) the previous behavior and so may require
less tuning for users in 8.6 henceforth
(b) it's significantly faster for long text -- seems to be 2x to 5x for
long documents (assuming no change in hl.fragAlignRatio).  If the user
additionally configures hl.fragAlignRatio to 0 (also the previous behavior;
0.5 is the new default), I saw another 6x on top of that for "doc3" in the
test data Michal prepared.

Although I like that the sizing looks nicer, I think that is more from the
introduction and new default of hl.fragAlignRatio=0.5 than it is
hl.fragsizeIsMinimum=false.  We might even consider lowering
hl.fragAlignRatio to say 0.3 and retain pretty reasonable highlights
(avoids the extreme cases occurring with '0') and additional performance
benefit from that.

What do you think Nandor, Michal?

I'm hoping a change in settings (+ some better notes/docs on this) could
slip into an 8.6, all done by myself ASAP.

~ David
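
[Editor's illustration] The effect of hl.fragAlignRatio described above can be pictured with a small sketch. This is a simplified model and not Lucene's LengthGoalBreakIterator code: the class and method names are hypothetical, and it assumes the ratio simply splits the hl.fragsize budget into a part before the match and a part after it, which the real highlighter would then snap to the nearest break.

```java
import java.util.Arrays;

class FragmentWindow {

    /**
     * Simplified model (assumption, not Lucene's implementation):
     * alignRatio 0 spends the whole fragsize budget after the match,
     * 0.5 centers the match, 1 spends the whole budget before it.
     * Returns {targetStart, targetEnd} clamped to the text bounds.
     */
    static int[] targetWindow(int matchStart, int matchEnd, int fragsize,
                              float alignRatio, int textLength) {
        int before = (int) (fragsize * alignRatio); // characters wanted before the match
        int after = fragsize - before;              // characters wanted after the match
        int start = Math.max(0, matchStart - before);
        int end = Math.min(textLength, matchEnd + after);
        return new int[] {start, end};
    }

    public static void main(String[] args) {
        // A match at offsets 1000..1010 in a 5000-character text, fragsize 70.
        System.out.println(Arrays.toString(targetWindow(1000, 1010, 70, 0.5f, 5000)));
        System.out.println(Arrays.toString(targetWindow(1000, 1010, 70, 0.0f, 5000)));
    }
}
```

With a ratio of 0 nothing is requested before the match, so in this model the backward search has no length goal to satisfy, which is consistent with the lower cost reported for that setting.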


On Fri, Jun 19, 2020 at 2:32 PM Nándor Mátravölgyi 
wrote:

> Hi!
>
> With the provided test I've profiled the preceding() and following()
> calls on the base Java iterators in the different options.
>
> === default highlighter arguments ===
> Calling the test query with SENTENCE base iterator:
> - from LengthGoalBreakIterator.following(): 1130 calls of
> baseIter.preceding() took 1.039629 seconds in total
> - from LengthGoalBreakIterator.following(): 1140 calls of
> baseIter.following() took 0.340679 seconds in total
> - from LengthGoalBreakIterator.preceding(): 1150 calls of
> baseIter.preceding() took 0.099344 seconds in total
> - from LengthGoalBreakIterator.preceding(): 1100 calls of
> baseIter.following() took 0.015156 seconds in total
>
> Calling the test query with WORD base iterator:
> - from LengthGoalBreakIterator.following(): 1200 calls of
> baseIter.preceding() took 0.001006 seconds in total
> - from LengthGoalBreakIterator.following(): 1700 calls of
> baseIter.following() took 0.006278 seconds in total
> - from LengthGoalBreakIterator.preceding(): 1710 calls of
> baseIter.preceding() took 0.016320 seconds in total
> - from LengthGoalBreakIterator.preceding(): 1090 calls of
> baseIter.following() took 0.000527 seconds in total
>
> === hl.fragsizeIsMinimum=true&hl.fragAlignRatio=0 ===
> Calling the test query with SENTENCE base iterator:
> - from LengthGoalBreakIterator.following(): 860 calls of
> baseIter.following() took 0.012593 seconds in total
> - from LengthGoalBreakIterator.preceding(): 870 calls of
> baseIter.preceding() took 0.022208 seconds in total
>
> Calling the test query with WORD base iterator:
> - from LengthGoalBreakIterator.following(): 1360 calls of
> baseIter.following() took 0.004789 seconds in total
> - from LengthGoalBreakIterator.preceding(): 1370 calls of
> baseIter.preceding() took 0.015983 seconds in total
>
> === hl.fragsizeIsMinimum=true ===
> Calling the test query with SENTENCE base iterator:
> - from LengthGoalBreakIterator.following(): 980 calls of
> baseIter.following() took 0.010253 seconds in total
> - from LengthGoalBreakIterator.preceding(): 980 calls of
> baseIter.preceding() took 0.341997 seconds in total
>
> Calling the test query with WORD base iterator:
> - from LengthGoalBreakIterator.following(): 1670 calls of
> baseIter.following() took 0.005150 seconds in total
> - from LengthGoalBreakIterator.preceding(): 1680 calls of
> baseIter.preceding() took 0.013657 seconds in total
>
> === hl.fragAlignRatio=0 ===
> Calling the test query with SENTENCE base iterator:
> - from LengthGoalBreakIterator.following(): 1070 calls of
> baseIter.preceding() took 1.312056 seconds in total
> - from LengthGoalBreakIterator.following(): 1080 calls of
> baseIter.following() took 0.678575 seconds in total
> - from LengthGoalBreakIterator.preceding(): 1080 calls of
> baseIter.preceding() took 0.020507 seconds in total
> - from LengthGoalBreakIterator.preceding(): 1080 calls of
> baseIter.following() took 0.006977 seconds in total
>
> Calling the test query with WORD base iterator:
> - from LengthGoalBreakIterator.following(): 880 calls of
> baseIter.preceding() took 0.000706 seconds in total
> - from LengthGoalBreakIterator.following(): 1370 calls of
> baseIter.following() took 0.004110 seconds in total
> - from LengthGoalBreakIterator.preceding(): 1380 calls of
> baseIter.preceding() took 0.014752 seconds in total
> - from LengthGoalBreakIterator.preceding(): 1380 calls of
> baseIter.following() took 0.000106 seconds in total
>
> There is definitely a big difference between SENTENCE and WORD. I'm
> not sure how we can improve the logic on our side while keeping the
> features as is. Since the number of calls is roughly the same for when
> the performance is good and bad, it seems to depend on what the text
> is that the iterator is traversing.
>


Re: unified highlighter performance in solr 8.5.1

2020-06-19 Thread Nándor Mátravölgyi
Hi!

With the provided test I've profiled the preceding() and following()
calls on the base Java iterators in the different options.

=== default highlighter arguments ===
Calling the test query with SENTENCE base iterator:
- from LengthGoalBreakIterator.following(): 1130 calls of
baseIter.preceding() took 1.039629 seconds in total
- from LengthGoalBreakIterator.following(): 1140 calls of
baseIter.following() took 0.340679 seconds in total
- from LengthGoalBreakIterator.preceding(): 1150 calls of
baseIter.preceding() took 0.099344 seconds in total
- from LengthGoalBreakIterator.preceding(): 1100 calls of
baseIter.following() took 0.015156 seconds in total

Calling the test query with WORD base iterator:
- from LengthGoalBreakIterator.following(): 1200 calls of
baseIter.preceding() took 0.001006 seconds in total
- from LengthGoalBreakIterator.following(): 1700 calls of
baseIter.following() took 0.006278 seconds in total
- from LengthGoalBreakIterator.preceding(): 1710 calls of
baseIter.preceding() took 0.016320 seconds in total
- from LengthGoalBreakIterator.preceding(): 1090 calls of
baseIter.following() took 0.000527 seconds in total

=== hl.fragsizeIsMinimum=true&hl.fragAlignRatio=0 ===
Calling the test query with SENTENCE base iterator:
- from LengthGoalBreakIterator.following(): 860 calls of
baseIter.following() took 0.012593 seconds in total
- from LengthGoalBreakIterator.preceding(): 870 calls of
baseIter.preceding() took 0.022208 seconds in total

Calling the test query with WORD base iterator:
- from LengthGoalBreakIterator.following(): 1360 calls of
baseIter.following() took 0.004789 seconds in total
- from LengthGoalBreakIterator.preceding(): 1370 calls of
baseIter.preceding() took 0.015983 seconds in total

=== hl.fragsizeIsMinimum=true ===
Calling the test query with SENTENCE base iterator:
- from LengthGoalBreakIterator.following(): 980 calls of
baseIter.following() took 0.010253 seconds in total
- from LengthGoalBreakIterator.preceding(): 980 calls of
baseIter.preceding() took 0.341997 seconds in total

Calling the test query with WORD base iterator:
- from LengthGoalBreakIterator.following(): 1670 calls of
baseIter.following() took 0.005150 seconds in total
- from LengthGoalBreakIterator.preceding(): 1680 calls of
baseIter.preceding() took 0.013657 seconds in total

=== hl.fragAlignRatio=0 ===
Calling the test query with SENTENCE base iterator:
- from LengthGoalBreakIterator.following(): 1070 calls of
baseIter.preceding() took 1.312056 seconds in total
- from LengthGoalBreakIterator.following(): 1080 calls of
baseIter.following() took 0.678575 seconds in total
- from LengthGoalBreakIterator.preceding(): 1080 calls of
baseIter.preceding() took 0.020507 seconds in total
- from LengthGoalBreakIterator.preceding(): 1080 calls of
baseIter.following() took 0.006977 seconds in total

Calling the test query with WORD base iterator:
- from LengthGoalBreakIterator.following(): 880 calls of
baseIter.preceding() took 0.000706 seconds in total
- from LengthGoalBreakIterator.following(): 1370 calls of
baseIter.following() took 0.004110 seconds in total
- from LengthGoalBreakIterator.preceding(): 1380 calls of
baseIter.preceding() took 0.014752 seconds in total
- from LengthGoalBreakIterator.preceding(): 1380 calls of
baseIter.following() took 0.000106 seconds in total

There is definitely a big difference between SENTENCE and WORD. I'm
not sure how we can improve the logic on our side while keeping the
features as is. Since the number of calls is roughly the same for when
the performance is good and bad, it seems to depend on what the text
is that the iterator is traversing.
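
[Editor's illustration] A comparable measurement can be tried outside Solr with the JDK's java.text.BreakIterator alone. The sketch below is not the profiling code used for the numbers above. The class name, the synthetic text, and the sampling scheme are assumptions, and timings on real documents (for example text extracted from multi-page PDFs) may look very different.

```java
import java.text.BreakIterator;
import java.util.Locale;

class BreakIteratorTiming {

    /** Builds a long synthetic text; a stand-in for a real document. */
    static String syntheticText(int sentences) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < sentences; i++) {
            sb.append("This is sentence number ").append(i)
              .append(" of the sample document. ");
        }
        return sb.toString();
    }

    /**
     * Calls preceding() and following() at roughly `samples` evenly spaced
     * offsets and returns the elapsed time in nanoseconds.
     */
    static long time(BreakIterator it, String text, int samples) {
        it.setText(text);
        int step = Math.max(1, text.length() / samples);
        long start = System.nanoTime();
        for (int offset = step; offset < text.length(); offset += step) {
            it.preceding(offset);
            it.following(offset);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        String text = syntheticText(20_000);
        long sentence = time(BreakIterator.getSentenceInstance(Locale.ROOT), text, 1000);
        long word = time(BreakIterator.getWordInstance(Locale.ROOT), text, 1000);
        System.out.printf("SENTENCE: %.3f ms, WORD: %.3f ms%n", sentence / 1e6, word / 1e6);
    }
}
```

Swapping the synthetic text for the content of a problematic document is the interesting experiment, since the numbers above suggest the cost depends on the text being traversed and not on the number of calls.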


Re: unified highlighter performance in solr 8.5.1

2020-06-08 Thread Michal Hlavac
Hi David,

sorry for my late answer. I created simple test scenarios on GitHub:
https://github.com/hlavki/solr-unified-highlighter-test [1]
There are 2 documents, both fairly large.
Test method:
https://github.com/hlavki/solr-unified-highlighter-test/blob/master/src/test/java/com/example/HighlightTest.java#L60 [2]

The result is that with hl.fragsizeIsMinimum=true&hl.fragAlignRatio=0 response
times are similar to Solr 8.4.1.
I didn't expect that the default configuration values would change response time
that drastically.

m.

On Wednesday, 27 May 2020 9:14:37 CEST David Smiley wrote:


try setting hl.fragsizeIsMinimum=true
I did some benchmarking and found that this helps quite a bit




BTW I used the highlights.alg benchmark file, with some changes to make it more
reflective of your scenario -- offsets in postings, and used "enwiki" (English
Wikipedia) docs which are larger than the Reuters ones (so it appears,
anyway).  I had to do a bit of hacking to use the LengthGoalBreakIterator, which
wasn't previously used by this framework.


~ David



On Tue, May 26, 2020 at 4:42 PM Michal Hlavac  wrote:


fine, I'll try to write a simple test, thanks

On Tuesday, 26 May 2020 17:44:52 CEST David Smiley wrote:
> Please create an issue.  I haven't reproduced it yet but it seems unlikely
> to be user-error.
> 
> ~ David
> 
> 
> On Mon, May 25, 2020 at 9:28 AM Michal Hlavac <_miso@hlavki.eu_> wrote:
> 
> > Hi,
> >
> > I have field:
> >  > stored="true" indexed="false" storeOffsetsWithPositions="true"/>
> >
> > and configuration:
> > true
> > unified
> > true
> > content_txt_sk_highlight
> > 2
> > true
> >
> > Doing query with hl.bs.type=SENTENCE it takes around 1000 - 1300 ms which
> > is really slow.
> > Same query with hl.bs.type=WORD takes from 8 - 45 ms
> >
> > is this normal behaviour or should I create issue?
> >
> > thanks, m.
> >
> 





[1] https://github.com/hlavki/solr-unified-highlighter-test
[2] https://github.com/hlavki/solr-unified-highlighter-test/blob/master/src/test/java/com/example/HighlightTest.java#L60
[3] mailto:m...@hlavki.eu


Re: unified highlighter performance in solr 8.5.1

2020-05-28 Thread Nándor Mátravölgyi
Hi!

I've not been able to delve into this issue deeply, but it could be
useful to know that "fragsizeIsMinimum" and "fragAlignRatio" are new
parameters which have behavior changing default values.

Leaving those with their default values makes the comparison between
8.4 and 8.5 like apples to oranges in a sense. To have the new UH
behave like the old one as closely as possible use:
fragsizeIsMinimum=true
fragAlignRatio=0
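
[Editor's illustration] As a request, those settings could be passed like this. The sketch only assembles a query string with the JDK. The class name, host, and collection name are hypothetical, and the values hl.fragsizeIsMinimum=true and hl.fragAlignRatio=0 are the ones this thread describes as the pre-8.5 behavior.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

class LegacyHighlightParams {

    /** Highlighting parameters intended to approximate pre-8.5 behavior. */
    static Map<String, String> legacyParams(String breakType) {
        Map<String, String> p = new LinkedHashMap<>();
        p.put("hl", "true");
        p.put("hl.method", "unified");
        p.put("hl.bs.type", breakType);        // SENTENCE or WORD
        p.put("hl.fragsizeIsMinimum", "true"); // previous behavior per this thread
        p.put("hl.fragAlignRatio", "0");       // previous behavior per this thread
        return p;
    }

    /** Encodes the parameters as a URL query string, keeping insertion order. */
    static String toQueryString(Map<String, String> params) {
        return params.entrySet().stream()
            .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
            .collect(Collectors.joining("&"));
    }

    public static void main(String[] args) {
        // Hypothetical host and collection; the query itself is omitted here.
        String base = "http://localhost:8983/solr/mycollection/select?";
        System.out.println(base + toQueryString(legacyParams("SENTENCE")));
    }
}
```

The same parameters can also be set as defaults on the request handler, so they do not have to be repeated on every query.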


Re: unified highlighter performance in solr 8.5.1

2020-05-27 Thread David Smiley
try setting hl.fragsizeIsMinimum=true
I did some benchmarking and found that this helps quite a bit


BTW I used the highlights.alg benchmark file, with some changes to make it
more reflective of your scenario -- offsets in postings, and used "enwiki"
(English Wikipedia) docs which are larger than the Reuters ones (so it
appears, anyway).  I had to do a bit of hacking to use the
LengthGoalBreakIterator, which wasn't previously used by this framework.

~ David


On Tue, May 26, 2020 at 4:42 PM Michal Hlavac  wrote:

> fine, I'll try to write a simple test, thanks
>
> On Tuesday, 26 May 2020 17:44:52 CEST David Smiley wrote:
> > Please create an issue.  I haven't reproduced it yet but it seems unlikely
> > to be user-error.
> >
> > ~ David
> >
> >
> > On Mon, May 25, 2020 at 9:28 AM Michal Hlavac  wrote:
> >
> > > Hi,
> > >
> > > I have field:
> > >  stored="true" indexed="false" storeOffsetsWithPositions="true"/>
> > >
> > > and configuration:
> > > true
> > > unified
> > > true
> > > content_txt_sk_highlight
> > > 2
> > > true
> > >
> > > Doing query with hl.bs.type=SENTENCE it takes around 1000 - 1300 ms which
> > > is really slow.
> > > Same query with hl.bs.type=WORD takes from 8 - 45 ms
> > >
> > > is this normal behaviour or should I create issue?
> > >
> > > thanks, m.
> > >
> >
>


Re: unified highlighter performance in solr 8.5.1

2020-05-26 Thread Michal Hlavac
fine, I'll try to write a simple test, thanks

On Tuesday, 26 May 2020 17:44:52 CEST David Smiley wrote:
> Please create an issue.  I haven't reproduced it yet but it seems unlikely
> to be user-error.
> 
> ~ David
> 
> 
> On Mon, May 25, 2020 at 9:28 AM Michal Hlavac  wrote:
> 
> > Hi,
> >
> > I have field:
> >  > stored="true" indexed="false" storeOffsetsWithPositions="true"/>
> >
> > and configuration:
> > true
> > unified
> > true
> > content_txt_sk_highlight
> > 2
> > true
> >
> > Doing query with hl.bs.type=SENTENCE it takes around 1000 - 1300 ms which
> > is really slow.
> > Same query with hl.bs.type=WORD takes from 8 - 45 ms
> >
> > is this normal behaviour or should I create issue?
> >
> > thanks, m.
> >
> 


Re: unified highlighter performance in solr 8.5.1

2020-05-26 Thread David Smiley
Please create an issue.  I haven't reproduced it yet but it seems unlikely
to be user-error.

~ David


On Mon, May 25, 2020 at 9:28 AM Michal Hlavac  wrote:

> Hi,
>
> I have field:
>  stored="true" indexed="false" storeOffsetsWithPositions="true"/>
>
> and configuration:
> true
> unified
> true
> content_txt_sk_highlight
> 2
> true
>
> Doing query with hl.bs.type=SENTENCE it takes around 1000 - 1300 ms which
> is really slow.
> Same query with hl.bs.type=WORD takes from 8 - 45 ms
>
> is this normal behaviour or should I create issue?
>
> thanks, m.
>


Re: unified highlighter performance in solr 8.5.1

2020-05-25 Thread Michal Hlavac
Yes, have no problems in 8.4.1, only 8.5.1
Also yes, those are multi page pdf files.

m.

On Monday, 25 May 2020 19:11:31 CEST David Smiley wrote:
> Wow that's terrible!
> So this problem is for SENTENCE in particular, and it's a regression in
> 8.5?  I'll see if I can reproduce this with the Lucene benchmark module.
> 
> I figure you have some meaty text, like "page" size or longer?
> 
> ~ David
> 
> 
> On Mon, May 25, 2020 at 10:38 AM Michal Hlavac  wrote:
> 
> > I did same test on solr 8.4.1 and response times are same for both
> > hl.bs.type=SENTENCE and hl.bs.type=WORD
> >
> > m.
> >
> > On Monday, 25 May 2020 15:28:24 CEST Michal Hlavac wrote:
> >
> >
> > Hi,
> >
> > I have field:
> >  > stored="true" indexed="false" storeOffsetsWithPositions="true"/>
> >
> > and configuration:
> > true
> > unified
> > true
> > content_txt_sk_highlight
> > 2
> > true
> >
> > Doing query with hl.bs.type=SENTENCE it takes around 1000 - 1300 ms which
> > is really slow.
> > Same query with hl.bs.type=WORD takes from 8 - 45 ms
> >
> > is this normal behaviour or should I create issue?
> >
> > thanks, m.
> >
> >
> >
> 


Re: unified highlighter performance in solr 8.5.1

2020-05-25 Thread David Smiley
Wow that's terrible!
So this problem is for SENTENCE in particular, and it's a regression in
8.5?  I'll see if I can reproduce this with the Lucene benchmark module.

I figure you have some meaty text, like "page" size or longer?

~ David


On Mon, May 25, 2020 at 10:38 AM Michal Hlavac  wrote:

> I did same test on solr 8.4.1 and response times are same for both
> hl.bs.type=SENTENCE and hl.bs.type=WORD
>
> m.
>
> On Monday, 25 May 2020 15:28:24 CEST Michal Hlavac wrote:
>
>
> Hi,
>
> I have field:
>  stored="true" indexed="false" storeOffsetsWithPositions="true"/>
>
> and configuration:
> true
> unified
> true
> content_txt_sk_highlight
> 2
> true
>
> Doing query with hl.bs.type=SENTENCE it takes around 1000 - 1300 ms which
> is really slow.
> Same query with hl.bs.type=WORD takes from 8 - 45 ms
>
> is this normal behaviour or should I create issue?
>
> thanks, m.
>
>
>


Re: unified highlighter performance in solr 8.5.1

2020-05-25 Thread Michal Hlavac
I did same test on solr 8.4.1 and response times are same for both 
hl.bs.type=SENTENCE and hl.bs.type=WORD

m.

On Monday, 25 May 2020 15:28:24 CEST Michal Hlavac wrote:


Hi,
 
I have field:

 
and configuration:
true
unified
true
content_txt_sk_highlight
2
true
 
Doing query with hl.bs.type=SENTENCE it takes around 1000 - 1300 ms which is 
really slow.
Same query with hl.bs.type=WORD takes from 8 - 45 ms
 
is this normal behaviour or should I create issue?
 
thanks, m. 




unified highlighter performance in solr 8.5.1

2020-05-25 Thread Michal Hlavac
Hi,

I have field:


and configuration:
true
unified
true
content_txt_sk_highlight
2
true

Doing query with hl.bs.type=SENTENCE it takes around 1000 - 1300 ms which is 
really slow.
Same query with hl.bs.type=WORD takes from 8 - 45 ms

is this normal behaviour or should I create issue?

thanks, m.