[jira] [Commented] (LUCENE-10616) Moving to dictionaries has made stored fields slower at skipping

2022-07-26 Thread fang hou (Jira)


[ https://issues.apache.org/jira/browse/LUCENE-10616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17571217#comment-17571217 ]

fang hou commented on LUCENE-10616:
---

I think this PR [https://github.com/apache/lucene/pull/1003] is ready for 
review. As Adrien advised above, this PR changes the {{decompress}} signature 
to return an {{InputStream}} so that decompression can happen lazily. Unlike 
returning {{STOP}} from {{StoredFieldVisitor#needsField}} (tried, but it seems 
impossible due to multi-valued fields, see the test case), this PR makes the 
skip method smarter: it bypasses unneeded compressed blocks by reading the 
compressed block length. So for a large unneeded field, we can save a lot of 
decompression time. This applies to both {{BEST_SPEED}} and 
{{HIGH_COMPRESSION}} modes, so this PR optimizes both modes with a preset 
dictionary. Could someone give some feedback? Thanks. cc [~jpountz] 
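The skip idea above can be sketched as follows. This is a toy illustration under assumed names and a simplified layout, not Lucene's actual code: it pretends each sub-block is stored as [int length][compressed bytes] and uses java.util.zip instead of Lucene's LZ4 codecs, but it shows how a length prefix lets the reader hop over blocks without decompressing them.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Toy illustration of the skip optimization: each sub-block is written as
// [int compressedLength][compressed bytes], so a reader can bypass blocks
// it does not need by reading only the length prefix.
public class LazyBlockReader {

  // Write blocks as length-prefixed deflate-compressed chunks.
  static byte[] writeBlocks(byte[][] blocks) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    DataOutputStream data = new DataOutputStream(out);
    for (byte[] block : blocks) {
      ByteArrayOutputStream compressed = new ByteArrayOutputStream();
      try (DeflaterOutputStream dos = new DeflaterOutputStream(compressed)) {
        dos.write(block);
      }
      data.writeInt(compressed.size());
      compressed.writeTo(data);
    }
    return out.toByteArray();
  }

  // Skip `n` blocks without decompressing them, then return a lazily
  // decompressing stream over the next block -- analogous to making
  // decompress return an InputStream.
  static InputStream openBlock(byte[] file, int n) throws IOException {
    DataInputStream in = new DataInputStream(new ByteArrayInputStream(file));
    for (int i = 0; i < n; i++) {
      int compressedLength = in.readInt();
      in.skipBytes(compressedLength); // bypass: no decompression happens here
    }
    in.readInt(); // length prefix of the block we actually want
    return new InflaterInputStream(in);
  }

  public static void main(String[] args) throws IOException {
    byte[] large = new byte[100_000]; // stand-in for the unneeded 100kB field
    byte[] small = "small field".getBytes();
    byte[] file = writeBlocks(new byte[][] { large, small });
    // Retrieve only the second block; the 100kB block is skipped cheaply.
    System.out.println(new String(openBlock(file, 1).readAllBytes()));
  }
}
```

The point of the sketch is that skipping costs one `readInt` plus a seek per unneeded block, while the old behavior paid full decompression for every block intersecting the document.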

> Moving to dictionaries has made stored fields slower at skipping
> 
>
> Key: LUCENE-10616
> URL: https://issues.apache.org/jira/browse/LUCENE-10616
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> [~ywelsch] has been digging into a regression of stored fields retrieval that 
> is caused by LUCENE-9486.
> Say your documents have two stored fields, one that is 100B and is stored 
> first, and the other one that is 100kB, and you are only interested in the 
> first one. While the idea behind blocks of stored fields is to store multiple 
> documents in the same block to leverage redundancy across documents, 
> sometimes documents are larger than the block size. As soon as documents are 
> larger than 2x the block size, our stored fields format splits such large 
> documents into multiple blocks, so that you wouldn't need to decompress 
> everything only to retrieve a couple small fields.
> Before LUCENE-9486, BEST_SPEED had a block size of 16kB, so only retrieving 
> the first field value would only need to decompress 16kB of data. With the 
> move to preset dictionaries in LUCENE-9486 and then LUCENE-9917, we now have 
> blocks of 80kB, so stored fields would now need to decompress 80kB of data, 
> 5x more than before.
> With dictionaries, our blocks are now split into 10 sub blocks. We happen to 
> eagerly decompress all sub blocks that intersect with the stored document, 
> which is why we would decompress 80kB of data, but this is an implementation 
> detail. It should be possible to decompress these sub blocks lazily so that 
> we would only decompress those that intersect with one of the field values 
> that the user is interested in retrieving.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10616) Moving to dictionaries has made stored fields slower at skipping

2022-07-04 Thread Adrien Grand (Jira)


[ https://issues.apache.org/jira/browse/LUCENE-10616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562072#comment-17562072 ]

Adrien Grand commented on LUCENE-10616:
---

Thanks [~joe hou] for giving it a try! The high-level idea looks good to me: 
somehow leverage information in the {{StoredFieldVisitor}} to only decompress 
the bits that matter. In terms of implementation, I would like to see if we can 
avoid introducing the new {{StoredFieldVisitor#hasMoreFieldsToVisit}} method 
and rely on {{StoredFieldVisitor#needsField}} returning {{STOP}} instead. The 
fact that decompressing data and decoding decompressed data are interleaved 
also makes the code harder to test. I wonder if we could change the signature 
of {{Decompressor#decompress}} to return an {{InputStream}} that decompresses 
data lazily, instead of filling a {{BytesRef}}, so that it's possible to stop 
decompressing early while still being able to test decompression and decoding 
in isolation?
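A minimal sketch of what the suggested signature change could look like. The interface name and method shape here are hypothetical (the real interface is Lucene's {{Decompressor}}, which fills a {{BytesRef}}); the sketch only demonstrates the property Adrien describes: with a stream, the caller can stop reading, and therefore stop decompressing, as soon as it has what it needs.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical shape of the proposed API: decompress() hands back a stream
// instead of filling a buffer eagerly.
interface StreamingDecompressor {
  InputStream decompress(InputStream compressed) throws IOException;
}

public class LazyDecompressDemo {

  // Read only the first n decompressed bytes; the rest is never inflated.
  static byte[] firstBytes(StreamingDecompressor d, byte[] compressed, int n)
      throws IOException {
    try (InputStream in = d.decompress(new ByteArrayInputStream(compressed))) {
      return in.readNBytes(n); // decompression stops early here
    }
  }

  public static void main(String[] args) throws IOException {
    // Compress 1 MB, then read back only the first 16 bytes.
    byte[] original = new byte[1 << 20];
    ByteArrayOutputStream compressed = new ByteArrayOutputStream();
    try (DeflaterOutputStream out = new DeflaterOutputStream(compressed)) {
      out.write(original);
    }
    StreamingDecompressor decompressor = InflaterInputStream::new;
    byte[] head = firstBytes(decompressor, compressed.toByteArray(), 16);
    System.out.println(head.length);
  }
}
```

This also gives the testability Adrien asks for: decompression (the `StreamingDecompressor`) and decoding (whatever parses the stream) can each be exercised in isolation.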




[jira] [Commented] (LUCENE-10616) Moving to dictionaries has made stored fields slower at skipping

2022-07-03 Thread fang hou (Jira)


[ https://issues.apache.org/jira/browse/LUCENE-10616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17561823#comment-17561823 ]

fang hou commented on LUCENE-10616:
---

hi [~jpountz] I tried to resolve this by changing the decompress loop in 
LZ4WithPresetDictCompressionMode to return early once we have all the fields we 
need. But I'm a little hesitant to do so because it may break the API contract 
of decompress: it may return fewer bytes than the caller expected. Besides, I'm 
not sure changing the logic in LZ4WithPresetDictCompressionMode is the right 
direction. Should this decompression optimization happen in 
Lucene90CompressingStoredFieldsReader instead (though I haven't found an easy 
way to do it there)? Here is a WIP PR to demo my current thoughts: 
[https://github.com/apache/lucene/pull/1003]. Please share your insights, thanks!
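The early-return idea, and its caveat, can be sketched like this. Names and layout are hypothetical stand-ins for the loop in LZ4WithPresetDictCompressionMode, using java.util.zip for the compression; the caveat from the comment is visible in the sketch: the returned buffer may be shorter than the full document, which is what breaks the documented decompress contract.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

public class EarlyReturnDemo {

  // Toy version of the per-sub-block decompress loop: stop after the
  // sub-block that completes `needed` bytes, leaving later sub-blocks
  // compressed. Caveat: the result may be shorter than the whole document.
  static byte[] decompressUpTo(byte[][] compressedSubBlocks, int needed) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (byte[] sub : compressedSubBlocks) {
      try (InflaterInputStream in = new InflaterInputStream(new ByteArrayInputStream(sub))) {
        out.write(in.readAllBytes());
      }
      if (out.size() >= needed) break; // early return: later sub-blocks stay compressed
    }
    return out.toByteArray();
  }

  static byte[] compress(byte[] data) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (DeflaterOutputStream dos = new DeflaterOutputStream(out)) {
      dos.write(data);
    }
    return out.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    byte[][] subBlocks = {
      compress("first field".getBytes()),
      compress(new byte[80_000]) // large trailing field we do not need
    };
    byte[] got = decompressUpTo(subBlocks, "first field".length());
    System.out.println(got.length); // only the first sub-block was inflated
  }
}
```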




[jira] [Commented] (LUCENE-10616) Moving to dictionaries has made stored fields slower at skipping

2022-06-28 Thread fang hou (Jira)

fang hou commented on LUCENE-10616:
---

hi Adrien Grand, if no one takes it, may I give it a try?