[jira] [Commented] (LUCENE-9211) Adding compression to BinaryDocValues storage

2020-05-21 Thread Viral Gandhi (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17113384#comment-17113384
 ] 

Viral Gandhi commented on LUCENE-9211:
--

This change had a negative impact on our internal benchmarks when we 
tried to upgrade to Lucene 8.5.1. I have opened an issue for it: 
https://issues.apache.org/jira/browse/LUCENE-9378.

> Adding compression to BinaryDocValues storage
> -
>
> Key: LUCENE-9211
> URL: https://issues.apache.org/jira/browse/LUCENE-9211
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Mark Harwood
>Assignee: Mark Harwood
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 8.5
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> While SortedSetDocValues can be used today to store identical values in a 
> compact form, this is not effective for data with many unique values.
> The proposal is that BinaryDocValues should be stored in LZ4-compressed 
> blocks, which can dramatically reduce disk storage costs in many cases. The 
> values for a block of documents are stored as a single compressed 
> blob, along with metadata that records the offsets at which each document's 
> original value can be found in the uncompressed content.
> There's a trade-off here between efficient compression (more docs-per-block = 
> better compression) and fast retrieval times (fewer docs-per-block = faster 
> read access for single values). A fixed block size of 32 docs seems like it 
> would be a reasonable compromise for most scenarios.
> A PR is up for review here [https://github.com/apache/lucene-solr/pull/1234]
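
The block layout described above can be sketched roughly as follows. This is a minimal illustration and not the code from the PR: it uses Python's standard zlib module as a stand-in for LZ4, and the names DOCS_PER_BLOCK, write_blocks and read_value are hypothetical.

```python
import zlib

DOCS_PER_BLOCK = 32  # fixed block size suggested in the issue description


def write_blocks(values):
    """Group per-document byte strings into compressed blocks.

    Each block is an (offsets, blob) pair: offsets[i] is where the i-th value
    of the block starts in the uncompressed content, and offsets[-1] is the
    total uncompressed length.
    """
    blocks = []
    for start in range(0, len(values), DOCS_PER_BLOCK):
        chunk = values[start:start + DOCS_PER_BLOCK]
        offsets = [0]
        for value in chunk:
            offsets.append(offsets[-1] + len(value))
        # zlib is a stand-in here; the issue proposes LZ4.
        blob = zlib.compress(b"".join(chunk))
        blocks.append((offsets, blob))
    return blocks


def read_value(blocks, doc_id):
    """Return one document's value; the whole block must be decompressed."""
    offsets, blob = blocks[doc_id // DOCS_PER_BLOCK]
    content = zlib.decompress(blob)
    i = doc_id % DOCS_PER_BLOCK
    return content[offsets[i]:offsets[i + 1]]


values = [b"doc-%d-payload" % n for n in range(100)]
blocks = write_blocks(values)
```

Reading a single value decompresses every value in its block, which is the retrieval cost that the docs-per-block trade-off refers to.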



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9211) Adding compression to BinaryDocValues storage

2020-02-18 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17039104#comment-17039104
 ] 

ASF subversion and git services commented on LUCENE-9211:
-

Commit ce2959fe4cb1d1e77df04464c46004bf7846f6b5 in lucene-solr's branch 
refs/heads/master from markharwood
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=ce2959f ]

LUCENE-9211 Add compression for Binary doc value fields (#1234)

Stores groups of 32 binary doc values in LZ4-compressed blocks.
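
Because 32 is a power of two, the block holding a given value can be located with a shift and a mask. A small sketch, assuming every document has a value; the names BLOCK_SHIFT, locate and block_count are illustrative and are not taken from the commit:

```python
BLOCK_SHIFT = 5                      # 2**5 == 32 docs per block
DOCS_PER_BLOCK = 1 << BLOCK_SHIFT
BLOCK_MASK = DOCS_PER_BLOCK - 1


def locate(doc_id):
    """Map a doc ID to (block index, position within that block)."""
    return doc_id >> BLOCK_SHIFT, doc_id & BLOCK_MASK


def block_count(num_docs):
    """Number of blocks needed for num_docs values (the last may be partial)."""
    return (num_docs + DOCS_PER_BLOCK - 1) >> BLOCK_SHIFT
```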




[jira] [Commented] (LUCENE-9211) Adding compression to BinaryDocValues storage

2020-02-14 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17036806#comment-17036806
 ] 

Adrien Grand commented on LUCENE-9211:
--

I had a quick look at Juan's commit; there are things I like and things I have 
questions about. Since this PR is ready, or almost ready, I'd suggest merging 
this one first.

[~juan.duran] I saw that your commit tried to modify the current 
Lucene80DocValuesFormat. I'm a bit nervous about that because it makes it hard 
to spot any subtle difference in the on-disk format that could cause bugs, so 
I'd suggest creating a new Lucene85DocValuesFormat instead, even if it has the 
same ideas or even the same on-disk format as the current 
Lucene80DocValuesFormat.




[jira] [Commented] (LUCENE-9211) Adding compression to BinaryDocValues storage

2020-02-13 Thread Mark Harwood (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17036139#comment-17036139
 ] 

Mark Harwood commented on LUCENE-9211:
--

> the link did not work.
 

Sorry, formatting must have mangled my URL - this is the full link FWIW 
[https://github.com/apache/lucene-solr/blob/master/lucene/benchmark/conf/spatial.alg#L31]

Thanks for testing and good to know your tests showed little difference in 
performance.

What's your view on how best to proceed from here? Wait for Juan's PR to land 
before doing any more?

 

 




[jira] [Commented] (LUCENE-9211) Adding compression to BinaryDocValues storage

2020-02-12 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17035624#comment-17035624
 ] 

David Smiley commented on LUCENE-9211:
--

Thanks so much for running the benchmarks [~mharwood]!  You say you modified 
"this line", but the link did not work.  If you merely changed the default 
spatial.alg to use composite, then it only indexes point data, which is not 
realistic for this spatial strategy.  LUCENE-5579 has a spatial.alg file that 
converts those points to random circles, which would be more interesting.  I 
just diffed that spatial.alg against the default one and they are pretty 
similar overall.




[jira] [Commented] (LUCENE-9211) Adding compression to BinaryDocValues storage

2020-02-12 Thread juan camilo rodriguez duran (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17035224#comment-17035224
 ] 

juan camilo rodriguez duran commented on LUCENE-9211:
-

[~mharwood] the main idea of my PR is just to make the code cleaner and more 
extensible; it is not supposed to introduce any regression or improvement in 
the current format. (Spoiler alert: I'm working on an extension to improve 
sorted and sorted set doc values for lookups using BytesRef.)




[jira] [Commented] (LUCENE-9211) Adding compression to BinaryDocValues storage

2020-02-11 Thread Mark Harwood (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17034557#comment-17034557
 ] 

Mark Harwood commented on LUCENE-9211:
--

Thanks Juan and David for your comments.

I ran the spatial.alg test and modified [this 
line|[https://github.com/apache/lucene-solr/blob/master/lucene/benchmark/conf/spatial.alg#L31]]
 to use the "composite" strategy in order to exercise the Binary DV storage. I 
did four runs of master and PR 1234 and there wasn't a clear pattern of changes 
in speed.

 
||Master read recs/s||PR 1234 read recs/s||
|875.66|884.96|
|869.94|841.75|
|823.38|853.97|
|842.11|878.73|

 
||Master write docs/s||PR 1234 write docs/s||
|7,688.46|8,163.20|
|8,223.35|7,882.39|
|7,381.71|7,930.78|
|8,385.32|7,925|
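
One rough way to check for a pattern in these figures is to compare the means and see whether the per-run ranges overlap. The sketch below does only that; it is not a statistical test, and the helper name summarize is hypothetical.

```python
from statistics import mean

# Figures copied from the two tables above.
read = {
    "master": [875.66, 869.94, 823.38, 842.11],
    "pr": [884.96, 841.75, 853.97, 878.73],
}
write = {
    "master": [7688.46, 8223.35, 7381.71, 8385.32],
    "pr": [8163.20, 7882.39, 7930.78, 7925.0],
}


def summarize(runs):
    """Return the relative change of the PR mean and whether the ranges overlap."""
    m, p = runs["master"], runs["pr"]
    change = (mean(p) - mean(m)) / mean(m)
    overlap = min(p) <= max(m) and min(m) <= max(p)
    return change, overlap


for name, runs in (("read", read), ("write", write)):
    change, overlap = summarize(runs)
    print(f"{name}: PR mean differs by {change:+.1%}, ranges overlap: {overlap}")
```

For these numbers the mean differences are on the order of one percent and the ranges of the two sets of runs overlap, which is consistent with there being no clear pattern across four runs.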

 




[jira] [Commented] (LUCENE-9211) Adding compression to BinaryDocValues storage

2020-02-11 Thread juan camilo rodriguez duran (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17034350#comment-17034350
 ] 

juan camilo rodriguez duran commented on LUCENE-9211:
-

[~mharwood] here you will find a draft for the PR I'm preparing 
[https://github.com/juanka588/lucene-solr/commit/b7c8d14d53190753ea789c3fb3d299d3374c3677]




[jira] [Commented] (LUCENE-9211) Adding compression to BinaryDocValues storage

2020-02-08 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033125#comment-17033125
 ] 

David Smiley commented on LUCENE-9211:
--

This seems cool for some use-cases but I worry about the overhead for others.  
I think I have a benchmark module ".alg" file for SerializedDVStrategy in 
spatial-extras.  I should try it out on your PR.

I wish it was easier for us to let users toggle the choice of DocValuesFormat 
only for one type but not for others.  DocValuesFormat is really a format of 
formats, which is inflexible.  [~juan.duran], a colleague of mine, has been 
diving into this topic lately and I hope he shares it here (new issue of 
course).
