[jira] [Commented] (LUCENE-9486) Explore using preset dictionaries with LZ4 for stored fields

2020-10-26 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17220560#comment-17220560
 ] 

Adrien Grand commented on LUCENE-9486:
--

As the JVM now allows switching the zlib implementation, I've been running
tests comparing my system implementation (vanilla zlib 1.2.11 as shipped by
Ubuntu) with the Cloudflare fork (https://github.com/cloudflare/zlib, compiled
from source). The latter gives significant speedups. I included the stored
fields sizes for completeness, but they should not be relevant: the
difference, if I understand it correctly, comes from a faster hashing function
that retains the properties we want for compression. The byte sequences that
cause collisions are different, so the compression ratio might be slightly
higher or slightly lower depending on the dataset.

|| Dataset || zlib || Stored fields size (MB) || Indexing (s) || Document lookup (us) ||
| 1M highly compressible nginx access logs | system | 64.9 | 12.3 | 42.5 |
| 1M highly compressible nginx access logs | Cloudflare | 64.5 | 9.9 | 28.4 |
| 50k enwiki docs | system | 337.1 | 32.9 | 223.9 |
| 50k enwiki docs | Cloudflare | 337.1 | 22.2 | 168.3 |
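
For anyone wanting to try a similar comparison outside of Lucene, here is a
minimal sketch (not the benchmark used for the table above) that times
java.util.zip.Deflater and Inflater on synthetic, nginx-like log lines. The
class name ZlibTiming, the helper names and the synthetic data are all made up
for illustration. Since java.util.zip delegates to a native zlib, running the
same program against two zlib builds should expose the difference; how the JVM
is pointed at an alternative build depends on the JDK and platform and is not
covered here.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

class ZlibTiming {

  /** Builds deterministic, highly repetitive log-like data (hypothetical format). */
  static byte[] sampleLogs(int lines) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < lines; i++) {
      sb.append("192.168.0.").append(i % 256)
          .append(" - - \"GET /page/").append(i % 1000)
          .append(" HTTP/1.1\" 200 ").append(1000 + i % 500).append('\n');
    }
    return sb.toString().getBytes(StandardCharsets.UTF_8);
  }

  /** Compresses the whole input with DEFLATE at the given level. */
  static byte[] deflate(byte[] input, int level) {
    Deflater deflater = new Deflater(level);
    try {
      deflater.setInput(input);
      deflater.finish();
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      byte[] buf = new byte[8192];
      while (!deflater.finished()) {
        int n = deflater.deflate(buf);
        out.write(buf, 0, n);
      }
      return out.toByteArray();
    } finally {
      deflater.end(); // release native zlib state
    }
  }

  /** Decompresses data whose original length is known up front. */
  static byte[] inflate(byte[] compressed, int originalLength) {
    Inflater inflater = new Inflater();
    try {
      inflater.setInput(compressed);
      byte[] out = new byte[originalLength];
      int off = 0;
      while (off < out.length) {
        int n = inflater.inflate(out, off, out.length - off);
        if (n == 0) {
          throw new IllegalStateException("truncated or corrupt input");
        }
        off += n;
      }
      return out;
    } catch (DataFormatException e) {
      throw new IllegalStateException(e);
    } finally {
      inflater.end();
    }
  }

  public static void main(String[] args) {
    byte[] data = sampleLogs(200_000);
    long bestDeflate = Long.MAX_VALUE;
    long bestInflate = Long.MAX_VALUE;
    int compressedLength = 0;
    // Keep the best of several rounds to reduce JIT warm-up noise.
    for (int round = 0; round < 10; round++) {
      long t0 = System.nanoTime();
      byte[] compressed = deflate(data, Deflater.DEFAULT_COMPRESSION);
      long t1 = System.nanoTime();
      inflate(compressed, data.length);
      long t2 = System.nanoTime();
      bestDeflate = Math.min(bestDeflate, t1 - t0);
      bestInflate = Math.min(bestInflate, t2 - t1);
      compressedLength = compressed.length;
    }
    System.out.println("raw bytes:        " + data.length);
    System.out.println("compressed bytes: " + compressedLength);
    System.out.println("best deflate ms:  " + bestDeflate / 1_000_000.0);
    System.out.println("best inflate ms:  " + bestInflate / 1_000_000.0);
  }
}
```

This only measures raw zlib throughput; the indexing and document lookup
numbers in the table also include the rest of the stored fields code path.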



> Explore using preset dictionaries with LZ4 for stored fields
> 
>
> Key: LUCENE-9486
> URL: https://issues.apache.org/jira/browse/LUCENE-9486
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Priority: Minor
> Fix For: 8.7
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> Follow-up of LUCENE-9447: using preset dictionaries with DEFLATE provided 
> very significant gains. Adding support for preset dictionaries with LZ4 would 
> be easy so let's give it a try?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9486) Explore using preset dictionaries with LZ4 for stored fields

2020-09-16 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17196826#comment-17196826
 ] 

ASF subversion and git services commented on LUCENE-9486:
-

Commit 78b8a0ae39fd7fe1d349edd4f6b1b946df1fd759 in lucene-solr's branch 
refs/heads/branch_8x from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=78b8a0a ]

LUCENE-9486: Use ByteBuffersDataOutput to collect data like on master.





[jira] [Commented] (LUCENE-9486) Explore using preset dictionaries with LZ4 for stored fields

2020-09-03 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190049#comment-17190049
 ] 

ASF subversion and git services commented on LUCENE-9486:
-

Commit 4cedd92dee0ad1e9e3c8f655574d2af8ab1abd37 in lucene-solr's branch 
refs/heads/branch_8x from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=4cedd92 ]

LUCENE-9486: Use preset dictionaries with LZ4 for BEST_SPEED. (#1793)





[jira] [Commented] (LUCENE-9486) Explore using preset dictionaries with LZ4 for stored fields

2020-09-03 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190048#comment-17190048
 ] 

ASF subversion and git services commented on LUCENE-9486:
-

Commit 73371cb4b6365c4aca2700c2e14e20cdbf1e0c12 in lucene-solr's branch 
refs/heads/master from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=73371cb ]

LUCENE-9486: Fix TestTieredMergePolicy failure.





[jira] [Commented] (LUCENE-9486) Explore using preset dictionaries with LZ4 for stored fields

2020-09-03 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190029#comment-17190029
 ] 

ASF subversion and git services commented on LUCENE-9486:
-

Commit 27aa5c5f59e8cb03316efa504f0351decd41d61c in lucene-solr's branch 
refs/heads/master from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=27aa5c5 ]

LUCENE-9486: Use preset dictionaries with LZ4 for BEST_SPEED. (#1793)






[jira] [Commented] (LUCENE-9486) Explore using preset dictionaries with LZ4 for stored fields

2020-08-27 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17186038#comment-17186038
 ] 

Robert Muir commented on LUCENE-9486:
-

+1




[jira] [Commented] (LUCENE-9486) Explore using preset dictionaries with LZ4 for stored fields

2020-08-27 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17185904#comment-17185904
 ] 

Adrien Grand commented on LUCENE-9486:
--

I played with various configurations and ended up with a preset dictionary of 
4kB combined with 10 sub blocks of 60kB, which gives interesting results. Here 
are some benchmarks on the same datasets as LUCENE-9447:

On highly compressible JSON logs:

||Method||Index size (MB)||Index time (s)||Avg fetch time (us)||
|LZ4(16kB) (current BEST_SPEED)|304.2|9|5|
|LZ4(60kB)|141.7|7.5|10|
|LZ4(256kB)|105.1|7.5|33|
|LZ4(1MB)|96.5|7.5|115|
|LZ4 with preset dict (new BEST_SPEED)|91.9|7.5|16|
|Deflate with preset dict (new BEST_COMPRESSION)|64.9|14|41|

On enwiki documents:

||Method||Index size (MB)||Index time (s)||Avg fetch time (us)||
|LZ4(16kB) (current BEST_SPEED)|558.8|14.5|83|
|LZ4(60kB)|526.2|15|120|
|LZ4(256kB)|523.1|15|323|
|LZ4(1MB)|521.3|15.5|1151|
|LZ4 with preset dict (new BEST_SPEED)|515.2|15|135|
|Deflate with preset dict (new BEST_COMPRESSION)|338.0|35|250|

It makes fetch times somewhat slower (16us vs. 5us on the logs, 135us vs. 83us
on enwiki), which I think is fair given that these fetch times are still well
under the cost of a page fault. Indexing remains as fast as today, and
compression gets respectively 3.3x and 8% better on these datasets.
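
As a quick check of those two figures, the arithmetic can be reproduced from
the index sizes in the tables above (current BEST_SPEED vs. LZ4 with preset
dict). The class and method names below are made up for illustration.

```java
import java.util.Locale;

class CompressionGain {

  /** Returns how many times larger the old index is than the new one. */
  static double ratio(double oldSizeMb, double newSizeMb) {
    return oldSizeMb / newSizeMb;
  }

  public static void main(String[] args) {
    // Index sizes in MB, copied from the two benchmark tables.
    double logs = ratio(304.2, 91.9);
    double enwiki = ratio(558.8, 515.2);
    System.out.printf(Locale.ROOT, "JSON logs: %.1fx better%n", logs);
    System.out.printf(Locale.ROOT, "enwiki:    %.0f%% better%n", (enwiki - 1) * 100);
  }
}
```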

I also included the results with BEST_COMPRESSION in the above benchmarks to 
show the trade-off that users are making when going with one versus the other.
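
To make the layout described at the top of this comment more concrete (a 4kB
preset dictionary followed by sub blocks of 60kB), here is a minimal sketch of
the idea. It is not the code from the pull request: the JDK has no LZ4, so the
sketch swaps in java.util.zip.Deflater and Inflater with setDictionary, and it
stores the dictionary uncompressed for simplicity. The class PresetDictBlocks
and all of its members are hypothetical names. The point it illustrates is
that fetching a document only requires the dictionary plus the one sub block
that contains it, rather than the whole block.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

class PresetDictBlocks {
  static final int DICT_LENGTH = 4 * 1024;       // 4kB dictionary
  static final int SUB_BLOCK_LENGTH = 60 * 1024; // 60kB sub blocks

  /** Result of compressing one block: raw dictionary plus compressed sub blocks. */
  static final class Compressed {
    byte[] dictionary;
    final List<byte[]> subBlocks = new ArrayList<>();
    final List<Integer> originalLengths = new ArrayList<>();
  }

  /** Uses the first bytes of the block as dictionary for every following sub block. */
  static Compressed compress(byte[] block) {
    Compressed result = new Compressed();
    int dictLen = Math.min(DICT_LENGTH, block.length);
    result.dictionary = Arrays.copyOfRange(block, 0, dictLen);
    byte[] buf = new byte[8192];
    for (int off = dictLen; off < block.length; off += SUB_BLOCK_LENGTH) {
      int len = Math.min(SUB_BLOCK_LENGTH, block.length - off);
      Deflater deflater = new Deflater(Deflater.BEST_SPEED);
      try {
        deflater.setDictionary(block, 0, dictLen); // prime with the shared dictionary
        deflater.setInput(block, off, len);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while (!deflater.finished()) {
          out.write(buf, 0, deflater.deflate(buf));
        }
        result.subBlocks.add(out.toByteArray());
        result.originalLengths.add(len);
      } finally {
        deflater.end();
      }
    }
    return result;
  }

  /** Decompresses a single sub block; the other sub blocks are never touched. */
  static byte[] fetchSubBlock(Compressed c, int index) {
    byte[] out = new byte[c.originalLengths.get(index)];
    Inflater inflater = new Inflater();
    try {
      inflater.setInput(c.subBlocks.get(index));
      int off = 0;
      while (off < out.length) {
        int n = inflater.inflate(out, off, out.length - off);
        if (n == 0) {
          if (inflater.needsDictionary()) {
            inflater.setDictionary(c.dictionary, 0, c.dictionary.length);
          } else {
            throw new IllegalStateException("truncated or corrupt sub block");
          }
        }
        off += n;
      }
      return out;
    } catch (DataFormatException e) {
      throw new IllegalStateException(e);
    } finally {
      inflater.end();
    }
  }

  public static void main(String[] args) {
    byte[] block = new byte[DICT_LENGTH + 10 * SUB_BLOCK_LENGTH];
    byte[] pattern = "{\"status\":200,\"path\":\"/index.html\"}\n".getBytes();
    for (int i = 0; i < block.length; i++) {
      block[i] = pattern[i % pattern.length];
    }
    Compressed c = compress(block);
    byte[] third = fetchSubBlock(c, 2);
    System.out.println("sub blocks: " + c.subBlocks.size());
    System.out.println("fetched bytes from sub block 2: " + third.length);
  }
}
```

Smaller sub blocks keep fetches cheap, while the shared dictionary recovers
part of the compression ratio that would otherwise be lost by compressing each
sub block in isolation.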
