[jira] [Commented] (LUCENE-7589) Prevent outliers from raising the number of bits of everyone with numeric doc values
[ https://issues.apache.org/jira/browse/LUCENE-7589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15761076#comment-15761076 ]

ASF subversion and git services commented on LUCENE-7589:

Commit 3b182aa2fb3e4062f6ec5be819f3aa70aa2e523d in lucene-solr's branch refs/heads/feature/metrics from [~jpountz]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=3b182aa ]

LUCENE-7589: Prevent outliers from raising the bpv for everyone.

> Prevent outliers from raising the number of bits of everyone with numeric doc values
>
> Key: LUCENE-7589
> URL: https://issues.apache.org/jira/browse/LUCENE-7589
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Assignee: Adrien Grand
> Priority: Minor
> Fix For: master (7.0)
> Attachments: LUCENE-7589.patch
>
> Today we encode entire segments with a single number of bits per value. It
> was done this way because it was faster, but it also means a single outlier
> can significantly increase the space requirements. I think we should have
> protection against that.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-7589) Prevent outliers from raising the number of bits of everyone with numeric doc values
[ https://issues.apache.org/jira/browse/LUCENE-7589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15754558#comment-15754558 ]

Adrien Grand commented on LUCENE-7589:

As Mike predicted, this helped the NYC taxi bench a bit. Disk usage for the dropoff datetime field went from 194MB to 166MB: http://people.apache.org/~mikemccand/lucenebench/sparseResults.html#index_size_by_field
[jira] [Commented] (LUCENE-7589) Prevent outliers from raising the number of bits of everyone with numeric doc values
[ https://issues.apache.org/jira/browse/LUCENE-7589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15751785#comment-15751785 ]

ASF subversion and git services commented on LUCENE-7589:

Commit 3b182aa2fb3e4062f6ec5be819f3aa70aa2e523d in lucene-solr's branch refs/heads/master from [~jpountz]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=3b182aa ]

LUCENE-7589: Prevent outliers from raising the bpv for everyone.
[jira] [Commented] (LUCENE-7589) Prevent outliers from raising the number of bits of everyone with numeric doc values
[ https://issues.apache.org/jira/browse/LUCENE-7589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15741566#comment-15741566 ]

Michael McCandless commented on LUCENE-7589:

> However if you add a new field that stores the average number of miles per hour as a long doc values field, then it highlights the quality issues of this dataset and disk usage for this field goes from 40 to 15.7 bits per value (-60%) with the patch.

Ahhh, I see! The taxis that go faster than the speed of light are not apparent now since we don't store that field directly... makes sense.
[jira] [Commented] (LUCENE-7589) Prevent outliers from raising the number of bits of everyone with numeric doc values
[ https://issues.apache.org/jira/browse/LUCENE-7589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15741431#comment-15741431 ]

Adrien Grand commented on LUCENE-7589:

Thanks Mike for having a look. The patch does not show much reduction in spite of the quality issues of the dataset since existing fields tend not to have outliers. However if you add a new field that stores the average number of miles per hour as a long doc values field, then it highlights the quality issues of this dataset and disk usage for this field goes from 40 to 15.7 bits per value (-60%) with the patch.
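To make the outlier effect concrete, here is a minimal Java sketch. It is not the code from LUCENE-7589.patch; it only compares the cost of one shared bit width against a width chosen per block. The class name, the method names, the synthetic data, and the block size of 16384 are all assumptions made for illustration, and per-block header overhead is ignored.

```java
public class OutlierBitsSketch {

    /**
     * Bits needed to store any delta in [0, maxDelta]. A negative argument
     * means max - min overflowed a signed long, so all 64 bits are needed.
     */
    public static int bitsRequired(long maxDelta) {
        if (maxDelta < 0) {
            return 64;
        }
        return Math.max(1, 64 - Long.numberOfLeadingZeros(maxDelta));
    }

    /** Bits used by values[start, end) when they share one width after subtracting their minimum. */
    public static long bitsForRange(long[] values, int start, int end) {
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        for (int i = start; i < end; i++) {
            min = Math.min(min, values[i]);
            max = Math.max(max, values[i]);
        }
        return (long) (end - start) * bitsRequired(max - min);
    }

    /** One bit width for the whole array: a single outlier sets the width for everyone. */
    public static long singleWidthBits(long[] values) {
        return bitsForRange(values, 0, values.length);
    }

    /** One bit width per block (hypothetical scheme): an outlier only affects its own block. */
    public static long perBlockBits(long[] values, int blockSize) {
        long total = 0;
        for (int start = 0; start < values.length; start += blockSize) {
            int end = Math.min(values.length, start + blockSize);
            total += bitsForRange(values, start, end);
        }
        return total;
    }

    public static void main(String[] args) {
        // Synthetic data (assumption): small values plus one large outlier.
        long[] values = new long[1 << 16];
        for (int i = 0; i < values.length; i++) {
            values[i] = i % 100;
        }
        values[12345] = 1L << 39;

        double single = (double) singleWidthBits(values) / values.length;
        double blocked = (double) perBlockBits(values, 16384) / values.length;
        System.out.println("single width: " + single + " bits per value");
        System.out.println("per block:    " + blocked + " bits per value");
    }
}
```

With a shared width, every value pays for the outlier. With per-block widths, only the block containing the outlier does, which is the kind of protection the issue description asks for.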
[jira] [Commented] (LUCENE-7589) Prevent outliers from raising the number of bits of everyone with numeric doc values
[ https://issues.apache.org/jira/browse/LUCENE-7589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15736776#comment-15736776 ]

Michael McCandless commented on LUCENE-7589:

The patch looks great, just this minor typo:

{{ values for te next block.}} --> {{ values for the next block.}}

This seems to give ~3.7% reduction in the doc values disk used for sparse taxis!