[jira] [Commented] (LUCENE-7589) Prevent outliers from raising the number of bits of everyone with numeric doc values

2016-12-19 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15761076#comment-15761076
 ] 

ASF subversion and git services commented on LUCENE-7589:
-

Commit 3b182aa2fb3e4062f6ec5be819f3aa70aa2e523d in lucene-solr's branch 
refs/heads/feature/metrics from [~jpountz]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=3b182aa ]

LUCENE-7589: Prevent outliers from raising the bpv for everyone.


> Prevent outliers from raising the number of bits of everyone with numeric doc 
> values
> 
>
> Key: LUCENE-7589
> URL: https://issues.apache.org/jira/browse/LUCENE-7589
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Minor
> Fix For: master (7.0)
>
> Attachments: LUCENE-7589.patch
>
>
> Today we encode entire segments with a single number of bits per value. It 
> was done this way because it was faster, but it also means a single outlier 
> can significantly increase the space requirements. I think we should have 
> protection against that.
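
To illustrate the problem described above, here is a minimal sketch (not the code from LUCENE-7589.patch) comparing one bit width for a whole segment against one bit width per fixed-size block. The class name, method names, block size of 128 and sample data are all assumptions made for illustration, and per-block metadata overhead is ignored.

```java
// Hypothetical sketch: names, block size and data are illustrative assumptions,
// not Lucene APIs and not the code from the attached patch.
final class OutlierBpvSketch {

    /** Number of bits needed to store an unsigned value (0 needs 0 bits here). */
    public static int bitsRequired(long unsignedValue) {
        return 64 - Long.numberOfLeadingZeros(unsignedValue);
    }

    /** Bits used when values[from, to) are stored as deltas from their minimum with one shared width. */
    public static long bitsForRange(long[] values, int from, int to) {
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        for (int i = from; i < to; i++) {
            min = Math.min(min, values[i]);
            max = Math.max(max, values[i]);
        }
        return (long) bitsRequired(max - min) * (to - from);
    }

    /** One bit width for every value: a single outlier widens all of them. */
    public static long bitsSingleWidth(long[] values) {
        return bitsForRange(values, 0, values.length);
    }

    /** One bit width per block: an outlier only widens the block that contains it. */
    public static long bitsPerBlock(long[] values, int blockSize) {
        long total = 0;
        for (int from = 0; from < values.length; from += blockSize) {
            int to = Math.min(values.length, from + blockSize);
            total += bitsForRange(values, from, to);
        }
        return total;
    }

    public static void main(String[] args) {
        long[] values = new long[1024];
        for (int i = 0; i < values.length; i++) {
            values[i] = i % 16;          // small values, 4 bits each
        }
        values[0] = 1L << 39;            // a single outlier that needs 40 bits
        System.out.println("single width: " + bitsSingleWidth(values) + " bits");
        System.out.println("per block:    " + bitsPerBlock(values, 128) + " bits");
    }
}
```

In this sketch the outlier forces 40 bits on every value when a single width is used, while with per-block widths only the block containing the outlier pays that cost and the remaining blocks stay at 4 bits per value.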



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7589) Prevent outliers from raising the number of bits of everyone with numeric doc values

2016-12-16 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15754558#comment-15754558
 ] 

Adrien Grand commented on LUCENE-7589:
--

As Mike predicted, this helped the NYC taxi benchmark a bit. Disk usage for the 
dropoff datetime field went from 194MB to 166MB: 
http://people.apache.org/~mikemccand/lucenebench/sparseResults.html#index_size_by_field






[jira] [Commented] (LUCENE-7589) Prevent outliers from raising the number of bits of everyone with numeric doc values

2016-12-15 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15751785#comment-15751785
 ] 

ASF subversion and git services commented on LUCENE-7589:
-

Commit 3b182aa2fb3e4062f6ec5be819f3aa70aa2e523d in lucene-solr's branch 
refs/heads/master from [~jpountz]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=3b182aa ]

LUCENE-7589: Prevent outliers from raising the bpv for everyone.





[jira] [Commented] (LUCENE-7589) Prevent outliers from raising the number of bits of everyone with numeric doc values

2016-12-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15741566#comment-15741566
 ] 

Michael McCandless commented on LUCENE-7589:


bq. However if you add a new field that stores the average number of miles per 
hour as a long doc values field, then it highlights the quality issues of this 
dataset and disk usage for this field goes from 40 to 15.7 bits per value 
(-60%) with the patch.

Ahhh, I see!  The taxis that go faster than the speed of light are not apparent 
now since we don't store that field directly... makes sense.




[jira] [Commented] (LUCENE-7589) Prevent outliers from raising the number of bits of everyone with numeric doc values

2016-12-12 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15741431#comment-15741431
 ] 

Adrien Grand commented on LUCENE-7589:
--

Thanks, Mike, for having a look. The patch does not show much reduction despite 
the quality issues of the dataset, since existing fields tend not to have 
outliers. However if you add a new field that stores the average number of 
miles per hour as a long doc values field, then it highlights the quality 
issues of this dataset and disk usage for this field goes from 40 to 15.7 bits 
per value (-60%) with the patch.




[jira] [Commented] (LUCENE-7589) Prevent outliers from raising the number of bits of everyone with numeric doc values

2016-12-09 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15736776#comment-15736776
 ] 

Michael McCandless commented on LUCENE-7589:


The patch looks great, just this minor typo:

{{ values for te next block.}} --> {{ values for the next block.}}

This seems to give a ~3.7% reduction in doc values disk usage for the sparse taxis benchmark!
