[jira] [Comment Edited] (PARQUET-1222) Definition of float and double sort order is ambigious

Zoltan Ivanfi (JIRA) Thu, 22 Feb 2018 01:42:26 -0800

    [ 
https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371529#comment-16371529
 ]


Zoltan Ivanfi edited comment on PARQUET-1222 at 2/22/18 9:41 AM:
-----------------------------------------------------------------

Moving the design alternatives to this comment so that the JIRA description 
focuses on the proposed solution:

We should explicitly require implementations to follow a specific comparison 
logic for these types. The candidates are:
 * The [Java 
implementation|http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/classes/java/lang/Double.java#l999]
 which looks easy and efficient to implement in any language.
 * The [IEEE 754 totalOrder 
predicate|https://github.com/rust-lang/rust/issues/5585] which is rather 
complicated.
 * The [IEEE 754-2008 min and max 
operations|https://en.wikipedia.org/wiki/IEEE_754_revision#min_and_max] which 
may be hard to use for comparison, so components could not use the same sorting 
order to achieve the smallest possible min/max ranges (although a regular sort 
would probably result in an almost optimal value order).
 * We could simply require NaNs to be ignored for calculating min/max. However, 
we should also explicitly address -0/+0 values in this case, which probably 
leads to the option above.

An additional problem is how to deal with existing data:
 * One possibility is to specify legacy rules, like "if the min or max is NaN, 
it should be ignored" or that "-0 and +0 should be considered equal for min/max 
purposes".
 * Another alternative is to deprecate `min_value` and `max_value` and 
introduce `yet_another_min` and `yet_another_max` fields instead (with nicer 
names, naturally). This could be combined with some legacy rules for the old 
field.
 * Probably the best solution would be to deprecate TypeDefinedOrder for 
doubles and floats and introduce a new TotalOrder. The legacy rule "if the min 
or max is NaN, it should be ignored" should apply to TypeDefinedOrder while the 
new TotalOrder would not have such restrictions. The default for writing 
doubles and floats would be the new TotalOrder.


was (Author: zi):
Moving the design alternatives to this comment so that the JIRA description 
focuses on the proposed solution:

We should explicitly require implementations to follow a specific comparison 
logic for these types. The candidates are:
 * The [Java 
implementation|http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/classes/java/lang/Double.java#l999]
 which looks easy and efficient to implement in any language.
 * The [IEEE 754 totalOrder 
predicate|https://github.com/rust-lang/rust/issues/5585] which is rather 
complicated to the extent that it is hard to tell whether the Java 
implementation adheres to it, so in effect this option may actually be the same 
as the one above.
 * The [IEEE 754-2008 min and max 
operations|https://en.wikipedia.org/wiki/IEEE_754_revision#min_and_max] which 
may be hard to use for comparison, so components could not use the same sorting 
order to achieve the smallest possible min/max ranges (although a regular sort 
would probably result in an almost optimal value order).
 * We could simply require NaNs to be ignored for calculating min/max. However, 
we should also explicitly address -0/+0 values in this case, which probably 
leads to the option above.

An additional problem is how to deal with existing data:
 * One possibility is to specify legacy rules, like "if the min or max is NaN, 
it should be ignored" or that "-0 and +0 should be considered equal for min/max 
purposes".
 * Another alternative is to deprecate `min_value` and `max_value` and 
introduce `yet_another_min` and `yet_another_max` fields instead (with nicer 
names, naturally). This could be combined with some legacy rules for the old 
field.
 * Probably the best solution would be to deprecate TypeDefinedOrder for 
doubles and floats and introduce a new TotalOrder. The legacy rule "if the min 
or max is NaN, it should be ignored" should apply to TypeDefinedOrder while the 
new TotalOrder would not have such restrictions. The default for writing 
doubles and floats would be the new TotalOrder.

> Definition of float and double sort order is ambigious
> ------------------------------------------------------
>
>                 Key: PARQUET-1222
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1222
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-format
>            Reporter: Zoltan Ivanfi
>            Priority: Critical
>             Fix For: format-2.5.0
>
>         Attachments: ordering.png
>
>
> Currently parquet-format specifies the sort order for floating point numbers 
> as follows:
> {code:java}
>    *   FLOAT - signed comparison of the represented value
>    *   DOUBLE - signed comparison of the represented value
> {code}
> The problem is that the comparison of floating point numbers is only a 
> partial ordering with strange behaviour in specific corner cases. For 
> example, according to IEEE 754, -0 is neither less nor more than \+0 and 
> comparing NaN to anything always returns false. This ordering is not suitable 
> for statistics. Additionally, the Java implementation already uses a 
> different (total) ordering that handles these cases correctly but differently 
> than the C\+\+ implementations, which leads to interoperability problems.
> TypeDefinedOrder for doubles and floats should be deprecated and a new 
> TotalFloatingPointOrder should be introduced. The default for writing doubles 
> and floats would be the new TotalFloatingPointOrder. The proposed ordering is 
> the following:
>  * -∞
>  * negative numbers in their natural order
>  * -0 and +0 in the same equivalence class \(!)
>  * positive numbers in their natural order
>  * +∞
>  * all NaN values, including the negative ones \(!), in the same equivalence 
> class \(!)
> This ordering should be effective and easy to implement in all programming 
> languages. A visual representation of the ordering of some example values:
> !ordering.png|width=640px!
> For reading existing stats created using TypeDefinedOrder, the following 
> compatibility rules should be applied:
> * When looking for NaN values, min and max should be ignored.
> * If the min is a NaN, it should be ignored.
> * If the max is a NaN, it should be ignored.
> * If the min is \+0, the row group may contain -0 values as well.
> * If the max is -0, the row group may contain \+0 values as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Comment Edited] (PARQUET-1222) Definition of float and double sort order is ambigious

Reply via email to