Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

2018-02-20 Thread Alexander Behm
Today, Impala does not evaluate " != " against stats, but as Zoltan pointed out there is a way to reasonably do that. It does not work if we ignore NaN though, so we need to be careful.

On Tue, Feb 20, 2018 at 9:24 AM, Zoltan Ivanfi wrote:
> In parquet-mr, if you are looking for a value that is

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

2018-02-20 Thread Zoltan Ivanfi
In parquet-mr, if you are looking for a value that is not equal to some reference value r and stats are min = r and max = r, then that row group is discarded, because there cannot be any other values in that row group.

On Tue, Feb 20, 2018 at 6:21 PM Jim Apple wrote:
> For that predicate in part
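
A minimal sketch of the pruning rule described above. It is not parquet-mr's actual code: the function name and the `may_contain_nan` flag are hypothetical. The flag illustrates the caveat raised in the reply: if min/max were computed while ignoring NaN, a NaN row still satisfies `!= r`, so the row group cannot safely be dropped.

```python
def can_skip_for_not_equal(stat_min, stat_max, r, may_contain_nan):
    """Return True if no row in the row group can satisfy `col != r`.

    Hypothetical helper; stat_min/stat_max are the row-group statistics.
    """
    # If NaNs were ignored when computing min/max, a NaN row may be hiding
    # behind the stats, and NaN != r is always true.
    if may_contain_nan:
        return False
    # min == max == r means every value covered by the stats equals r.
    return stat_min == r and stat_max == r
```

With `min = max = r = 3.0`, the row group is skipped only when the absence of NaN is known.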

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

2018-02-20 Thread Jim Apple
For that predicate in particular, does Impala use stats already? Let's say a column contains only the intuitive notion of floats: no NaNs, no infs, no -0.0. If we are filtering for $COL != a and the row-group stats are b <= $COL <= c, where a < b, we can know that the whole row group can be include
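
The case described here can be sketched as follows, under the same assumption the message states (only ordinary floats: no NaNs, no infs, no -0.0). The function name is hypothetical, and the symmetric `a > c` branch is an addition for illustration rather than something stated in the message.

```python
def all_rows_match_not_equal(a, b, c):
    """True if every value in [b, c] is guaranteed to differ from a.

    b and c are the row-group min and max; a is the filter constant.
    """
    # a lies strictly outside [b, c], so no stored value can equal it.
    return a < b or a > c
```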

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

2018-02-20 Thread Alexander Behm
On Mon, Feb 19, 2018 at 8:04 AM, Zoltan Ivanfi wrote:
> Hi,
>
> Tim, I added your suggestion to introduce a new ColumnOrder to PARQUET-1222
> as the preferred
> solution.
>
> Alex, not writing min/max if there is a NaN is indeed a feasible quic

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

2018-02-19 Thread Tim Armstrong
We could drop NaNs and require that -0 be normalised to +0 when writing out stats. That would remove any degrees of freedom from the writer and then straightforward comparison with =, <, >, >=, <=, != would work as expected.

On Mon, Feb 19, 2018 at 8:04 AM, Zoltan Ivanfi wrote:
> Hi,
>
> Tim, I
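
A rough sketch of what such a writer-side rule could look like. The helper name and the choice to return `None` for an all-NaN column are assumptions for illustration, not part of the proposal.

```python
import math

def write_stats(values):
    """Return (min, max) over non-NaN values, with -0.0 normalised to +0.0.

    Returns None when there are no non-NaN values. Hypothetical helper.
    """
    non_nan = [v for v in values if not math.isnan(v)]
    if not non_nan:
        return None
    # Adding +0.0 maps -0.0 to +0.0 and leaves every other value unchanged.
    return (min(non_nan) + 0.0, max(non_nan) + 0.0)
```

For the input `[1.5, NaN, -0.0]` this yields a min of +0.0 and a max of 1.5, so a reader using plain comparisons sees a single canonical zero.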

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

2018-02-19 Thread Zoltan Ivanfi
Hi, Tim, I added your suggestion to introduce a new ColumnOrder to PARQUET-1222 as the preferred solution. Alex, not writing min/max if there is a NaN is indeed a feasible quick-fix, but I think it would be better to just ignore NaN-s for the p

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

2018-02-16 Thread Jim Apple
On Fri, Feb 16, 2018 at 9:44 AM, Zoltan Borok-Nagy wrote:
> I would just like to mention that the fmax() / fmin() functions in C/C++
> Math library follow the aforementioned IEEE 754-2008 min and max
> specification:
> http://en.cppreference.com/w/c/numeric/math/fmax
>
> I think this behavior is a

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

2018-02-16 Thread Alexander Behm
On Fri, Feb 16, 2018 at 9:38 AM, Tim Armstrong wrote:
> The reader still can't correctly interpret those stats without knowing
> about the behaviour of that specific writer though, because it can't assume
> the absence of NaNs unless it knows that they are reading a file written by
> a writer tha

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

2018-02-16 Thread Zoltan Borok-Nagy
I would just like to mention that the fmax() / fmin() functions in the C/C++ Math library follow the aforementioned IEEE 754-2008 min and max specification: http://en.cppreference.com/w/c/numeric/math/fmax I think this behavior is also the most intuitive and useful with regard to statistics. If we want
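
To make the fmax() rule concrete, here is a small Python imitation of it (if exactly one argument is NaN, the other is returned). `fmax_like` is a hypothetical name and not a library function. For contrast, the sketch also records what Python's built-in `max()` does, which depends on argument order when NaN is involved.

```python
import math

def fmax_like(a, b):
    """Mimic C's fmax(): if exactly one argument is NaN, return the other."""
    if math.isnan(a):
        return b
    if math.isnan(b):
        return a
    return a if a > b else b

nan = float("nan")
# Built-in max() keeps its first argument unless a later one compares greater,
# so the result changes with the position of the NaN.
builtin_results = (max(nan, 1.0), max(1.0, nan))
# The fmax-style rule returns the non-NaN operand in both positions.
fmax_results = (fmax_like(nan, 1.0), fmax_like(1.0, nan))
```

Folding a column with the fmax-style rule therefore produces a NaN-free max whenever at least one non-NaN value exists.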

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

2018-02-16 Thread Tim Armstrong
The reader still can't correctly interpret those stats without knowing about the behaviour of that specific writer though, because it can't assume the absence of NaNs unless it knows that it is reading a file written by a writer that drops stats when it sees NaNs. It *could* fix the behaviour o

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

2018-02-16 Thread Lars Volker
Yeah, I missed that. We set it per column, so all other types could keep TypeDefinedOrder and floats could have something like NanAwareDoubleOrder.

On Fri, Feb 16, 2018 at 9:18 AM, Tim Armstrong wrote:
> We wouldn't need to rev the whole TypeDefinedOrder thing right? Couldn't we
> just define a

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

2018-02-16 Thread Alexander Behm
On Fri, Feb 16, 2018 at 9:15 AM, Tim Armstrong wrote:
> I don't see a major benefit to a temporary solution. The files are already
> out there and we need to implement a fix on the read path regardless. If we
> keep writing the stats there's at least some information contained in the
> stats that

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

2018-02-16 Thread Tim Armstrong
We wouldn't need to rev the whole TypeDefinedOrder thing right? Couldn't we just define a special order for floats? Essentially it would be a tag for writers to say "hey I know about this total order thing".

On Fri, Feb 16, 2018 at 9:14 AM, Lars Volker wrote:
> I think one idea behind the column

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

2018-02-16 Thread Tim Armstrong
I don't see a major benefit to a temporary solution. The files are already out there and we need to implement a fix on the read path regardless. If we keep writing the stats there's at least some information contained in the stats that readers can make use of, if they want to implement the required

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

2018-02-16 Thread Lars Volker
I think one idea behind the column order fields was that if a reader does not recognize a value there, it needs to ignore the stats. If I remember correctly, that was intended to allow us to add new orderings for collations, but it also seems useful to address gaps in the spec or known broken reade
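
The reader-side rule described here could look roughly like the sketch below. Representing column orders as strings is a simplification of the ColumnOrder union, and both `KNOWN_ORDERS` and `usable_stats` are hypothetical names.

```python
# Column orders this hypothetical reader understands.
KNOWN_ORDERS = {"TypeDefinedOrder"}

def usable_stats(column_order, stats):
    """Return stats only if the reader recognises the column order."""
    if column_order not in KNOWN_ORDERS:
        # Unknown ordering: behave as if the column had no stats at all.
        return None
    return stats
```

An older reader handed a column tagged with a newer order would then fall back to scanning the row group rather than misinterpreting min/max.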

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

2018-02-16 Thread Alexander Behm
I hope the common case is that data files do not contain these special float values. As the simplest solution, how about writers refrain from populating the stats if a special value is encountered? That fix does not preclude a more thorough solution in the future, but it addresses the common case
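
A minimal sketch of this writer rule, assuming "special value" means NaN only. Whether -0.0 or infinities would also count is not settled in this snippet, and the helper name is hypothetical.

```python
import math

def stats_or_none(values):
    """Return (min, max), or None if the column chunk contains a NaN."""
    if not values or any(math.isnan(v) for v in values):
        # Refrain from populating stats rather than write misleading ones.
        return None
    return (min(values), max(values))
```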

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

2018-02-16 Thread Tim Armstrong
There is an extensibility mechanism with the ColumnOrder union - I think that was meant to avoid the need to add new stat fields? Given that the bug was in the Parquet spec, we'll need to make a spec change anyway, so we could add a new ColumnOrder - FloatingPointTotalOrder? at the same time as fi

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

2018-02-16 Thread Jim Apple
> We could have a similar problem
> with not finding +0.0 values because a -0.0 is written to the max_value
> field by some component that considers them the same.

My hope is that the filtering would behave sanely, since -0.0 == +0.0 under the real-number-inspired ordering, which is distinguished

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

2018-02-16 Thread Zoltan Ivanfi
Hi, I don't think it would be worth keeping a separate NaN count, but we could ignore them when calculating min/max stats regardless. However, NaN is not the only value preventing total ordering. We could have a similar problem with not finding +0.0 values because a -0.0 is written to the max_valu
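
The signed-zero problem can be illustrated with a short sketch. `total_order_key` is a hypothetical helper using one common bit-level construction of a total order on doubles. It is an assumption for illustration, not something the spec or any implementation is confirmed to use.

```python
import struct

def total_order_key(x):
    """Map a double to an int that sorts in a bitwise total order."""
    (bits,) = struct.unpack(">q", struct.pack(">d", x))
    # Negative doubles have the sign bit set; flip the remaining bits so
    # that larger magnitudes sort lower.
    return bits ^ 0x7FFFFFFFFFFFFFFF if bits < 0 else bits

max_value = -0.0  # written by a component that treats -0.0 and +0.0 as equal
probe = 0.0       # the value a reader is searching for

# Numeric comparison: -0.0 == +0.0, so the probe is within the stats range.
found_by_numeric_compare = probe <= max_value
# Total-order comparison: +0.0 sorts after -0.0, so the probe looks out of range.
found_by_total_order = total_order_key(probe) <= total_order_key(max_value)
```

The two comparisons disagree, which is the situation where a reader could wrongly skip a row group containing +0.0.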

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

2018-02-15 Thread Tim Armstrong
We could also consider treating NaN similar to NULL and having a separate piece of information with a count of NaN values (or just a bit indicating presence/absence of NaN). I'm not sure if that is easier or harder to implement than a total order.

On Thu, Feb 15, 2018 at 9:12 AM, Laszlo Gaal wrot
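
As a sketch of this idea, stats could carry a NaN count next to min/max, much like a null count. The dictionary keys and function name are hypothetical and do not correspond to existing Parquet fields.

```python
import math

def stats_with_nan_count(values):
    """Return min/max over non-NaN values plus a separate NaN count."""
    non_nan = [v for v in values if not math.isnan(v)]
    return {
        "min": min(non_nan) if non_nan else None,
        "max": max(non_nan) if non_nan else None,
        "nan_count": len(values) - len(non_nan),
    }
```

A reader could then use min/max for ordinary range predicates and consult `nan_count` separately for predicates that NaN rows would satisfy.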

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

2018-02-15 Thread Laszlo Gaal
To supply some context: Impala has had a number of issues around NaN/infinity: The closest precedent related to the current issue seems to be IMPALA-6295 :