Zoltan Ivanfi created PARQUET-1225:
--------------------------------------

             Summary: NaN values may lead to incorrect filtering under certain 
circumstances
                 Key: PARQUET-1225
                 URL: https://issues.apache.org/jira/browse/PARQUET-1225
             Project: Parquet
          Issue Type: Task
          Components: parquet-cpp
            Reporter: Zoltan Ivanfi


_This JIRA describes a generic problem with floating point comparisons that 
*most probably* affects parquet-cpp. It is known to affect Impala, and a quick 
look at the parquet-cpp code suggests that it affects parquet-cpp as well, but 
this has not yet been confirmed in practice._

For comparing float and double values for min/max stats, parquet-cpp uses the 
C++ less-than operator (<), which returns false for any comparison involving a 
NaN. This means that while gathering statistics, if a NaN is the smallest value 
encountered so far (which is the case after reading the first value if that 
value is NaN), no other value can ever replace it, since < will always be 
false. On the other hand, if NaN is not the first value, it won't affect the 
min value. So the min value depends on the order of elements.
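
The order dependence described above can be reproduced with a minimal sketch. 
This is not the parquet-cpp implementation: NaiveMin and 
DemonstrateOrderDependence are hypothetical names, and the code only assumes 
that the running minimum is seeded with the first value and replaced whenever 
`v < min` holds.

```cpp
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

// Hypothetical helper, not parquet-cpp code: tracks a running minimum by
// seeding it with the first value and replacing it only when `v < min`.
// Assumes `values` is non-empty.
double NaiveMin(const std::vector<double>& values) {
  double min = values.front();
  for (double v : values) {
    if (v < min) min = v;  // always false when either operand is NaN
  }
  return min;
}

void DemonstrateOrderDependence() {
  const double nan = std::numeric_limits<double>::quiet_NaN();

  // NaN comes first: nothing can replace it, so the "min" stays NaN.
  assert(std::isnan(NaiveMin({nan, 1.0, 2.0})));

  // NaN comes later: `nan < 1.0` is false, so NaN is silently ignored.
  assert(NaiveMin({1.0, nan, 2.0}) == 1.0);
}
```

Both inputs hold the same set of values, yet the computed minimum differs 
depending only on where the NaN appears.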

When looking for specific values while reading the data back, a NaN min value 
may lead to row groups being incorrectly discarded even though they contain 
matching rows. For details, please see the Impala bug IMPALA-6527.
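
One way such an incorrect discard could come about is sketched below. This is 
an illustration under stated assumptions, not the actual Impala or parquet-cpp 
logic: ColumnStats, MayContain and DemonstrateIncorrectSkip are hypothetical 
names, and the reader is assumed to keep a row group only when 
`min <= target <= max` holds.

```cpp
#include <cassert>
#include <limits>

// Hypothetical min/max statistics for one column chunk of a row group.
struct ColumnStats {
  double min;
  double max;
};

// Hypothetical reader-side check (an assumption, not a real API): the row
// group is kept only if the target lies within [min, max].
bool MayContain(const ColumnStats& stats, double target) {
  return stats.min <= target && target <= stats.max;
}

void DemonstrateIncorrectSkip() {
  const double nan = std::numeric_limits<double>::quiet_NaN();

  // Suppose the row group holds {NaN, 1.0, 2.0} and the min got stuck at NaN.
  // The max is set to 2.0 for illustration; the outcome does not depend on
  // it, because the comparison against the NaN min already fails.
  const ColumnStats stats{nan, 2.0};

  // 1.0 is present in the data, but `NaN <= 1.0` is false, so the row group
  // would be skipped.
  assert(!MayContain(stats, 1.0));
}
```

With a correct min of 1.0 the same check would keep the row group, so the 
matching row is lost only because of the NaN in the statistics.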



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
