[jira] [Commented] (PARQUET-1225) NaN values may lead to incorrect filtering under certain circumstances

2018-02-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16376502#comment-16376502
 ] 

ASF GitHub Bot commented on PARQUET-1225:
-

zivanfi commented on issue #444: PARQUET-1225: NaN values may lead to incorrect 
filtering under certai…
URL: https://github.com/apache/parquet-cpp/pull/444#issuecomment-368420140
 
 
   I agree, thanks for your efforts!


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> NaN values may lead to incorrect filtering under certain circumstances
> --
>
> Key: PARQUET-1225
> URL: https://issues.apache.org/jira/browse/PARQUET-1225
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-cpp
>Reporter: Zoltan Ivanfi
>Assignee: Deepak Majeti
>Priority: Major
> Fix For: cpp-1.4.0
>
>
> _This JIRA describes a generic problem with floating point comparisons that 
> *most probably* affects parquet-cpp. It is known to affect Impala and by 
> taking a quick look at the parquet-cpp code it seems to affect parquet-cpp as 
> well, but it has not yet been confirmed in practice._
> For comparing float and double values for min/max stats, parquet-cpp uses the 
> C++ less-than operator (<) that returns false for comparisons involving a 
> NaN. This means that while garthering statistics, if a NaN is the smallest 
> value encountered so far (which happens to be the case after reading the 
> first value if that value is NaN), no other value can ever replace it, since 
> < will always be false. On the other hand, if NaN is not the first value, it 
> won't affect the min value. So the min value depends on the order of elements.
> If looking for specific values while reading back the data, the NaN value may 
> lead to row groups being incorrectly discarded in spite of having matching 
> rows. For details, please see the Impala bug IMPALA-6527.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1225) NaN values may lead to incorrect filtering under certain circumstances

2018-02-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16376315#comment-16376315
 ] 

ASF GitHub Bot commented on PARQUET-1225:
-

majetideepak commented on issue #444: PARQUET-1225: NaN values may lead to 
incorrect filtering under certai…
URL: https://github.com/apache/parquet-cpp/pull/444#issuecomment-368362908
 
 
   I think we are ready to make a new RC. thanks!


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> NaN values may lead to incorrect filtering under certain circumstances
> --
>
> Key: PARQUET-1225
> URL: https://issues.apache.org/jira/browse/PARQUET-1225
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-cpp
>Reporter: Zoltan Ivanfi
>Assignee: Deepak Majeti
>Priority: Major
> Fix For: cpp-1.4.0
>
>
> _This JIRA describes a generic problem with floating point comparisons that 
> *most probably* affects parquet-cpp. It is known to affect Impala and by 
> taking a quick look at the parquet-cpp code it seems to affect parquet-cpp as 
> well, but it has not yet been confirmed in practice._
> For comparing float and double values for min/max stats, parquet-cpp uses the 
> C++ less-than operator (<) that returns false for comparisons involving a 
> NaN. This means that while garthering statistics, if a NaN is the smallest 
> value encountered so far (which happens to be the case after reading the 
> first value if that value is NaN), no other value can ever replace it, since 
> < will always be false. On the other hand, if NaN is not the first value, it 
> won't affect the min value. So the min value depends on the order of elements.
> If looking for specific values while reading back the data, the NaN value may 
> lead to row groups being incorrectly discarded in spite of having matching 
> rows. For details, please see the Impala bug IMPALA-6527.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1225) NaN values may lead to incorrect filtering under certain circumstances

2018-02-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16375715#comment-16375715
 ] 

ASF GitHub Bot commented on PARQUET-1225:
-

xhochy closed pull request #444: PARQUET-1225: NaN values may lead to incorrect 
filtering under certai…
URL: https://github.com/apache/parquet-cpp/pull/444
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/src/parquet/statistics-test.cc b/src/parquet/statistics-test.cc
index 1bbef26e..54740166 100644
--- a/src/parquet/statistics-test.cc
+++ b/src/parquet/statistics-test.cc
@@ -659,5 +659,124 @@ TEST_F(TestStatisticsFLBA, UnknownSortOrder) {
   ASSERT_FALSE(cc_metadata->is_stats_set());
 }
 
+// PARQUET-1225: Float NaN values may lead to incorrect filtering under certain
+// circumstances
+TEST(TestStatisticsFloatNaN, NaNValues) {
+  constexpr int NUM_VALUES = 10;
+  NodePtr node = PrimitiveNode::Make("nan_float", Repetition::OPTIONAL, 
Type::FLOAT);
+  ColumnDescriptor descr(node, 1, 1);
+  float values[NUM_VALUES] = {std::nanf(""), -4.0f, -3.0f, -2.0f, -1.0f,
+  std::nanf(""), 1.0f,  2.0f,  3.0f,  
std::nanf("")};
+  float nan_values[NUM_VALUES];
+  for (int i = 0; i < NUM_VALUES; i++) {
+nan_values[i] = std::nanf("");
+  }
+
+  // Test values
+  TypedRowGroupStatistics nan_stats();
+  nan_stats.Update([0], NUM_VALUES, 0);
+  float min = nan_stats.min();
+  float max = nan_stats.max();
+  ASSERT_EQ(min, -4.0f);
+  ASSERT_EQ(max, 3.0f);
+
+  // Test all NaNs
+  TypedRowGroupStatistics all_nan_stats();
+  all_nan_stats.Update(_values[0], NUM_VALUES, 0);
+  min = all_nan_stats.min();
+  max = all_nan_stats.max();
+  ASSERT_TRUE(std::isnan(min));
+  ASSERT_TRUE(std::isnan(max));
+
+  // Test values followed by all NaNs
+  nan_stats.Update(_values[0], NUM_VALUES, 0);
+  min = nan_stats.min();
+  max = nan_stats.max();
+  ASSERT_EQ(min, -4.0f);
+  ASSERT_EQ(max, 3.0f);
+
+  // Test all NaNs followed by values
+  all_nan_stats.Update([0], NUM_VALUES, 0);
+  min = all_nan_stats.min();
+  max = all_nan_stats.max();
+  ASSERT_EQ(min, -4.0f);
+  ASSERT_EQ(max, 3.0f);
+
+  // Test values followed by all NaNs followed by values
+  nan_stats.Update([0], NUM_VALUES, 0);
+  min = nan_stats.min();
+  max = nan_stats.max();
+  ASSERT_EQ(min, -4.0f);
+  ASSERT_EQ(max, 3.0f);
+}
+
+// PARQUET-1225: Float NaN values may lead to incorrect filtering under certain
+// circumstances
+TEST(TestStatisticsFloatNaN, NaNValuesSpaced) {
+  constexpr int NUM_VALUES = 10;
+  NodePtr node = PrimitiveNode::Make("nan_float", Repetition::OPTIONAL, 
Type::FLOAT);
+  ColumnDescriptor descr(node, 1, 1);
+  float values[NUM_VALUES] = {std::nanf(""), -4.0f, -3.0f, -2.0f, -1.0f,
+  std::nanf(""), 1.0f,  2.0f,  3.0f,  
std::nanf("")};
+  float nan_values[NUM_VALUES];
+  for (int i = 0; i < NUM_VALUES; i++) {
+nan_values[i] = std::nanf("");
+  }
+  std::vector valid_bits(BitUtil::RoundUpNumBytes(NUM_VALUES) + 1, 
255);
+
+  // Test values
+  TypedRowGroupStatistics nan_stats();
+  nan_stats.UpdateSpaced([0], valid_bits.data(), 0, NUM_VALUES, 0);
+  float min = nan_stats.min();
+  float max = nan_stats.max();
+  ASSERT_EQ(min, -4.0f);
+  ASSERT_EQ(max, 3.0f);
+
+  // Test all NaNs
+  TypedRowGroupStatistics all_nan_stats();
+  all_nan_stats.UpdateSpaced(_values[0], valid_bits.data(), 0, NUM_VALUES, 
0);
+  min = all_nan_stats.min();
+  max = all_nan_stats.max();
+  ASSERT_TRUE(std::isnan(min));
+  ASSERT_TRUE(std::isnan(max));
+
+  // Test values followed by all NaNs
+  nan_stats.UpdateSpaced(_values[0], valid_bits.data(), 0, NUM_VALUES, 0);
+  min = nan_stats.min();
+  max = nan_stats.max();
+  ASSERT_EQ(min, -4.0f);
+  ASSERT_EQ(max, 3.0f);
+
+  // Test all NaNs followed by values
+  all_nan_stats.UpdateSpaced([0], valid_bits.data(), 0, NUM_VALUES, 0);
+  min = all_nan_stats.min();
+  max = all_nan_stats.max();
+  ASSERT_EQ(min, -4.0f);
+  ASSERT_EQ(max, 3.0f);
+
+  // Test values followed by all NaNs followed by values
+  nan_stats.UpdateSpaced([0], valid_bits.data(), 0, NUM_VALUES, 0);
+  min = nan_stats.min();
+  max = nan_stats.max();
+  ASSERT_EQ(min, -4.0f);
+  ASSERT_EQ(max, 3.0f);
+}
+
+// NaN double values may lead to incorrect filtering under certain 
circumstances
+TEST(TestStatisticsDoubleNaN, NaNValues) {
+  constexpr int NUM_VALUES = 10;
+  NodePtr node = PrimitiveNode::Make("nan_double", Repetition::OPTIONAL, 
Type::DOUBLE);
+  ColumnDescriptor descr(node, 1, 1);
+  TypedRowGroupStatistics nan_stats();
+  double values[NUM_VALUES] = {std::nan(""), std::nan(""), -3.0, -2.0, -1.0,
+   0.0,  1.0,  2.0,  3.0,  4.0};
+  double* values_ptr = [0];
+  

[jira] [Commented] (PARQUET-1225) NaN values may lead to incorrect filtering under certain circumstances

2018-02-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16375714#comment-16375714
 ] 

ASF GitHub Bot commented on PARQUET-1225:
-

xhochy commented on issue #444: PARQUET-1225: NaN values may lead to incorrect 
filtering under certai…
URL: https://github.com/apache/parquet-cpp/pull/444#issuecomment-368248534
 
 
   @majetideepak @boroknagyz @zivanfi I think with the PR we are now ready to 
make a new RC? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> NaN values may lead to incorrect filtering under certain circumstances
> --
>
> Key: PARQUET-1225
> URL: https://issues.apache.org/jira/browse/PARQUET-1225
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-cpp
>Reporter: Zoltan Ivanfi
>Assignee: Deepak Majeti
>Priority: Major
> Fix For: cpp-1.4.0
>
>
> _This JIRA describes a generic problem with floating point comparisons that 
> *most probably* affects parquet-cpp. It is known to affect Impala and by 
> taking a quick look at the parquet-cpp code it seems to affect parquet-cpp as 
> well, but it has not yet been confirmed in practice._
> For comparing float and double values for min/max stats, parquet-cpp uses the 
> C++ less-than operator (<) that returns false for comparisons involving a 
> NaN. This means that while garthering statistics, if a NaN is the smallest 
> value encountered so far (which happens to be the case after reading the 
> first value if that value is NaN), no other value can ever replace it, since 
> < will always be false. On the other hand, if NaN is not the first value, it 
> won't affect the min value. So the min value depends on the order of elements.
> If looking for specific values while reading back the data, the NaN value may 
> lead to row groups being incorrectly discarded in spite of having matching 
> rows. For details, please see the Impala bug IMPALA-6527.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1225) NaN values may lead to incorrect filtering under certain circumstances

2018-02-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372978#comment-16372978
 ] 

ASF GitHub Bot commented on PARQUET-1225:
-

boroknagyz commented on issue #444: PARQUET-1225: NaN values may lead to 
incorrect filtering under certai…
URL: https://github.com/apache/parquet-cpp/pull/444#issuecomment-367726337
 
 
   Thanks for applying the changes! To me it looks good generally.
   The Update() and UpdateSpaced() functions share some common code parts (not 
necessarily introduced by this PR, e.g. the last if statements of the 
functions), maybe worth a little refactoring.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> NaN values may lead to incorrect filtering under certain circumstances
> --
>
> Key: PARQUET-1225
> URL: https://issues.apache.org/jira/browse/PARQUET-1225
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-cpp
>Reporter: Zoltan Ivanfi
>Assignee: Deepak Majeti
>Priority: Major
>
> _This JIRA describes a generic problem with floating point comparisons that 
> *most probably* affects parquet-cpp. It is known to affect Impala and by 
> taking a quick look at the parquet-cpp code it seems to affect parquet-cpp as 
> well, but it has not yet been confirmed in practice._
> For comparing float and double values for min/max stats, parquet-cpp uses the 
> C++ less-than operator (<) that returns false for comparisons involving a 
> NaN. This means that while garthering statistics, if a NaN is the smallest 
> value encountered so far (which happens to be the case after reading the 
> first value if that value is NaN), no other value can ever replace it, since 
> < will always be false. On the other hand, if NaN is not the first value, it 
> won't affect the min value. So the min value depends on the order of elements.
> If looking for specific values while reading back the data, the NaN value may 
> lead to row groups being incorrectly discarded in spite of having matching 
> rows. For details, please see the Impala bug IMPALA-6527.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1225) NaN values may lead to incorrect filtering under certain circumstances

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372008#comment-16372008
 ] 

ASF GitHub Bot commented on PARQUET-1225:
-

majetideepak commented on issue #444: PARQUET-1225: NaN values may lead to 
incorrect filtering under certai…
URL: https://github.com/apache/parquet-cpp/pull/444#issuecomment-367471665
 
 
   All the changes have been made. @boroknagyz  and @wesm  please let me know 
if you have more feedback!


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> NaN values may lead to incorrect filtering under certain circumstances
> --
>
> Key: PARQUET-1225
> URL: https://issues.apache.org/jira/browse/PARQUET-1225
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-cpp
>Reporter: Zoltan Ivanfi
>Assignee: Deepak Majeti
>Priority: Major
>
> _This JIRA describes a generic problem with floating point comparisons that 
> *most probably* affects parquet-cpp. It is known to affect Impala and by 
> taking a quick look at the parquet-cpp code it seems to affect parquet-cpp as 
> well, but it has not yet been confirmed in practice._
> For comparing float and double values for min/max stats, parquet-cpp uses the 
> C++ less-than operator (<) that returns false for comparisons involving a 
> NaN. This means that while garthering statistics, if a NaN is the smallest 
> value encountered so far (which happens to be the case after reading the 
> first value if that value is NaN), no other value can ever replace it, since 
> < will always be false. On the other hand, if NaN is not the first value, it 
> won't affect the min value. So the min value depends on the order of elements.
> If looking for specific values while reading back the data, the NaN value may 
> lead to row groups being incorrectly discarded in spite of having matching 
> rows. For details, please see the Impala bug IMPALA-6527.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1225) NaN values may lead to incorrect filtering under certain circumstances

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371576#comment-16371576
 ] 

ASF GitHub Bot commented on PARQUET-1225:
-

majetideepak commented on a change in pull request #444: [WIP] PARQUET-1225: 
NaN values may lead to incorrect filtering under certai…
URL: https://github.com/apache/parquet-cpp/pull/444#discussion_r169683068
 
 

 ##
 File path: src/parquet/statistics.cc
 ##
 @@ -96,6 +96,67 @@ void TypedRowGroupStatistics::Reset() {
   has_min_max_ = false;
 }
 
+template 
+inline int getValueBeginOffset(const T* values, int64_t count) {
+  return 0;
+}
+
+template 
+inline int getValueEndOffset(const T* values, int64_t count) {
+  return count;
+}
+
+template 
+inline bool notNaN (const T* value) {
+  return true;
+}
+
+template <>
+inline int getValueBeginOffset(const float* values, int64_t count) {
 
 Review comment:
   Thanks for this tip. Very helpful! Will include other review comments.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> NaN values may lead to incorrect filtering under certain circumstances
> --
>
> Key: PARQUET-1225
> URL: https://issues.apache.org/jira/browse/PARQUET-1225
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-cpp
>Reporter: Zoltan Ivanfi
>Assignee: Deepak Majeti
>Priority: Major
>
> _This JIRA describes a generic problem with floating point comparisons that 
> *most probably* affects parquet-cpp. It is known to affect Impala and by 
> taking a quick look at the parquet-cpp code it seems to affect parquet-cpp as 
> well, but it has not yet been confirmed in practice._
> For comparing float and double values for min/max stats, parquet-cpp uses the 
> C++ less-than operator (<) that returns false for comparisons involving a 
> NaN. This means that while garthering statistics, if a NaN is the smallest 
> value encountered so far (which happens to be the case after reading the 
> first value if that value is NaN), no other value can ever replace it, since 
> < will always be false. On the other hand, if NaN is not the first value, it 
> won't affect the min value. So the min value depends on the order of elements.
> If looking for specific values while reading back the data, the NaN value may 
> lead to row groups being incorrectly discarded in spite of having matching 
> rows. For details, please see the Impala bug IMPALA-6527.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1225) NaN values may lead to incorrect filtering under certain circumstances

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371553#comment-16371553
 ] 

ASF GitHub Bot commented on PARQUET-1225:
-

wesm commented on a change in pull request #444: [WIP] PARQUET-1225: NaN values 
may lead to incorrect filtering under certai…
URL: https://github.com/apache/parquet-cpp/pull/444#discussion_r169673541
 
 

 ##
 File path: src/parquet/statistics.cc
 ##
 @@ -96,6 +96,67 @@ void TypedRowGroupStatistics::Reset() {
   has_min_max_ = false;
 }
 
+template 
+inline int getValueBeginOffset(const T* values, int64_t count) {
+  return 0;
+}
+
+template 
+inline int getValueEndOffset(const T* values, int64_t count) {
+  return count;
+}
+
+template 
+inline bool notNaN (const T* value) {
+  return true;
+}
+
+template <>
+inline int getValueBeginOffset(const float* values, int64_t count) {
+  // Skip NaNs
+  for (int64_t i = 0; i < count; i++) {
+ if (!std::isnan(values[i])) return i;
+  }
+  return count;
+}
+
+template <>
+inline int getValueEndOffset(const float* values, int64_t count) {
+  // Skip NaNs
+  for (int64_t i = (count - 1); i > 0; i--) {
+ if (!std::isnan(values[i])) return (i + 1);
 
 Review comment:
   Braces


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> NaN values may lead to incorrect filtering under certain circumstances
> --
>
> Key: PARQUET-1225
> URL: https://issues.apache.org/jira/browse/PARQUET-1225
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-cpp
>Reporter: Zoltan Ivanfi
>Assignee: Deepak Majeti
>Priority: Major
>
> _This JIRA describes a generic problem with floating point comparisons that 
> *most probably* affects parquet-cpp. It is known to affect Impala and by 
> taking a quick look at the parquet-cpp code it seems to affect parquet-cpp as 
> well, but it has not yet been confirmed in practice._
> For comparing float and double values for min/max stats, parquet-cpp uses the 
> C++ less-than operator (<) that returns false for comparisons involving a 
> NaN. This means that while garthering statistics, if a NaN is the smallest 
> value encountered so far (which happens to be the case after reading the 
> first value if that value is NaN), no other value can ever replace it, since 
> < will always be false. On the other hand, if NaN is not the first value, it 
> won't affect the min value. So the min value depends on the order of elements.
> If looking for specific values while reading back the data, the NaN value may 
> lead to row groups being incorrectly discarded in spite of having matching 
> rows. For details, please see the Impala bug IMPALA-6527.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1225) NaN values may lead to incorrect filtering under certain circumstances

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371552#comment-16371552
 ] 

ASF GitHub Bot commented on PARQUET-1225:
-

wesm commented on a change in pull request #444: [WIP] PARQUET-1225: NaN values 
may lead to incorrect filtering under certai…
URL: https://github.com/apache/parquet-cpp/pull/444#discussion_r169675496
 
 

 ##
 File path: src/parquet/statistics.cc
 ##
 @@ -96,6 +96,67 @@ void TypedRowGroupStatistics::Reset() {
   has_min_max_ = false;
 }
 
+template 
+inline int getValueBeginOffset(const T* values, int64_t count) {
+  return 0;
+}
+
+template 
+inline int getValueEndOffset(const T* values, int64_t count) {
+  return count;
+}
+
+template 
+inline bool notNaN (const T* value) {
+  return true;
+}
+
+template <>
+inline int getValueBeginOffset(const float* values, int64_t count) {
 
 Review comment:
   If we used `std::is_floating_point` and the functor pattern, we could avoid 
code duplication
   
   ```c++
   template 
   struct StatsHelper {
 ...
   };
   
   template 
   struct StatsHelper::type> {
 ...
   };
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> NaN values may lead to incorrect filtering under certain circumstances
> --
>
> Key: PARQUET-1225
> URL: https://issues.apache.org/jira/browse/PARQUET-1225
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-cpp
>Reporter: Zoltan Ivanfi
>Assignee: Deepak Majeti
>Priority: Major
>
> _This JIRA describes a generic problem with floating point comparisons that 
> *most probably* affects parquet-cpp. It is known to affect Impala and by 
> taking a quick look at the parquet-cpp code it seems to affect parquet-cpp as 
> well, but it has not yet been confirmed in practice._
> For comparing float and double values for min/max stats, parquet-cpp uses the 
> C++ less-than operator (<) that returns false for comparisons involving a 
> NaN. This means that while garthering statistics, if a NaN is the smallest 
> value encountered so far (which happens to be the case after reading the 
> first value if that value is NaN), no other value can ever replace it, since 
> < will always be false. On the other hand, if NaN is not the first value, it 
> won't affect the min value. So the min value depends on the order of elements.
> If looking for specific values while reading back the data, the NaN value may 
> lead to row groups being incorrectly discarded in spite of having matching 
> rows. For details, please see the Impala bug IMPALA-6527.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1225) NaN values may lead to incorrect filtering under certain circumstances

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371547#comment-16371547
 ] 

ASF GitHub Bot commented on PARQUET-1225:
-

wesm commented on a change in pull request #444: [WIP] PARQUET-1225: NaN values 
may lead to incorrect filtering under certai…
URL: https://github.com/apache/parquet-cpp/pull/444#discussion_r169673332
 
 

 ##
 File path: src/parquet/statistics.cc
 ##
 @@ -96,6 +96,67 @@ void TypedRowGroupStatistics::Reset() {
   has_min_max_ = false;
 }
 
+template 
+inline int getValueBeginOffset(const T* values, int64_t count) {
+  return 0;
+}
+
+template 
+inline int getValueEndOffset(const T* values, int64_t count) {
+  return count;
+}
+
+template 
+inline bool notNaN (const T* value) {
+  return true;
+}
+
+template <>
+inline int getValueBeginOffset(const float* values, int64_t count) {
+  // Skip NaNs
+  for (int64_t i = 0; i < count; i++) {
+ if (!std::isnan(values[i])) return i;
 
 Review comment:
   Add braces


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> NaN values may lead to incorrect filtering under certain circumstances
> --
>
> Key: PARQUET-1225
> URL: https://issues.apache.org/jira/browse/PARQUET-1225
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-cpp
>Reporter: Zoltan Ivanfi
>Assignee: Deepak Majeti
>Priority: Major
>
> _This JIRA describes a generic problem with floating point comparisons that 
> *most probably* affects parquet-cpp. It is known to affect Impala and by 
> taking a quick look at the parquet-cpp code it seems to affect parquet-cpp as 
> well, but it has not yet been confirmed in practice._
> For comparing float and double values for min/max stats, parquet-cpp uses the 
> C++ less-than operator (<) that returns false for comparisons involving a 
> NaN. This means that while garthering statistics, if a NaN is the smallest 
> value encountered so far (which happens to be the case after reading the 
> first value if that value is NaN), no other value can ever replace it, since 
> < will always be false. On the other hand, if NaN is not the first value, it 
> won't affect the min value. So the min value depends on the order of elements.
> If looking for specific values while reading back the data, the NaN value may 
> lead to row groups being incorrectly discarded in spite of having matching 
> rows. For details, please see the Impala bug IMPALA-6527.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1225) NaN values may lead to incorrect filtering under certain circumstances

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371549#comment-16371549
 ] 

ASF GitHub Bot commented on PARQUET-1225:
-

wesm commented on a change in pull request #444: [WIP] PARQUET-1225: NaN values 
may lead to incorrect filtering under certai…
URL: https://github.com/apache/parquet-cpp/pull/444#discussion_r169673252
 
 

 ##
 File path: src/parquet/statistics.cc
 ##
 @@ -96,6 +96,67 @@ void TypedRowGroupStatistics::Reset() {
   has_min_max_ = false;
 }
 
+template 
+inline int getValueBeginOffset(const T* values, int64_t count) {
 
 Review comment:
   Can we use compliant naming in these new functions (function names should be 
capitalized)?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> NaN values may lead to incorrect filtering under certain circumstances
> --
>
> Key: PARQUET-1225
> URL: https://issues.apache.org/jira/browse/PARQUET-1225
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-cpp
>Reporter: Zoltan Ivanfi
>Assignee: Deepak Majeti
>Priority: Major
>
> _This JIRA describes a generic problem with floating point comparisons that 
> *most probably* affects parquet-cpp. It is known to affect Impala and by 
> taking a quick look at the parquet-cpp code it seems to affect parquet-cpp as 
> well, but it has not yet been confirmed in practice._
> For comparing float and double values for min/max stats, parquet-cpp uses the 
> C++ less-than operator (<) that returns false for comparisons involving a 
> NaN. This means that while garthering statistics, if a NaN is the smallest 
> value encountered so far (which happens to be the case after reading the 
> first value if that value is NaN), no other value can ever replace it, since 
> < will always be false. On the other hand, if NaN is not the first value, it 
> won't affect the min value. So the min value depends on the order of elements.
> If looking for specific values while reading back the data, the NaN value may 
> lead to row groups being incorrectly discarded in spite of having matching 
> rows. For details, please see the Impala bug IMPALA-6527.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1225) NaN values may lead to incorrect filtering under certain circumstances

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371548#comment-16371548
 ] 

ASF GitHub Bot commented on PARQUET-1225:
-

wesm commented on a change in pull request #444: [WIP] PARQUET-1225: NaN values 
may lead to incorrect filtering under certai…
URL: https://github.com/apache/parquet-cpp/pull/444#discussion_r169674237
 
 

 ##
 File path: src/parquet/statistics.cc
 ##
 @@ -96,6 +96,67 @@ void TypedRowGroupStatistics::Reset() {
   has_min_max_ = false;
 }
 
+template 
+inline int getValueBeginOffset(const T* values, int64_t count) {
+  return 0;
+}
+
+template 
+inline int getValueEndOffset(const T* values, int64_t count) {
+  return count;
+}
+
+template 
+inline bool notNaN (const T* value) {
+  return true;
+}
+
+template <>
+inline int getValueBeginOffset(const float* values, int64_t count) {
+  // Skip NaNs
+  for (int64_t i = 0; i < count; i++) {
+ if (!std::isnan(values[i])) return i;
+  }
+  return count;
+}
+
+template <>
+inline int getValueEndOffset(const float* values, int64_t count) {
+  // Skip NaNs
+  for (int64_t i = (count - 1); i > 0; i--) {
+ if (!std::isnan(values[i])) return (i + 1);
+  }
+  return count;
+}
+
+template <>
+inline bool notNaN(const float* value) {
+  return !std::isnan(*value);
 
 Review comment:
   Would it be better to use `return value == value` here (with the pointer -> 
value change per above)? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> NaN values may lead to incorrect filtering under certain circumstances
> --
>
> Key: PARQUET-1225
> URL: https://issues.apache.org/jira/browse/PARQUET-1225
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-cpp
>Reporter: Zoltan Ivanfi
>Assignee: Deepak Majeti
>Priority: Major
>
> _This JIRA describes a generic problem with floating point comparisons that 
> *most probably* affects parquet-cpp. It is known to affect Impala and by 
> taking a quick look at the parquet-cpp code it seems to affect parquet-cpp as 
> well, but it has not yet been confirmed in practice._
> For comparing float and double values for min/max stats, parquet-cpp uses the 
> C++ less-than operator (<) that returns false for comparisons involving a 
> NaN. This means that while garthering statistics, if a NaN is the smallest 
> value encountered so far (which happens to be the case after reading the 
> first value if that value is NaN), no other value can ever replace it, since 
> < will always be false. On the other hand, if NaN is not the first value, it 
> won't affect the min value. So the min value depends on the order of elements.
> If looking for specific values while reading back the data, the NaN value may 
> lead to row groups being incorrectly discarded in spite of having matching 
> rows. For details, please see the Impala bug IMPALA-6527.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1225) NaN values may lead to incorrect filtering under certain circumstances

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371546#comment-16371546
 ] 

ASF GitHub Bot commented on PARQUET-1225:
-

wesm commented on a change in pull request #444: [WIP] PARQUET-1225: NaN values 
may lead to incorrect filtering under certai…
URL: https://github.com/apache/parquet-cpp/pull/444#discussion_r169673913
 
 

 ##
 File path: src/parquet/statistics.cc
 ##
 @@ -96,6 +96,67 @@ void TypedRowGroupStatistics::Reset() {
   has_min_max_ = false;
 }
 
+template 
+inline int getValueBeginOffset(const T* values, int64_t count) {
+  return 0;
+}
+
+template 
+inline int getValueEndOffset(const T* values, int64_t count) {
+  return count;
+}
+
+template 
+inline bool notNaN (const T* value) {
 
 Review comment:
   Pass `const T value` instead of pointer?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> NaN values may lead to incorrect filtering under certain circumstances
> --
>
> Key: PARQUET-1225
> URL: https://issues.apache.org/jira/browse/PARQUET-1225
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-cpp
>Reporter: Zoltan Ivanfi
>Assignee: Deepak Majeti
>Priority: Major
>
> _This JIRA describes a generic problem with floating point comparisons that 
> *most probably* affects parquet-cpp. It is known to affect Impala and by 
> taking a quick look at the parquet-cpp code it seems to affect parquet-cpp as 
> well, but it has not yet been confirmed in practice._
> For comparing float and double values for min/max stats, parquet-cpp uses the 
> C++ less-than operator (<) that returns false for comparisons involving a 
> NaN. This means that while garthering statistics, if a NaN is the smallest 
> value encountered so far (which happens to be the case after reading the 
> first value if that value is NaN), no other value can ever replace it, since 
> < will always be false. On the other hand, if NaN is not the first value, it 
> won't affect the min value. So the min value depends on the order of elements.
> If looking for specific values while reading back the data, the NaN value may 
> lead to row groups being incorrectly discarded in spite of having matching 
> rows. For details, please see the Impala bug IMPALA-6527.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1225) NaN values may lead to incorrect filtering under certain circumstances

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371551#comment-16371551
 ] 

ASF GitHub Bot commented on PARQUET-1225:
-

wesm commented on a change in pull request #444: [WIP] PARQUET-1225: NaN values 
may lead to incorrect filtering under certai…
URL: https://github.com/apache/parquet-cpp/pull/444#discussion_r169674273
 
 

 ##
 File path: src/parquet/statistics.cc
 ##
 @@ -96,6 +96,67 @@ void TypedRowGroupStatistics::Reset() {
   has_min_max_ = false;
 }
 
+template 
+inline int getValueBeginOffset(const T* values, int64_t count) {
+  return 0;
+}
+
+template 
+inline int getValueEndOffset(const T* values, int64_t count) {
+  return count;
+}
+
+template 
+inline bool notNaN (const T* value) {
+  return true;
+}
+
+template <>
+inline int getValueBeginOffset(const float* values, int64_t count) {
+  // Skip NaNs
+  for (int64_t i = 0; i < count; i++) {
+ if (!std::isnan(values[i])) return i;
+  }
+  return count;
+}
+
+template <>
+inline int getValueEndOffset(const float* values, int64_t count) {
+  // Skip NaNs
+  for (int64_t i = (count - 1); i > 0; i--) {
+ if (!std::isnan(values[i])) return (i + 1);
+  }
+  return count;
+}
+
+template <>
+inline bool notNaN(const float* value) {
+  return !std::isnan(*value);
+}
+
+template <>
+inline int getValueBeginOffset(const double* values, int64_t count) {
+  // Skip NaNs
+  for (int64_t i = 0; i < count; i++) {
+ if (!std::isnan(values[i])) return i;
 
 Review comment:
   Braces


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> NaN values may lead to incorrect filtering under certain circumstances
> --
>
> Key: PARQUET-1225
> URL: https://issues.apache.org/jira/browse/PARQUET-1225
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-cpp
>Reporter: Zoltan Ivanfi
>Assignee: Deepak Majeti
>Priority: Major
>
> _This JIRA describes a generic problem with floating point comparisons that 
> *most probably* affects parquet-cpp. It is known to affect Impala and by 
> taking a quick look at the parquet-cpp code it seems to affect parquet-cpp as 
> well, but it has not yet been confirmed in practice._
> For comparing float and double values for min/max stats, parquet-cpp uses the 
> C++ less-than operator (<) that returns false for comparisons involving a 
> NaN. This means that while garthering statistics, if a NaN is the smallest 
> value encountered so far (which happens to be the case after reading the 
> first value if that value is NaN), no other value can ever replace it, since 
> < will always be false. On the other hand, if NaN is not the first value, it 
> won't affect the min value. So the min value depends on the order of elements.
> If looking for specific values while reading back the data, the NaN value may 
> lead to row groups being incorrectly discarded in spite of having matching 
> rows. For details, please see the Impala bug IMPALA-6527.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1225) NaN values may lead to incorrect filtering under certain circumstances

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371375#comment-16371375
 ] 

ASF GitHub Bot commented on PARQUET-1225:
-

majetideepak commented on a change in pull request #444: [WIP] PARQUET-1225: 
NaN values may lead to incorrect filtering under certai…
URL: https://github.com/apache/parquet-cpp/pull/444#discussion_r169628741
 
 

 ##
 File path: src/parquet/statistics.cc
 ##
 @@ -96,6 +96,67 @@ void TypedRowGroupStatistics::Reset() {
   has_min_max_ = false;
 }
 
+template 
+inline int getValueBeginOffset(const T* values, int64_t count) {
+  return 0;
+}
+
+template 
+inline int getValueEndOffset(const T* values, int64_t count) {
+  return count;
+}
+
+template 
+inline bool notNaN (const T* value) {
+  return true;
+}
+
+template <>
+inline int getValueBeginOffset(const float* values, int64_t count) {
+  // Skip NaNs
+  for (int64_t i = 0; i < count; i++) {
+ if (!std::isnan(values[i])) return i;
+  }
+  return count;
+}
+
+template <>
+inline int getValueEndOffset(const float* values, int64_t count) {
+  // Skip NaNs
+  for (int64_t i = (count - 1); i > 0; i--) {
+ if (!std::isnan(values[i])) return (i + 1);
+  }
+  return count;
+}
+
+template <>
+inline bool notNaN(const float* value) {
+  return !std::isnan(*value);
+}
+
+template <>
+inline int getValueBeginOffset(const double* values, int64_t count) {
+  // Skip NaNs
+  for (int64_t i = 0; i < count; i++) {
+ if (!std::isnan(values[i])) return i;
+  }
+  return 0;
 
 Review comment:
   Fixed!


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> NaN values may lead to incorrect filtering under certain circumstances
> --
>
> Key: PARQUET-1225
> URL: https://issues.apache.org/jira/browse/PARQUET-1225
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-cpp
>Reporter: Zoltan Ivanfi
>Assignee: Deepak Majeti
>Priority: Major
>
> _This JIRA describes a generic problem with floating point comparisons that 
> *most probably* affects parquet-cpp. It is known to affect Impala and by 
> taking a quick look at the parquet-cpp code it seems to affect parquet-cpp as 
> well, but it has not yet been confirmed in practice._
> For comparing float and double values for min/max stats, parquet-cpp uses the 
> C++ less-than operator (<) that returns false for comparisons involving a 
> NaN. This means that while garthering statistics, if a NaN is the smallest 
> value encountered so far (which happens to be the case after reading the 
> first value if that value is NaN), no other value can ever replace it, since 
> < will always be false. On the other hand, if NaN is not the first value, it 
> won't affect the min value. So the min value depends on the order of elements.
> If looking for specific values while reading back the data, the NaN value may 
> lead to row groups being incorrectly discarded in spite of having matching 
> rows. For details, please see the Impala bug IMPALA-6527.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1225) NaN values may lead to incorrect filtering under certain circumstances

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371376#comment-16371376
 ] 

ASF GitHub Bot commented on PARQUET-1225:
-

majetideepak commented on a change in pull request #444: [WIP] PARQUET-1225: 
NaN values may lead to incorrect filtering under certai…
URL: https://github.com/apache/parquet-cpp/pull/444#discussion_r169628841
 
 

 ##
 File path: src/parquet/statistics.cc
 ##
 @@ -96,6 +96,67 @@ void TypedRowGroupStatistics::Reset() {
   has_min_max_ = false;
 }
 
+template 
+inline int getValueBeginOffset(const T* values, int64_t count) {
+  return 0;
+}
+
+template 
+inline int getValueEndOffset(const T* values, int64_t count) {
+  return count;
+}
+
+template 
+inline bool notNaN (const T* value) {
+  return true;
+}
+
+template <>
+inline int getValueBeginOffset(const float* values, int64_t count) {
+  // Skip NaNs
+  for (int64_t i = 0; i < count; i++) {
+ if (!std::isnan(values[i])) return i;
+  }
+  return count;
+}
+
+template <>
+inline int getValueEndOffset(const float* values, int64_t count) {
+  // Skip NaNs
+  for (int64_t i = (count - 1); i > 0; i--) {
+ if (!std::isnan(values[i])) return (i + 1);
+  }
+  return count;
+}
+
+template <>
+inline bool notNaN(const float* value) {
+  return !std::isnan(*value);
+}
+
+template <>
+inline int getValueBeginOffset(const double* values, int64_t count) {
+  // Skip NaNs
+  for (int64_t i = 0; i < count; i++) {
+ if (!std::isnan(values[i])) return i;
+  }
+  return 0;
+}
+
+template <>
+inline int getValueEndOffset(const double* values, int64_t count) {
+  // Skip NaNs
+  for (int64_t i = (count - 1); i > 0; i--) {
 
 Review comment:
   Fixed!


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> NaN values may lead to incorrect filtering under certain circumstances
> --
>
> Key: PARQUET-1225
> URL: https://issues.apache.org/jira/browse/PARQUET-1225
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-cpp
>Reporter: Zoltan Ivanfi
>Assignee: Deepak Majeti
>Priority: Major
>
> _This JIRA describes a generic problem with floating point comparisons that 
> *most probably* affects parquet-cpp. It is known to affect Impala and by 
> taking a quick look at the parquet-cpp code it seems to affect parquet-cpp as 
> well, but it has not yet been confirmed in practice._
> For comparing float and double values for min/max stats, parquet-cpp uses the 
> C++ less-than operator (<) that returns false for comparisons involving a 
> NaN. This means that while garthering statistics, if a NaN is the smallest 
> value encountered so far (which happens to be the case after reading the 
> first value if that value is NaN), no other value can ever replace it, since 
> < will always be false. On the other hand, if NaN is not the first value, it 
> won't affect the min value. So the min value depends on the order of elements.
> If looking for specific values while reading back the data, the NaN value may 
> lead to row groups being incorrectly discarded in spite of having matching 
> rows. For details, please see the Impala bug IMPALA-6527.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1225) NaN values may lead to incorrect filtering under certain circumstances

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371373#comment-16371373
 ] 

ASF GitHub Bot commented on PARQUET-1225:
-

majetideepak commented on a change in pull request #444: [WIP] PARQUET-1225: 
NaN values may lead to incorrect filtering under certai…
URL: https://github.com/apache/parquet-cpp/pull/444#discussion_r169628691
 
 

 ##
 File path: src/parquet/statistics.cc
 ##
 @@ -96,6 +96,67 @@ void TypedRowGroupStatistics::Reset() {
   has_min_max_ = false;
 }
 
+template 
+inline int getValueBeginOffset(const T* values, int64_t count) {
+  return 0;
+}
+
+template 
+inline int getValueEndOffset(const T* values, int64_t count) {
+  return count;
+}
+
+template 
+inline bool notNaN (const T* value) {
+  return true;
+}
+
+template <>
+inline int getValueBeginOffset(const float* values, int64_t count) {
+  // Skip NaNs
+  for (int64_t i = 0; i < count; i++) {
+ if (!std::isnan(values[i])) return i;
+  }
+  return count;
+}
+
+template <>
+inline int getValueEndOffset(const float* values, int64_t count) {
+  // Skip NaNs
+  for (int64_t i = (count - 1); i > 0; i--) {
+ if (!std::isnan(values[i])) return (i + 1);
+  }
+  return count;
 
 Review comment:
   Fixed!


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> NaN values may lead to incorrect filtering under certain circumstances
> --
>
> Key: PARQUET-1225
> URL: https://issues.apache.org/jira/browse/PARQUET-1225
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-cpp
>Reporter: Zoltan Ivanfi
>Assignee: Deepak Majeti
>Priority: Major
>
> _This JIRA describes a generic problem with floating point comparisons that 
> *most probably* affects parquet-cpp. It is known to affect Impala and by 
> taking a quick look at the parquet-cpp code it seems to affect parquet-cpp as 
> well, but it has not yet been confirmed in practice._
> For comparing float and double values for min/max stats, parquet-cpp uses the 
> C++ less-than operator (<) that returns false for comparisons involving a 
> NaN. This means that while garthering statistics, if a NaN is the smallest 
> value encountered so far (which happens to be the case after reading the 
> first value if that value is NaN), no other value can ever replace it, since 
> < will always be false. On the other hand, if NaN is not the first value, it 
> won't affect the min value. So the min value depends on the order of elements.
> If looking for specific values while reading back the data, the NaN value may 
> lead to row groups being incorrectly discarded in spite of having matching 
> rows. For details, please see the Impala bug IMPALA-6527.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1225) NaN values may lead to incorrect filtering under certain circumstances

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371319#comment-16371319
 ] 

ASF GitHub Bot commented on PARQUET-1225:
-

majetideepak commented on a change in pull request #444: [WIP] PARQUET-1225: 
NaN values may lead to incorrect filtering under certai…
URL: https://github.com/apache/parquet-cpp/pull/444#discussion_r169621193
 
 

 ##
 File path: src/parquet/statistics.cc
 ##
 @@ -96,6 +96,67 @@ void TypedRowGroupStatistics::Reset() {
   has_min_max_ = false;
 }
 
+template 
+inline int getValueBeginOffset(const T* values, int64_t count) {
+  return 0;
+}
+
+template 
+inline int getValueEndOffset(const T* values, int64_t count) {
+  return count;
+}
+
+template 
+inline bool notNaN (const T* value) {
+  return true;
+}
+
+template <>
+inline int getValueBeginOffset(const float* values, int64_t count) {
+  // Skip NaNs
+  for (int64_t i = 0; i < count; i++) {
+ if (!std::isnan(values[i])) return i;
+  }
+  return count;
+}
+
+template <>
+inline int getValueEndOffset(const float* values, int64_t count) {
+  // Skip NaNs
+  for (int64_t i = (count - 1); i > 0; i--) {
 
 Review comment:
   correct!


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> NaN values may lead to incorrect filtering under certain circumstances
> --
>
> Key: PARQUET-1225
> URL: https://issues.apache.org/jira/browse/PARQUET-1225
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-cpp
>Reporter: Zoltan Ivanfi
>Assignee: Deepak Majeti
>Priority: Major
>
> _This JIRA describes a generic problem with floating point comparisons that 
> *most probably* affects parquet-cpp. It is known to affect Impala and by 
> taking a quick look at the parquet-cpp code it seems to affect parquet-cpp as 
> well, but it has not yet been confirmed in practice._
> For comparing float and double values for min/max stats, parquet-cpp uses the 
> C++ less-than operator (<) that returns false for comparisons involving a 
> NaN. This means that while garthering statistics, if a NaN is the smallest 
> value encountered so far (which happens to be the case after reading the 
> first value if that value is NaN), no other value can ever replace it, since 
> < will always be false. On the other hand, if NaN is not the first value, it 
> won't affect the min value. So the min value depends on the order of elements.
> If looking for specific values while reading back the data, the NaN value may 
> lead to row groups being incorrectly discarded in spite of having matching 
> rows. For details, please see the Impala bug IMPALA-6527.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1225) NaN values may lead to incorrect filtering under certain circumstances

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371285#comment-16371285
 ] 

ASF GitHub Bot commented on PARQUET-1225:
-

zivanfi commented on a change in pull request #444: [WIP] PARQUET-1225: NaN 
values may lead to incorrect filtering under certai…
URL: https://github.com/apache/parquet-cpp/pull/444#discussion_r169615220
 
 

 ##
 File path: src/parquet/statistics-test.cc
 ##
 @@ -659,5 +659,37 @@ TEST_F(TestStatisticsFLBA, UnknownSortOrder) {
   ASSERT_FALSE(cc_metadata->is_stats_set());
 }
 
+// PARQUET-1225: Float NaN values may lead to incorrect filtering under 
certain circumstances
+TEST(TestStatisticsFloatNan, NaNValues) {
+  constexpr int NUM_VALUES = 10;
+  NodePtr node = PrimitiveNode::Make("nan_float", Repetition::OPTIONAL, 
Type::FLOAT);
+  ColumnDescriptor descr(node, 1, 1);
+  TypedRowGroupStatistics nan_stats();
+  float values[NUM_VALUES] =
+{std::nanf(""), -4.0f, -3.0f, -2.0f, -1.0f, std::nanf(""), 1.0f, 2.0f, 
3.0f, std::nanf("")};
+  nan_stats.Update([0], NUM_VALUES, 0);
+  float min = nan_stats.min();
+  float max = nan_stats.max();
+
+  ASSERT_EQ(min, -4.0f);
+  ASSERT_EQ(max, 3.0f);
+}
+
+// NaN double values may lead to incorrect filtering under certain 
circumstances
+TEST(TestStatisticsDoubleNan, NaNValues) {
+  constexpr int NUM_VALUES = 10;
+  NodePtr node = PrimitiveNode::Make("nan_double", Repetition::OPTIONAL, 
Type::DOUBLE);
+  ColumnDescriptor descr(node, 1, 1);
+  TypedRowGroupStatistics nan_stats();
+  double values[NUM_VALUES] =
+{std::nan(""), std::nan(""), -3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0, 
4.0};
+  double* values_ptr = [0];
+  nan_stats.Update(values_ptr, NUM_VALUES, 0);
+  double min = nan_stats.min();
+  double max = nan_stats.max();
+
+  ASSERT_EQ(min, -3.0);
+  ASSERT_EQ(max, 4.0);
+}
 }  // namespace test
 
 Review comment:
   Ah, sorry, I just noticed you listed this yourself as something that is 
still left to be done.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> NaN values may lead to incorrect filtering under certain circumstances
> --
>
> Key: PARQUET-1225
> URL: https://issues.apache.org/jira/browse/PARQUET-1225
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-cpp
>Reporter: Zoltan Ivanfi
>Assignee: Deepak Majeti
>Priority: Major
>
> _This JIRA describes a generic problem with floating point comparisons that 
> *most probably* affects parquet-cpp. It is known to affect Impala and by 
> taking a quick look at the parquet-cpp code it seems to affect parquet-cpp as 
> well, but it has not yet been confirmed in practice._
> For comparing float and double values for min/max stats, parquet-cpp uses the 
> C++ less-than operator (<) that returns false for comparisons involving a 
> NaN. This means that while garthering statistics, if a NaN is the smallest 
> value encountered so far (which happens to be the case after reading the 
> first value if that value is NaN), no other value can ever replace it, since 
> < will always be false. On the other hand, if NaN is not the first value, it 
> won't affect the min value. So the min value depends on the order of elements.
> If looking for specific values while reading back the data, the NaN value may 
> lead to row groups being incorrectly discarded in spite of having matching 
> rows. For details, please see the Impala bug IMPALA-6527.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1225) NaN values may lead to incorrect filtering under certain circumstances

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371281#comment-16371281
 ] 

ASF GitHub Bot commented on PARQUET-1225:
-

zivanfi commented on a change in pull request #444: [WIP] PARQUET-1225: NaN 
values may lead to incorrect filtering under certai…
URL: https://github.com/apache/parquet-cpp/pull/444#discussion_r169614602
 
 

 ##
 File path: src/parquet/statistics-test.cc
 ##
 @@ -659,5 +659,37 @@ TEST_F(TestStatisticsFLBA, UnknownSortOrder) {
   ASSERT_FALSE(cc_metadata->is_stats_set());
 }
 
+// PARQUET-1225: Float NaN values may lead to incorrect filtering under 
certain circumstances
+TEST(TestStatisticsFloatNan, NaNValues) {
+  constexpr int NUM_VALUES = 10;
+  NodePtr node = PrimitiveNode::Make("nan_float", Repetition::OPTIONAL, 
Type::FLOAT);
+  ColumnDescriptor descr(node, 1, 1);
+  TypedRowGroupStatistics nan_stats();
+  float values[NUM_VALUES] =
+{std::nanf(""), -4.0f, -3.0f, -2.0f, -1.0f, std::nanf(""), 1.0f, 2.0f, 
3.0f, std::nanf("")};
+  nan_stats.Update([0], NUM_VALUES, 0);
+  float min = nan_stats.min();
+  float max = nan_stats.max();
+
+  ASSERT_EQ(min, -4.0f);
+  ASSERT_EQ(max, 3.0f);
+}
+
+// NaN double values may lead to incorrect filtering under certain 
circumstances
+TEST(TestStatisticsDoubleNan, NaNValues) {
+  constexpr int NUM_VALUES = 10;
+  NodePtr node = PrimitiveNode::Make("nan_double", Repetition::OPTIONAL, 
Type::DOUBLE);
+  ColumnDescriptor descr(node, 1, 1);
+  TypedRowGroupStatistics nan_stats();
+  double values[NUM_VALUES] =
+{std::nan(""), std::nan(""), -3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0, 
4.0};
+  double* values_ptr = [0];
+  nan_stats.Update(values_ptr, NUM_VALUES, 0);
+  double min = nan_stats.min();
+  double max = nan_stats.max();
+
+  ASSERT_EQ(min, -3.0);
+  ASSERT_EQ(max, 4.0);
+}
 }  // namespace test
 
 Review comment:
   I would suggest adding a test case for all values being NaN.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> NaN values may lead to incorrect filtering under certain circumstances
> --
>
> Key: PARQUET-1225
> URL: https://issues.apache.org/jira/browse/PARQUET-1225
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-cpp
>Reporter: Zoltan Ivanfi
>Assignee: Deepak Majeti
>Priority: Major
>
> _This JIRA describes a generic problem with floating point comparisons that 
> *most probably* affects parquet-cpp. It is known to affect Impala and by 
> taking a quick look at the parquet-cpp code it seems to affect parquet-cpp as 
> well, but it has not yet been confirmed in practice._
> For comparing float and double values for min/max stats, parquet-cpp uses the 
> C++ less-than operator (<) that returns false for comparisons involving a 
> NaN. This means that while garthering statistics, if a NaN is the smallest 
> value encountered so far (which happens to be the case after reading the 
> first value if that value is NaN), no other value can ever replace it, since 
> < will always be false. On the other hand, if NaN is not the first value, it 
> won't affect the min value. So the min value depends on the order of elements.
> If looking for specific values while reading back the data, the NaN value may 
> lead to row groups being incorrectly discarded in spite of having matching 
> rows. For details, please see the Impala bug IMPALA-6527.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1225) NaN values may lead to incorrect filtering under certain circumstances

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371282#comment-16371282
 ] 

ASF GitHub Bot commented on PARQUET-1225:
-

zivanfi commented on a change in pull request #444: [WIP] PARQUET-1225: NaN 
values may lead to incorrect filtering under certai…
URL: https://github.com/apache/parquet-cpp/pull/444#discussion_r169614173
 
 

 ##
 File path: src/parquet/statistics.cc
 ##
 @@ -107,8 +168,17 @@ void TypedRowGroupStatistics::Update(const T* 
values, int64_t num_not_nul
   // TODO: support distinct count?
   if (num_not_null == 0) return;
 
+  // PARQUET-1225: Handle NaNs
+  // The problem arises only if the starting/ending value(s)
 
 Review comment:
   Do NaN-s at the end actually cause trouble? I had the impression that only 
NaN-s at the beginning are problematic.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> NaN values may lead to incorrect filtering under certain circumstances
> --
>
> Key: PARQUET-1225
> URL: https://issues.apache.org/jira/browse/PARQUET-1225
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-cpp
>Reporter: Zoltan Ivanfi
>Assignee: Deepak Majeti
>Priority: Major
>
> _This JIRA describes a generic problem with floating point comparisons that 
> *most probably* affects parquet-cpp. It is known to affect Impala and by 
> taking a quick look at the parquet-cpp code it seems to affect parquet-cpp as 
> well, but it has not yet been confirmed in practice._
> For comparing float and double values for min/max stats, parquet-cpp uses the 
> C++ less-than operator (<) that returns false for comparisons involving a 
> NaN. This means that while garthering statistics, if a NaN is the smallest 
> value encountered so far (which happens to be the case after reading the 
> first value if that value is NaN), no other value can ever replace it, since 
> < will always be false. On the other hand, if NaN is not the first value, it 
> won't affect the min value. So the min value depends on the order of elements.
> If looking for specific values while reading back the data, the NaN value may 
> lead to row groups being incorrectly discarded in spite of having matching 
> rows. For details, please see the Impala bug IMPALA-6527.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1225) NaN values may lead to incorrect filtering under certain circumstances

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371224#comment-16371224
 ] 

ASF GitHub Bot commented on PARQUET-1225:
-

boroknagyz commented on a change in pull request #444: [WIP] PARQUET-1225: NaN 
values may lead to incorrect filtering under certai…
URL: https://github.com/apache/parquet-cpp/pull/444#discussion_r169597763
 
 

 ##
 File path: src/parquet/statistics.cc
 ##
 @@ -96,6 +96,67 @@ void TypedRowGroupStatistics::Reset() {
   has_min_max_ = false;
 }
 
+template 
+inline int getValueBeginOffset(const T* values, int64_t count) {
+  return 0;
+}
+
+template 
+inline int getValueEndOffset(const T* values, int64_t count) {
+  return count;
+}
+
+template 
+inline bool notNaN (const T* value) {
+  return true;
+}
+
+template <>
+inline int getValueBeginOffset(const float* values, int64_t count) {
+  // Skip NaNs
+  for (int64_t i = 0; i < count; i++) {
+ if (!std::isnan(values[i])) return i;
+  }
+  return count;
+}
+
+template <>
+inline int getValueEndOffset(const float* values, int64_t count) {
+  // Skip NaNs
+  for (int64_t i = (count - 1); i > 0; i--) {
+ if (!std::isnan(values[i])) return (i + 1);
+  }
+  return count;
+}
+
+template <>
+inline bool notNaN(const float* value) {
+  return !std::isnan(*value);
+}
+
+template <>
+inline int getValueBeginOffset(const double* values, int64_t count) {
+  // Skip NaNs
+  for (int64_t i = 0; i < count; i++) {
+ if (!std::isnan(values[i])) return i;
+  }
+  return 0;
+}
+
+template <>
+inline int getValueEndOffset(const double* values, int64_t count) {
+  // Skip NaNs
+  for (int64_t i = (count - 1); i > 0; i--) {
 
 Review comment:
   I think it should be `i >= 0` instead of `i > 0`
   Think of the case when only the first element is a number, e.g.:
   `{3.14}`, or
   `{3.14, NaN, NaN, NaN, NaN, ..., NaN}`
   For these inputs, this function will return 0.
   `getValueBeginOffset()` will also return 0.
   Usually, C++ ranges are interpreted as `[first, last)`, ie. they are open at 
the end. Therefore, `[0, 0)` is an empty range.
   However, this solution will work, because at L178 you don't return when 
`begin_offset_ == end_offset`, and `minmax_element()` returns `make_pair(first, 
first)` if the range is empty.
   
   But I feel like it currently works by accident, and would be better to fix 
this.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> NaN values may lead to incorrect filtering under certain circumstances
> --
>
> Key: PARQUET-1225
> URL: https://issues.apache.org/jira/browse/PARQUET-1225
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-cpp
>Reporter: Zoltan Ivanfi
>Assignee: Deepak Majeti
>Priority: Major
>
> _This JIRA describes a generic problem with floating point comparisons that 
> *most probably* affects parquet-cpp. It is known to affect Impala and by 
> taking a quick look at the parquet-cpp code it seems to affect parquet-cpp as 
> well, but it has not yet been confirmed in practice._
> For comparing float and double values for min/max stats, parquet-cpp uses the 
> C++ less-than operator (<) that returns false for comparisons involving a 
> NaN. This means that while garthering statistics, if a NaN is the smallest 
> value encountered so far (which happens to be the case after reading the 
> first value if that value is NaN), no other value can ever replace it, since 
> < will always be false. On the other hand, if NaN is not the first value, it 
> won't affect the min value. So the min value depends on the order of elements.
> If looking for specific values while reading back the data, the NaN value may 
> lead to row groups being incorrectly discarded in spite of having matching 
> rows. For details, please see the Impala bug IMPALA-6527.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1225) NaN values may lead to incorrect filtering under certain circumstances

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371223#comment-16371223
 ] 

ASF GitHub Bot commented on PARQUET-1225:
-

boroknagyz commented on a change in pull request #444: [WIP] PARQUET-1225: NaN 
values may lead to incorrect filtering under certai…
URL: https://github.com/apache/parquet-cpp/pull/444#discussion_r169592780
 
 

 ##
 File path: src/parquet/statistics.cc
 ##
 @@ -96,6 +96,67 @@ void TypedRowGroupStatistics::Reset() {
   has_min_max_ = false;
 }
 
+template 
+inline int getValueBeginOffset(const T* values, int64_t count) {
+  return 0;
+}
+
+template 
+inline int getValueEndOffset(const T* values, int64_t count) {
+  return count;
+}
+
+template 
+inline bool notNaN (const T* value) {
+  return true;
+}
+
+template <>
+inline int getValueBeginOffset(const float* values, int64_t count) {
+  // Skip NaNs
+  for (int64_t i = 0; i < count; i++) {
+ if (!std::isnan(values[i])) return i;
+  }
+  return count;
+}
+
+template <>
+inline int getValueEndOffset(const float* values, int64_t count) {
+  // Skip NaNs
+  for (int64_t i = (count - 1); i > 0; i--) {
+ if (!std::isnan(values[i])) return (i + 1);
+  }
+  return count;
 
 Review comment:
   To me it seems it should be "return 0"


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> NaN values may lead to incorrect filtering under certain circumstances
> --
>
> Key: PARQUET-1225
> URL: https://issues.apache.org/jira/browse/PARQUET-1225
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-cpp
>Reporter: Zoltan Ivanfi
>Assignee: Deepak Majeti
>Priority: Major
>
> _This JIRA describes a generic problem with floating point comparisons that 
> *most probably* affects parquet-cpp. It is known to affect Impala and by 
> taking a quick look at the parquet-cpp code it seems to affect parquet-cpp as 
> well, but it has not yet been confirmed in practice._
> For comparing float and double values for min/max stats, parquet-cpp uses the 
> C++ less-than operator (<) that returns false for comparisons involving a 
> NaN. This means that while garthering statistics, if a NaN is the smallest 
> value encountered so far (which happens to be the case after reading the 
> first value if that value is NaN), no other value can ever replace it, since 
> < will always be false. On the other hand, if NaN is not the first value, it 
> won't affect the min value. So the min value depends on the order of elements.
> If looking for specific values while reading back the data, the NaN value may 
> lead to row groups being incorrectly discarded in spite of having matching 
> rows. For details, please see the Impala bug IMPALA-6527.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1225) NaN values may lead to incorrect filtering under certain circumstances

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371225#comment-16371225
 ] 

ASF GitHub Bot commented on PARQUET-1225:
-

boroknagyz commented on a change in pull request #444: [WIP] PARQUET-1225: NaN 
values may lead to incorrect filtering under certai…
URL: https://github.com/apache/parquet-cpp/pull/444#discussion_r169592598
 
 

 ##
 File path: src/parquet/statistics.cc
 ##
 @@ -96,6 +96,67 @@ void TypedRowGroupStatistics::Reset() {
   has_min_max_ = false;
 }
 
+template 
+inline int getValueBeginOffset(const T* values, int64_t count) {
+  return 0;
+}
+
+template 
+inline int getValueEndOffset(const T* values, int64_t count) {
+  return count;
+}
+
+template 
+inline bool notNaN (const T* value) {
+  return true;
+}
+
+template <>
+inline int getValueBeginOffset(const float* values, int64_t count) {
+  // Skip NaNs
+  for (int64_t i = 0; i < count; i++) {
+ if (!std::isnan(values[i])) return i;
+  }
+  return count;
+}
+
+template <>
+inline int getValueEndOffset(const float* values, int64_t count) {
+  // Skip NaNs
+  for (int64_t i = (count - 1); i > 0; i--) {
 
 Review comment:
   I think it should be "i >= 0" instead of "i > 0"
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> NaN values may lead to incorrect filtering under certain circumstances
> --
>
> Key: PARQUET-1225
> URL: https://issues.apache.org/jira/browse/PARQUET-1225
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-cpp
>Reporter: Zoltan Ivanfi
>Assignee: Deepak Majeti
>Priority: Major
>
> _This JIRA describes a generic problem with floating point comparisons that 
> *most probably* affects parquet-cpp. It is known to affect Impala and by 
> taking a quick look at the parquet-cpp code it seems to affect parquet-cpp as 
> well, but it has not yet been confirmed in practice._
> For comparing float and double values for min/max stats, parquet-cpp uses the 
> C++ less-than operator (<) that returns false for comparisons involving a 
> NaN. This means that while garthering statistics, if a NaN is the smallest 
> value encountered so far (which happens to be the case after reading the 
> first value if that value is NaN), no other value can ever replace it, since 
> < will always be false. On the other hand, if NaN is not the first value, it 
> won't affect the min value. So the min value depends on the order of elements.
> If looking for specific values while reading back the data, the NaN value may 
> lead to row groups being incorrectly discarded in spite of having matching 
> rows. For details, please see the Impala bug IMPALA-6527.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1225) NaN values may lead to incorrect filtering under certain circumstances

2018-02-20 Thread Deepak Majeti (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16370767#comment-16370767
 ] 

Deepak Majeti commented on PARQUET-1225:


[~boroknagyz] I opened a PR here 
[https://github.com/apache/parquet-cpp/pull/444] and made some comments. Let me 
know your feedback.

> NaN values may lead to incorrect filtering under certain circumstances
> --
>
> Key: PARQUET-1225
> URL: https://issues.apache.org/jira/browse/PARQUET-1225
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-cpp
>Reporter: Zoltan Ivanfi
>Assignee: Deepak Majeti
>Priority: Major
>
> _This JIRA describes a generic problem with floating point comparisons that 
> *most probably* affects parquet-cpp. It is known to affect Impala and by 
> taking a quick look at the parquet-cpp code it seems to affect parquet-cpp as 
> well, but it has not yet been confirmed in practice._
> For comparing float and double values for min/max stats, parquet-cpp uses the 
> C++ less-than operator (<) that returns false for comparisons involving a 
> NaN. This means that while garthering statistics, if a NaN is the smallest 
> value encountered so far (which happens to be the case after reading the 
> first value if that value is NaN), no other value can ever replace it, since 
> < will always be false. On the other hand, if NaN is not the first value, it 
> won't affect the min value. So the min value depends on the order of elements.
> If looking for specific values while reading back the data, the NaN value may 
> lead to row groups being incorrectly discarded in spite of having matching 
> rows. For details, please see the Impala bug IMPALA-6527.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1225) NaN values may lead to incorrect filtering under certain circumstances

2018-02-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16370762#comment-16370762
 ] 

ASF GitHub Bot commented on PARQUET-1225:
-

majetideepak opened a new pull request #444: PARQUET-1225: NaN values may lead 
to incorrect filtering under certai…
URL: https://github.com/apache/parquet-cpp/pull/444
 
 
   …n circumstances


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> NaN values may lead to incorrect filtering under certain circumstances
> --
>
> Key: PARQUET-1225
> URL: https://issues.apache.org/jira/browse/PARQUET-1225
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-cpp
>Reporter: Zoltan Ivanfi
>Assignee: Deepak Majeti
>Priority: Major
>
> _This JIRA describes a generic problem with floating point comparisons that 
> *most probably* affects parquet-cpp. It is known to affect Impala and by 
> taking a quick look at the parquet-cpp code it seems to affect parquet-cpp as 
> well, but it has not yet been confirmed in practice._
> For comparing float and double values for min/max stats, parquet-cpp uses the 
> C++ less-than operator (<) that returns false for comparisons involving a 
> NaN. This means that while garthering statistics, if a NaN is the smallest 
> value encountered so far (which happens to be the case after reading the 
> first value if that value is NaN), no other value can ever replace it, since 
> < will always be false. On the other hand, if NaN is not the first value, it 
> won't affect the min value. So the min value depends on the order of elements.
> If looking for specific values while reading back the data, the NaN value may 
> lead to row groups being incorrectly discarded in spite of having matching 
> rows. For details, please see the Impala bug IMPALA-6527.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1225) NaN values may lead to incorrect filtering under certain circumstances

2018-02-20 Thread JIRA

[ 
https://issues.apache.org/jira/browse/PARQUET-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16370706#comment-16370706
 ] 

Zoltán Borók-Nagy commented on PARQUET-1225:


Hi [~mdeepak], the proposed quick fix for Impala write path is described in 
IMPALA-6542

The proposed solution is very close to what is described on the [parquet-cpp 
mailing 
list|https://lists.apache.org/thread.html/2f9afec11d6dc11d0d9613a3bfb64c0b32dad8ebfdc30fa4252a8ec1@%3Cdev.parquet.apache.org%3E],
 ie. basically ignore NaNs. There is only a small difference between the two 
proposals for the case when all the values are NaN. I don't have a strong 
opinion on this very special edge case, but I think the parquet-cpp and Impala 
behavior should be aligned.

> NaN values may lead to incorrect filtering under certain circumstances
> --
>
> Key: PARQUET-1225
> URL: https://issues.apache.org/jira/browse/PARQUET-1225
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-cpp
>Reporter: Zoltan Ivanfi
>Assignee: Deepak Majeti
>Priority: Major
>
> _This JIRA describes a generic problem with floating point comparisons that 
> *most probably* affects parquet-cpp. It is known to affect Impala and by 
> taking a quick look at the parquet-cpp code it seems to affect parquet-cpp as 
> well, but it has not yet been confirmed in practice._
> For comparing float and double values for min/max stats, parquet-cpp uses the 
> C++ less-than operator (<) that returns false for comparisons involving a 
> NaN. This means that while garthering statistics, if a NaN is the smallest 
> value encountered so far (which happens to be the case after reading the 
> first value if that value is NaN), no other value can ever replace it, since 
> < will always be false. On the other hand, if NaN is not the first value, it 
> won't affect the min value. So the min value depends on the order of elements.
> If looking for specific values while reading back the data, the NaN value may 
> lead to row groups being incorrectly discarded in spite of having matching 
> rows. For details, please see the Impala bug IMPALA-6527.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1225) NaN values may lead to incorrect filtering under certain circumstances

2018-02-20 Thread Deepak Majeti (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16370602#comment-16370602
 ] 

Deepak Majeti commented on PARQUET-1225:


Is Impala not handling the write path?

> NaN values may lead to incorrect filtering under certain circumstances
> --
>
> Key: PARQUET-1225
> URL: https://issues.apache.org/jira/browse/PARQUET-1225
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-cpp
>Reporter: Zoltan Ivanfi
>Assignee: Deepak Majeti
>Priority: Major
>
> _This JIRA describes a generic problem with floating point comparisons that 
> *most probably* affects parquet-cpp. It is known to affect Impala and by 
> taking a quick look at the parquet-cpp code it seems to affect parquet-cpp as 
> well, but it has not yet been confirmed in practice._
> For comparing float and double values for min/max stats, parquet-cpp uses the 
> C++ less-than operator (<) that returns false for comparisons involving a 
> NaN. This means that while garthering statistics, if a NaN is the smallest 
> value encountered so far (which happens to be the case after reading the 
> first value if that value is NaN), no other value can ever replace it, since 
> < will always be false. On the other hand, if NaN is not the first value, it 
> won't affect the min value. So the min value depends on the order of elements.
> If looking for specific values while reading back the data, the NaN value may 
> lead to row groups being incorrectly discarded in spite of having matching 
> rows. For details, please see the Impala bug IMPALA-6527.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1225) NaN values may lead to incorrect filtering under certain circumstances

2018-02-20 Thread Zoltan Ivanfi (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16370546#comment-16370546
 ] 

Zoltan Ivanfi commented on PARQUET-1225:


Please note there already is a [review request for an Impala 
workaround|https://gerrit.cloudera.org/#/c/9358/]. I think it would beneficial 
to agree on a common approach in order to have Impala and parquet-cpp handle 
the problem consistently.

> NaN values may lead to incorrect filtering under certain circumstances
> --
>
> Key: PARQUET-1225
> URL: https://issues.apache.org/jira/browse/PARQUET-1225
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-cpp
>Reporter: Zoltan Ivanfi
>Assignee: Deepak Majeti
>Priority: Major
>
> _This JIRA describes a generic problem with floating point comparisons that 
> *most probably* affects parquet-cpp. It is known to affect Impala and by 
> taking a quick look at the parquet-cpp code it seems to affect parquet-cpp as 
> well, but it has not yet been confirmed in practice._
> For comparing float and double values for min/max stats, parquet-cpp uses the 
> C++ less-than operator (<) that returns false for comparisons involving a 
> NaN. This means that while garthering statistics, if a NaN is the smallest 
> value encountered so far (which happens to be the case after reading the 
> first value if that value is NaN), no other value can ever replace it, since 
> < will always be false. On the other hand, if NaN is not the first value, it 
> won't affect the min value. So the min value depends on the order of elements.
> If looking for specific values while reading back the data, the NaN value may 
> lead to row groups being incorrectly discarded in spite of having matching 
> rows. For details, please see the Impala bug IMPALA-6527.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)