[jira] [Commented] (PARQUET-1222) Definition of float and double sort order is ambigious

2018-02-22 Thread Jim Apple (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373429#comment-16373429
 ] 

Jim Apple commented on PARQUET-1222:


I do not think the order proposed matches the IEEE 754 totalOrder. It would be 
good to match that order so as to avoid creating yet one more thing for users 
to track. Also, it is possible that the IEEE totalOrder is also faster in some 
hardware.

I haven't checked, but it is possible that IEEE-754 is actually a 
lexicographical compare of the bit representations in some endianess or another.

> Definition of float and double sort order is ambigious
> --
>
> Key: PARQUET-1222
> URL: https://issues.apache.org/jira/browse/PARQUET-1222
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Zoltan Ivanfi
>Priority: Critical
> Fix For: format-2.5.0
>
> Attachments: ordering.png
>
>
> Currently parquet-format specifies the sort order for floating point numbers 
> as follows:
> {code:java}
>*   FLOAT - signed comparison of the represented value
>*   DOUBLE - signed comparison of the represented value
> {code}
> The problem is that the comparison of floating point numbers is only a 
> partial ordering with strange behaviour in specific corner cases. For 
> example, according to IEEE 754, -0 is neither less nor more than \+0 and 
> comparing NaN to anything always returns false. This ordering is not suitable 
> for statistics. Additionally, the Java implementation already uses a 
> different (total) ordering that handles these cases correctly but differently 
> than the C\+\+ implementations, which leads to interoperability problems.
> TypeDefinedOrder for doubles and floats should be deprecated and a new 
> TotalFloatingPointOrder should be introduced. The default for writing doubles 
> and floats would be the new TotalFloatingPointOrder. The proposed ordering is 
> the following:
>  * -∞
>  * negative numbers in their natural order
>  * -0 and +0 in the same equivalence class \(!)
>  * positive numbers in their natural order
>  * +∞
>  * all NaN values, including the negative ones \(!), in the same equivalence 
> class \(!)
> This ordering should be effective and easy to implement in all programming 
> languages. A visual representation of the ordering of some example values:
> !ordering.png|width=640px!
> For reading existing stats created using TypeDefinedOrder, the following 
> compatibility rules should be applied:
> * When looking for NaN values, min and max should be ignored.
> * If the min is a NaN, it should be ignored.
> * If the max is a NaN, it should be ignored.
> * If the min is \+0, the row group may contain -0 values as well.
> * If the max is -0, the row group may contain \+0 values as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1225) NaN values may lead to incorrect filtering under certain circumstances

2018-02-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372978#comment-16372978
 ] 

ASF GitHub Bot commented on PARQUET-1225:
-

boroknagyz commented on issue #444: PARQUET-1225: NaN values may lead to 
incorrect filtering under certai…
URL: https://github.com/apache/parquet-cpp/pull/444#issuecomment-367726337
 
 
   Thanks for applying the changes! To me it looks good generally.
   The Update() and UpdateSpaced() functions share some common code parts (not 
necessarily introduced by this PR, e.g. the last if statements of the 
functions), maybe worth a little refactoring.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> NaN values may lead to incorrect filtering under certain circumstances
> --
>
> Key: PARQUET-1225
> URL: https://issues.apache.org/jira/browse/PARQUET-1225
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-cpp
>Reporter: Zoltan Ivanfi
>Assignee: Deepak Majeti
>Priority: Major
>
> _This JIRA describes a generic problem with floating point comparisons that 
> *most probably* affects parquet-cpp. It is known to affect Impala and by 
> taking a quick look at the parquet-cpp code it seems to affect parquet-cpp as 
> well, but it has not yet been confirmed in practice._
> For comparing float and double values for min/max stats, parquet-cpp uses the 
> C++ less-than operator (<) that returns false for comparisons involving a 
> NaN. This means that while garthering statistics, if a NaN is the smallest 
> value encountered so far (which happens to be the case after reading the 
> first value if that value is NaN), no other value can ever replace it, since 
> < will always be false. On the other hand, if NaN is not the first value, it 
> won't affect the min value. So the min value depends on the order of elements.
> If looking for specific values while reading back the data, the NaN value may 
> lead to row groups being incorrectly discarded in spite of having matching 
> rows. For details, please see the Impala bug IMPALA-6527.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [VOTE] Release Apache Parquet C++ 1.4.0 RC0

2018-02-22 Thread Zoltan Borok-Nagy
OK, thanks!

On Wed, Feb 21, 2018 at 7:33 PM, Deepak Majeti 
wrote:

> Yes! The min/max will be set to NaN in the case when all the values are
> NaN.
>
> On Wed, Feb 21, 2018 at 10:54 AM, Zoltan Borok-Nagy <
> borokna...@cloudera.com
> > wrote:
>
> > Deepak, just for clarification, does it mean that parquet-cpp will also
> > write statistics when all the values are NaN?
> >
> >
> > On Wed, Feb 21, 2018 at 1:16 PM, Deepak Majeti 
> > wrote:
> >
> > > I am okay with this proposed fix for Impala.
> > >
> > > On Tue, Feb 20, 2018 at 5:46 PM, Zoltan Borok-Nagy <
> > > borokna...@cloudera.com>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > I'm implementing the quick fix for Impala. The current proposal for
> the
> > > > write path fix is to behave like the fmax()/fmin() functions in
> math.h,
> > > ie.
> > > > ignore NaNs, except for the case when all the values are NaN.
> > > >
> > > > http://en.cppreference.com/w/c/numeric/math/fmax
> > > > https://issues.apache.org/jira/browse/IMPALA-6542
> > > >
> > > > But, it is also OK for me if you guys think that we should not write
> > > > statistics at all when all the values are Nan. I just think that the
> > > chosen
> > > > behavior should be identical.
> > > >
> > > > BR,
> > > > Zoltan-BN
> > > >
> > > >
> > > >
> > > > On Tue, Feb 20, 2018 at 5:57 PM, Uwe L. Korn 
> wrote:
> > > >
> > > > > Due to the issues raised, I will close this RC and once all patches
> > are
> > > > > merged, I will build a new one.
> > > > >
> > > > > Uwe
> > > > >
> > > > > On Tue, Feb 20, 2018, at 1:48 AM, Deepak Majeti wrote:
> > > > > > Wes, Zoltan,
> > > > > >
> > > > > > I am taking a look at the issue now. I will handle the patch for
> > this
> > > > > one.
> > > > > > Thanks!
> > > > > >
> > > > > > On Tue, Feb 20, 2018 at 12:54 AM, Wes McKinney <
> > wesmck...@gmail.com>
> > > > > wrote:
> > > > > > > hi Zoltan -- my quick read is that one appropriate fix in
> > > parquet-cpp
> > > > > > > would be to exclude NaN values from statistics calculations
> > (there
> > > is
> > > > > > > also the case that the whole row group is NaN for a column, in
> > > which
> > > > > > > case we should not write statistics perhaps?)? This might not
> > take
> > > > too
> > > > > > > long to fix in parquet-cpp, and we have some other patches up
> > that
> > > we
> > > > > > > could merge in as well.
> > > > > > >
> > > > > > > Deepak, Phillip, or Uwe do you have any time to look at this? I
> > can
> > > > > > > also make time to look
> > > > > > >
> > > > > > > Thanks
> > > > > > > Wes
> > > > > > >
> > > > > > > On Mon, Feb 19, 2018 at 9:39 AM, Zoltan Ivanfi <
> z...@cloudera.com>
> > > > > wrote:
> > > > > > >> Hi,
> > > > > > >>
> > > > > > >> I wonder whether the fix for PARQUET-1225
> > > > > > >>  should
> be
> > > > > included in
> > > > > > >> the next release, even if it causes a delay.
> > > > > > >>
> > > > > > >> Br,
> > > > > > >>
> > > > > > >> Zoltan
> > > > > > >>
> > > > > > >> On Sun, Feb 18, 2018 at 10:10 PM Uwe L. Korn <
> uw...@xhochy.com>
> > > > > wrote:
> > > > > > >>
> > > > > > >>> +1 (binding)
> > > > > > >>>
> > > > > > >>> verified on Ubuntu 16.04
> > > > > > >>> verified on macOS High Sierra but needed to set the following
> > env
> > > > > vars to
> > > > > > >>> get Thrift 0.11 building:
> > > > > > >>>
> > > > > > >>> export OPENSSL_ROOT_DIR=/usr/local/Cellar/openssl/1.0.2n
> > > > > > >>> export PATH="/usr/local/opt/bison/bin:$PATH"
> > > > > > >>>
> > > > > > >>> On Sun, Feb 18, 2018, at 10:09 PM, Uwe L. Korn wrote:
> > > > > > >>> > All,
> > > > > > >>> >
> > > > > > >>> > I propose that we accept the following release candidate as
> > the
> > > > > official
> > > > > > >>> > Apache Parquet C++ 1.4.0 release.
> > > > > > >>> >
> > > > > > >>> > Parquet C++ 1.4.0-rc0 includes the following:
> > > > > > >>> > ---
> > > > > > >>> > The CHANGELOG for the release is available at:
> > > > > > >>> >
> > > > > > >>> https://gitbox.apache.org/repos/asf?p=parquet-cpp.git=
> > > > > CHANGELOG=apache-parquet-cpp-1.4.0-rc0
> > > > > > >>> >
> > > > > > >>> > The tag used to create the release candidate is:
> > > > > > >>> >
> > > > > > >>> https://gitbox.apache.org/repos/asf?p=parquet-cpp.git;a=
> > > > > shortlog;h=refs/tags/apache-parquet-cpp-1.4.0-rc0
> > > > > > >>> >
> > > > > > >>> > The release candidate is available at:
> > > > > > >>> >
> > > > > > >>> https://dist.apache.org/repos/dist/dev/parquet/apache-
> > > > > parquet-cpp-1.4.0-rc0/apache-parquet-cpp-1.4.0.tar.gz
> > > > > > >>> >
> > > > > > >>> > The MD5 checksum of the release candidate can be found at:
> > > > > > >>> >
> > > > > > >>> https://dist.apache.org/repos/dist/dev/parquet/apache-
> > > > > parquet-cpp-1.4.0-rc0/apache-parquet-cpp-1.4.0.tar.gz.md5
> > > > > > >>> >
> > > > > > >>> > The signature of the release candidate can be found at:
> > > > > > >>> >
> > > > > > >>> 

[jira] [Commented] (PARQUET-1234) Release Parquet format 2.5.0

2018-02-22 Thread Gidon Gershinsky (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372744#comment-16372744
 ] 

Gidon Gershinsky commented on PARQUET-1234:
---

[~zi], I see, thanks. I will hope for a faster timeline :)

> Release Parquet format 2.5.0
> 
>
> Key: PARQUET-1234
> URL: https://issues.apache.org/jira/browse/PARQUET-1234
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-format
>Affects Versions: format-2.5.0
>Reporter: Gabor Szadovszky
>Priority: Major
> Fix For: format-2.5.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1234) Release Parquet format 2.5.0

2018-02-22 Thread Zoltan Ivanfi (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372729#comment-16372729
 ] 

Zoltan Ivanfi commented on PARQUET-1234:


[~gershinsky], some PRs can spend several weeks in the review phase. Of course, 
if it is accepted quickly, it will be included in the release.

> Release Parquet format 2.5.0
> 
>
> Key: PARQUET-1234
> URL: https://issues.apache.org/jira/browse/PARQUET-1234
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-format
>Affects Versions: format-2.5.0
>Reporter: Gabor Szadovszky
>Priority: Major
> Fix For: format-2.5.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1234) Release Parquet format 2.5.0

2018-02-22 Thread Gidon Gershinsky (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372690#comment-16372690
 ] 

Gidon Gershinsky commented on PARQUET-1234:
---

[~zi], the PR for PARQUET-1227 is in, and the code for PARQUET-1228 is ready 
(I'll send a PR for it after the first PR is reviewed). So technically, this 
feature can be included in parquet-format any time before the end of next week, 
similar to PARQUET-1222. But, as you said, this is up to the community.

> Release Parquet format 2.5.0
> 
>
> Key: PARQUET-1234
> URL: https://issues.apache.org/jira/browse/PARQUET-1234
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-format
>Affects Versions: format-2.5.0
>Reporter: Gabor Szadovszky
>Priority: Major
> Fix For: format-2.5.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (PARQUET-1234) Release Parquet format 2.5.0

2018-02-22 Thread Zoltan Ivanfi (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372577#comment-16372577
 ] 

Zoltan Ivanfi edited comment on PARQUET-1234 at 2/22/18 9:39 AM:
-

[~rdblue], I suggest we decide this on the Parquet sync next week.

I hope PARQUET-1222 will not take that long to resolve as I already have a 
proposal in the JIRA description and I would like to create a PR early next 
week. Then we can discuss it in the Parquet sync and if everyone agrees to the 
proposal, we may resolve that issue by end of next week.

If, on the other hand, the community will raise problems, concerns, or new 
alternatives, the fix will take more time, and we may consider an earlier 
release as well. However, even if we don't introduce a new ordering in the next 
release, we should still clarify the specification by explicitly calling out 
this ambiguity and the countermeasures that have to be taken to deal with the 
consequences (basically the bullet points at the end of the description of 
PARQUET-1222). And I think this is a significant change in itself that may 
deserve a minor version change.

So we do not necessarily have to block the release on a proper solution for 
PARQUET-1222 if we expect it to take a long time, but in my opinion we should 
at least address that issue partially by blocking on a new JIRA that is a 
subset of PARQUET-1222.


was (Author: zi):
[~rdblue], I suggest we decide this on the Parquet sync next week.

I hope PARQUET-1222 will not take that long to resolve as I already have a 
proposal in the JIRA description and I would like to create a PR early next 
week. Then we can discuss it in the Parquet sync and if everyone agrees to the 
proposal, we may resolve that issue by end of next week.

If, on the other hand, the community will raise problems, concerns, or new 
alternatives, the fix will take more time, and we may consider an earlier 
release as well. However, even if we don't introduce a new ordering in the next 
release, we should still clarify the specification by explicitly calling out 
this ambiguity and the countermeasures that have to be taken to deal with the 
consequences (basically the bullet points at the end of the description of 
PARQUET-1222). And I think this is a significant change in itself that may 
deserve a minor version change. 

> Release Parquet format 2.5.0
> 
>
> Key: PARQUET-1234
> URL: https://issues.apache.org/jira/browse/PARQUET-1234
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-format
>Affects Versions: format-2.5.0
>Reporter: Gabor Szadovszky
>Priority: Major
> Fix For: format-2.5.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (PARQUET-1234) Release Parquet format 2.5.0

2018-02-22 Thread Zoltan Ivanfi (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372577#comment-16372577
 ] 

Zoltan Ivanfi edited comment on PARQUET-1234 at 2/22/18 9:31 AM:
-

[~rdblue], I suggest we decide this on the Parquet sync next week.

I hope PARQUET-1222 will not take that long to resolve as I already have a 
proposal in the JIRA description and I would like to create a PR early next 
week. Then we can discuss it in the Parquet sync and if everyone agrees to the 
proposal, we may resolve that issue by end of next week.

If, on the other hand, the community will raise problems, concerns, or new 
alternatives, the fix will take more time, and we may consider an earlier 
release as well. However, even if we don't introduce a new ordering in the next 
release, we should still clarify the specification by explicitly calling out 
this ambiguity and the countermeasures that have to be taken to deal with the 
consequences (basically the bullet points at the end of the description of 
PARQUET-1222). And I think this is a significant change in itself that may 
deserve a minor version change. 


was (Author: zi):
[~rdblue], I suggest we decide this on the Parquet sync next week.

I hope PARQUET-1222 will not take that long to resolve as I already have a 
proposal in the JIRA description and I would like to create a PR early next 
week. Then we can discuss it in the Parquet sync and if everyone agrees to the 
proposal, we may resolve that issue by end of next week.

If, on the other hand, the community will raise problems, concerns, or new 
alternatives, the fix will take more time, and we may consider an earlier 
release as well. However, even if we don't introduce a new ordering in the next 
release, we should still clarify the specification by explicitly calling out 
this ambiguity and the countermeasures that have to be taken to deal with the 
consequences (basically the last bullet list in the description of 
PARQUET-1222). And I think this is a significant change in itself that may 
deserve a minor version change. 

> Release Parquet format 2.5.0
> 
>
> Key: PARQUET-1234
> URL: https://issues.apache.org/jira/browse/PARQUET-1234
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-format
>Affects Versions: format-2.5.0
>Reporter: Gabor Szadovszky
>Priority: Major
> Fix For: format-2.5.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)