Hi Chao,
Great to hear that you are pushing this forward.
Apologies for not forwarding the email thread earlier.
I will try to fix some of the milestone 1 issues, so that we can have the
read part complete.
Cheers,
Ivan
On Fri, 16 Feb 2018 at 5:33 PM, Chao Sun wrote:
>
[
https://issues.apache.org/jira/browse/PARQUET-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16366947#comment-16366947
]
Luke Higgins commented on PARQUET-1122:
---
similarly, I get that scan_contents returns 0 as number
[
https://issues.apache.org/jira/browse/PARQUET-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16366951#comment-16366951
]
Luke Higgins commented on PARQUET-1122:
---
Is there any other setting I can give our team that uses
Hi,
I don't think it would be worth keeping a separate NaN count, but we could
ignore them when calculating min/max stats regardless. However, NaN is not
the only value preventing total ordering. We could have a similar problem
with not finding +0.0 values because a -0.0 is written to the
[
https://issues.apache.org/jira/browse/PARQUET-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gabor Szadovszky updated PARQUET-1217:
--
Summary: Incorrect handling of missing values in Statistics (was: Missing
check for
I hope the common case is that data files do not contain these special
float values. As the simplest solution, how about writers refrain from
populating the stats if a special value is encountered?
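To make the suggestion concrete, here is a minimal sketch of a writer-side
helper that gives up on min/max as soon as it sees a special value. The
function name is hypothetical, and treating only NaN as "special" is an
assumption made for illustration; signed zeros would need their own rule.

```python
import math
from typing import Iterable, Optional, Tuple


def float_min_max_or_none(values: Iterable[float]) -> Optional[Tuple[float, float]]:
    """Return (min, max) for a column chunk, or None if stats should be omitted.

    Hypothetical helper, not taken from any Parquet implementation.
    """
    lo: Optional[float] = None
    hi: Optional[float] = None
    for v in values:
        # Assumption: NaN is the only "special" value handled here.
        if math.isnan(v):
            return None  # refrain from populating the stats at all
        if lo is None or v < lo:
            lo = v
        if hi is None or v > hi:
            hi = v
    if lo is None or hi is None:
        return None  # empty input: nothing to report
    return (lo, hi)
```

A writer following this sketch would simply leave the min/max fields unset
whenever the helper returns None.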
That fix does not preclude a more thorough solution in the future, but it
addresses the common
> We could have a similar problem
> with not finding +0.0 values because a -0.0 is written to the max_value
> field by some component that considers them the same.
My hope is that the filtering would behave sanely, since -0.0 == +0.0
under the real-number-inspired ordering, which is distinguished
There is an extensibility mechanism with the ColumnOrder union - I think
that was meant to avoid the need to add new stat fields?
Given that the bug was in the Parquet spec, we'll need to make a spec
change anyway, so we could add a new ColumnOrder - FloatingPointTotalOrder?
at the same time as
[
https://issues.apache.org/jira/browse/PARQUET-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gabor Szadovszky updated PARQUET-1217:
--
Description:
As per the parquet-format specs the min/max values in statistics are
[
https://issues.apache.org/jira/browse/PARQUET-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16367763#comment-16367763
]
Wes McKinney commented on PARQUET-1122:
---
I am unable to investigate in detail right now, someone
I don't see a major benefit to a temporary solution. The files are already
out there and we need to implement a fix on the read path regardless. If we
keep writing the stats there's at least some information contained in the
stats that readers can make use of, if they want to implement the
Yeah, I missed that. We set it per column, so all other types could keep
TypeDefinedOrder and floats could have something like NanAwareDoubleOrder.
On Fri, Feb 16, 2018 at 9:18 AM, Tim Armstrong
wrote:
> We wouldn't need to rev the whole TypeDefinedOrder thing right?
I think one idea behind the column order fields was that if a reader does
not recognize a value there, it needs to ignore the stats. If I remember
correctly, that was intended to allow us to add new orderings for
collations, but it also seems useful to address gaps in the spec or known
broken
On Fri, Feb 16, 2018 at 9:15 AM, Tim Armstrong
wrote:
> I don't see a major benefit to a temporary solution. The files are already
> out there and we need to implement a fix on the read path regardless. If we
> keep writing the stats there's at least some information
We wouldn't need to rev the whole TypeDefinedOrder thing right? Couldn't we
just define a special order for floats? Essentially it would be a tag for
writers to say "hey I know about this total order thing".
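A rough sketch of how such a tag could look from the reader side, assuming
the rule mentioned earlier in this thread that a reader ignores stats when it
does not recognize the column order. The set of known orders and the function
name are hypothetical; "FloatingPointTotalOrder" is only a name floated in
this discussion, not something defined in the spec.

```python
from typing import Optional, Tuple

# Orders this hypothetical (older) reader understands.
KNOWN_COLUMN_ORDERS = {"TypeDefinedOrder"}


def usable_stats(
    column_order: str, stats: Optional[Tuple[float, float]]
) -> Optional[Tuple[float, float]]:
    """Return the min/max pair only if the reader recognizes the column order."""
    if column_order not in KNOWN_COLUMN_ORDERS:
        # Unrecognized order: the stats cannot be interpreted, so drop them.
        return None
    return stats
```

With this shape, an old reader falls back to scanning the data when it meets
a new float order, while a newer reader would add that order to its known set.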
On Fri, Feb 16, 2018 at 9:14 AM, Lars Volker wrote:
> I think one
The reader still can't correctly interpret those stats without knowing
about the behaviour of that specific writer though, because it can't assume
the absence of NaNs unless it knows that they are reading a file written by
a writer that drops stats when it sees NaNs.
It *could* fix the behaviour
I would just like to mention that the fmax() / fmin() functions in C/C++
Math library follow the aforementioned IEEE 754-2008 min and max
specification:
http://en.cppreference.com/w/c/numeric/math/fmax
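For readers without a C compiler at hand, the NaN handling can be sketched in
Python as below. These are hypothetical stand-ins written for illustration,
not bindings to the C functions, and they do not try to order -0.0 and +0.0.

```python
import math


def fmax_like(a: float, b: float) -> float:
    """Larger of a and b; if exactly one argument is NaN, return the other."""
    if math.isnan(a):
        return b  # also covers the case where both are NaN
    if math.isnan(b):
        return a
    return a if a > b else b


def fmin_like(a: float, b: float) -> float:
    """Smaller of a and b; if exactly one argument is NaN, return the other."""
    if math.isnan(a):
        return b
    if math.isnan(b):
        return a
    return a if a < b else b
```

By contrast, Python's built-in max() depends on argument order when NaN is
involved, which is the kind of inconsistency under discussion here.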
I think this behavior is also the most intuitive and useful with regard to
statistics. If we want
On Fri, Feb 16, 2018 at 9:38 AM, Tim Armstrong
wrote:
> The reader still can't correctly interpret those stats without knowing
> about the behaviour of that specific writer though, because it can't assume
> the absence of NaNs unless it knows that they are reading a file
On Fri, Feb 16, 2018 at 9:44 AM, Zoltan Borok-Nagy
wrote:
> I would just like to mention that the fmax() / fmin() functions in C/C++
> Math library follow the aforementioned IEEE 754-2008 min and max
> specification:
> http://en.cppreference.com/w/c/numeric/math/fmax
>
>