[
https://issues.apache.org/jira/browse/PARQUET-2068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17382390#comment-17382390
]
Weston Pace commented on PARQUET-2068:
--
I'm not sure if this is a better fit for the Arrow
Weston Pace created PARQUET-2068:
Summary: [C++] [Parquet] Use arrow compute to determine min/max of
dictionaries (possibly other arrays?)
Key: PARQUET-2068
URL: https://issues.apache.org/jira/browse/PARQUET-2068
I agree this is non-intuitive based on field names but seems consistent
with the text noted below (15 values are present and only 11 are written).
It seems another way of defining the value for this field would be the number
of definition levels written that aren't less than the max definition level?
Hi Jorge,
Spark (like other JVM-based implementations) is most probably
using parquet-mr. parquet-mr counts null values independently of the
level in the structure. An additional twist here is we cannot store empty
lists but null lists (when the list itself is null) if it is optional.
(Branching from the previous discussion, as Micah pointed out another
interesting aspect)
Consider the list
[[0, 1], None, [2, None, 3], [4, 5, 6], [], [7, 8, 9], None, [10]]
for the schema
optional group column1 (LIST) {
repeated group list {
optional int32 element;
}
}
When looking
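
The level encoding for the list and schema above can be sketched in plain Python. This is only an illustration, not parquet-mr or Arrow code: the helper name `levels` and the counters are made up here, and the maximum definition level of 3 is an assumption derived from the schema (optional group, repeated group, optional element). Under those assumptions the sketch yields 15 level entries, of which 11 sit at the maximum definition level, which agrees with the "15 values are present and only 11 are written" figures quoted in this thread.

```python
# Hypothetical helper: derive (repetition_level, definition_level) pairs
# for an optional LIST column with optional int32 elements.
MAX_DEF = 3  # assumed: optional group (1) + repeated group (2) + optional element (3)


def levels(rows):
    pairs = []
    for row in rows:
        if row is None:
            # The list itself is null: nothing defined below the top level.
            pairs.append((0, 0))
        elif len(row) == 0:
            # The list is present but empty.
            pairs.append((0, 1))
        else:
            for i, item in enumerate(row):
                rep = 0 if i == 0 else 1  # 0 starts a new row, 1 continues it
                pairs.append((rep, 2 if item is None else MAX_DEF))
    return pairs


data = [[0, 1], None, [2, None, 3], [4, 5, 6], [], [7, 8, 9], None, [10]]
pairs = levels(data)

num_levels = len(pairs)                                  # level entries
num_written = sum(1 for _, d in pairs if d == MAX_DEF)   # non-null leaf values
null_elements = sum(1 for _, d in pairs if d == 2)       # null inside a list
empty_lists = sum(1 for _, d in pairs if d == 1)
null_lists = sum(1 for _, d in pairs if d == 0)

print(num_levels, num_written, null_elements, empty_lists, null_lists)
```

Whether a null count should include only `null_elements`, or also the entries for null and empty lists, is exactly the ambiguity being discussed; the sketch keeps the three counters separate for that reason.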
[
https://issues.apache.org/jira/browse/PARQUET-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381968#comment-17381968
]
Akshay Sundarraj commented on PARQUET-2065:
---
[~gszadovszky] Thanks for the reply.
I tried
[
https://issues.apache.org/jira/browse/PARQUET-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Akshay Sundarraj updated PARQUET-2065:
--
Attachment: sample.schema
> parquet-cli not working in release 1.12.0
>
[
https://issues.apache.org/jira/browse/PARQUET-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Akshay Sundarraj updated PARQUET-2065:
--
Attachment: sample.parquet
> parquet-cli not working in release 1.12.0
>
[
https://issues.apache.org/jira/browse/PARQUET-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381910#comment-17381910
]
Gabor Szadovszky commented on PARQUET-2065:
---
I've checked this with 1.11.0 and is
Note I moved the Arrow JIRA under parquet since I think this only affects
the core-parquet part of the implementation. I also created PARQUET-2067
to track the incorrect null counts (this might actually touch some arrow
code but I did this for consistency).
Thanks,
Micah
On Thu, Jul 15, 2021 at
Micah Kornfield created PARQUET-2067:
Summary: [C++] null_count and num_nulls incorrect for repeated
columns
Key: PARQUET-2067
URL: https://issues.apache.org/jira/browse/PARQUET-2067
Project:
[
https://issues.apache.org/jira/browse/PARQUET-2066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Micah Kornfield moved ARROW-13349 to PARQUET-2066:
--
Component/s: (was: Parquet)
(was:
Yeah I guess we only ever write 4 values for the example so even though the
wording is strange in num_values = 6 (which I don't think anyone is
debating it must be 2). Still a little confusing.
On Thu, Jul 15, 2021 at 11:43 PM Jorge Cardoso Leitão <
jorgecarlei...@gmail.com> wrote:
> Thanks,
Thanks, that was exactly what I was looking for.
I do think we could offer this or other examples in the spec to make it
clear what they represent (including the null count).
I filed ARROW-13349 to track the pyarrow discrepancy.
Best,
Jorge
On Thu, Jul 15, 2021 at 7:28 PM Micah Kornfield