[jira] [Commented] (PARQUET-2068) [C++] [Parquet] Use arrow compute to determine min/max of dictionaries (possibly other arrays?)

2021-07-16 Thread Weston Pace (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17382390#comment-17382390 ] Weston Pace commented on PARQUET-2068: -- I'm not sure if this is a better fit for the Arrow

[jira] [Created] (PARQUET-2068) [C++] [Parquet] Use arrow compute to determine min/max of dictionaries (possibly other arrays?)

2021-07-16 Thread Weston Pace (Jira)
Weston Pace created PARQUET-2068: Summary: [C++] [Parquet] Use arrow compute to determine min/max of dictionaries (possibly other arrays?) Key: PARQUET-2068 URL: https://issues.apache.org/jira/browse/PARQUET-2068

Re: statistics null count in nested types

2021-07-16 Thread Micah Kornfield
I agree this is non-intuitive based on field names but seems consistent with the text noted below (15 values are present and only 11 are written). It seems another way of defining the value for this field would be number of definition levels written that aren't less than the max definition level?

Re: statistics null count in nested types

2021-07-16 Thread Gabor Szadovszky
Hi Jorge, Spark (similarly to other JVM-based implementations) is most probably using parquet-mr. parquet-mr counts null values independently of the level in the structure. An additional twist here is we cannot store empty lists but null lists (when the list itself is null) if it is optional.

statistics null count in nested types

2021-07-16 Thread Jorge Cardoso Leitão
(Branching from the previous discussion, as Micah pointed out another interesting aspect) Consider the list [[0, 1], None, [2, None, 3], [4, 5, 6], [], [7, 8, 9], None, [10]] for the schema optional group column1 (LIST) { repeated group list { optional int32 element; } } When looking
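A minimal sketch of the counts discussed in this thread, assuming the usual Dremel-style definition levels for the schema above (0 for a null list, 1 for an empty list, 2 for a null element, 3 for a present element). The helper name `definition_levels` is hypothetical and is not taken from any Parquet implementation; it only illustrates why 15 level entries but only 11 values come out of the example list.

```python
# Maximum definition level for:
#   optional group column1 (LIST) { repeated group list { optional int32 element; } }
# optional column1 (+1), repeated list (+1), optional element (+1).
MAX_DEF = 3


def definition_levels(rows):
    """Return one definition level per entry (hypothetical helper)."""
    levels = []
    for row in rows:
        if row is None:
            levels.append(0)  # the list itself is null
        elif len(row) == 0:
            levels.append(1)  # the list is present but empty
        else:
            for element in row:
                # 2: slot exists but element is null; 3: element is present
                levels.append(2 if element is None else MAX_DEF)
    return levels


rows = [[0, 1], None, [2, None, 3], [4, 5, 6], [], [7, 8, 9], None, [10]]
levels = definition_levels(rows)

num_levels = len(levels)                                # entries with a level
num_written = sum(1 for d in levels if d == MAX_DEF)    # values actually stored
below_max = num_levels - num_written                    # entries below max level

print(num_levels, num_written, below_max)
```

Under these assumptions the four entries below the maximum level are the two null lists, the empty list, and the single null element. Which of them a writer should report as nulls is exactly the question the thread raises.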

[jira] [Commented] (PARQUET-2065) parquet-cli not working in release 1.12.0

2021-07-16 Thread Akshay Sundarraj (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381968#comment-17381968 ] Akshay Sundarraj commented on PARQUET-2065: --- [~gszadovszky] Thanks for the reply. I tried

[jira] [Updated] (PARQUET-2065) parquet-cli not working in release 1.12.0

2021-07-16 Thread Akshay Sundarraj (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akshay Sundarraj updated PARQUET-2065: -- Attachment: sample.schema > parquet-cli not working in release 1.12.0 >

[jira] [Updated] (PARQUET-2065) parquet-cli not working in release 1.12.0

2021-07-16 Thread Akshay Sundarraj (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akshay Sundarraj updated PARQUET-2065: -- Attachment: sample.parquet > parquet-cli not working in release 1.12.0 >

[jira] [Commented] (PARQUET-2065) parquet-cli not working in release 1.12.0

2021-07-16 Thread Gabor Szadovszky (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381910#comment-17381910 ] Gabor Szadovszky commented on PARQUET-2065: --- I've checked this with 1.11.0 and is

Re: num_values vs num_rows vs num_nulls

2021-07-16 Thread Micah Kornfield
Note I moved the Arrow JIRA under parquet since I think this only affects the core-parquet part of the implementation. I also created PARQUET-2067 to track the incorrect null counts (this might actually touch some arrow code but I did this for consistency). Thanks, Micah On Thu, Jul 15, 2021 at

[jira] [Created] (PARQUET-2067) [C++] null_count and num_nulls incorrect for repeated columns

2021-07-16 Thread Micah Kornfield (Jira)
Micah Kornfield created PARQUET-2067: Summary: [C++] null_count and num_nulls incorrect for repeated columns Key: PARQUET-2067 URL: https://issues.apache.org/jira/browse/PARQUET-2067 Project:

[jira] [Moved] (PARQUET-2066) [C++][Parquet] num_rows is incorrect for nested types

2021-07-16 Thread Micah Kornfield (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield moved ARROW-13349 to PARQUET-2066: -- Component/s: (was: Parquet) (was:

Re: num_values vs num_rows vs num_nulls

2021-07-16 Thread Micah Kornfield
Yeah I guess we only ever write 4 values for the example so even though the wording is strange in num_values = 6 (which I don't think anyone is debating it must be 2). Still a little confusing. On Thu, Jul 15, 2021 at 11:43 PM Jorge Cardoso Leitão < jorgecarlei...@gmail.com> wrote: > Thanks,

Re: num_values vs num_rows vs num_nulls

2021-07-16 Thread Jorge Cardoso Leitão
Thanks, that was exactly what I was looking for. I do think we could offer this or other examples in the spec to make it clear what they represent (including the null count). I filed ARROW-13349 to track the pyarrow discrepancy. Best, Jorge On Thu, Jul 15, 2021 at 7:28 PM Micah Kornfield