Re: [PR] ci(parquet/pqarrow): integration tests for reading shredded variants DO NOT MERGE YET [arrow-go]

2025-08-18 Thread via GitHub


zeroshade commented on PR #455:
URL: https://github.com/apache/arrow-go/pull/455#issuecomment-3197844878

   @alamb @lidavidm @wgtmac @emkornfield Can I get a review from any of you? 
Once the `parquet-testing` PR is merged, this can get merged and I'll update 
the docs for an arrow variant extension type.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] ci(parquet/pqarrow): integration tests for reading shredded variants DO NOT MERGE YET [arrow-go]

2025-08-16 Thread via GitHub


zeroshade commented on PR #455:
URL: https://github.com/apache/arrow-go/pull/455#issuecomment-3194009152

   @aihuaxu I've updated the files and this PR with the relevant changes. The 
files generated by Go are in https://github.com/apache/parquet-testing/pull/94
   
   Please take a look at both the files and the code and let me know what you 
think. Thanks!
   
   I'll wait until the relevant PRs for `parquet-testing` are merged before 
this can get merged.





Re: [PR] ci(parquet/pqarrow): integration tests for reading shredded variants DO NOT MERGE YET [arrow-go]

2025-08-15 Thread via GitHub


aihuaxu commented on PR #455:
URL: https://github.com/apache/arrow-go/pull/455#issuecomment-3192067160

   @zeroshade I'm clarifying the missing value field in 
https://github.com/apache/parquet-format/pull/512. Can you also take a look? 





Re: [PR] ci(parquet/pqarrow): integration tests for reading shredded variants DO NOT MERGE YET [arrow-go]

2025-08-14 Thread via GitHub


aihuaxu commented on PR #455:
URL: https://github.com/apache/arrow-go/pull/455#issuecomment-3190566954

   > @aihuaxu It appears all of the parquet files with Decimal values that you 
regenerated are inconsistent with the test cases. The values in the parquet 
files don't match the expected values in the variant (for example, containing 
`123456789` instead of `1234567890`, plus some other rounding differences). The 
original files had the correct values but were just missing the Variant 
logical type. Can you fix them, please?
   
   For decimal values, the changes are expected, to align with the spec: the 
original 1234567890 should be treated as decimal8 rather than decimal4, since 
only values with precision 1-9 are stored as decimal4.
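
   The precision rule described here can be sketched in Go. `smallestDecimalType` 
is a hypothetical helper (not part of arrow-go), following the Variant spec's 
decimal ranges: decimal4 for precision 1-9, decimal8 for 10-18, decimal16 for 
19-38.

```go
package main

import "fmt"

// smallestDecimalType returns the narrowest Variant decimal encoding that
// can hold a value of the given decimal precision, per the ranges in the
// Variant spec: decimal4 covers precision 1-9, decimal8 covers 10-18, and
// decimal16 covers 19-38.
func smallestDecimalType(precision int) (string, error) {
	switch {
	case precision >= 1 && precision <= 9:
		return "decimal4", nil
	case precision <= 18:
		return "decimal8", nil
	case precision <= 38:
		return "decimal16", nil
	default:
		return "", fmt.Errorf("unsupported decimal precision %d", precision)
	}
}

func main() {
	// 1234567890 happens to fit in an int32, but it has 10 significant
	// digits, so by precision it must be stored as decimal8.
	t, _ := smallestDecimalType(10)
	fmt.Println(t) // decimal8
}
```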
   
   
   > Also some cases haven't been given the new notes/descriptions, such as the 
cases which test for a missing value column (the spec wording still does not 
allow for the value column to be omitted)
   
   I didn't update those test cases, since the value column can be omitted to 
indicate the value is null (I would like to clarify this in the spec: 
https://github.com/apache/parquet-format/pull/512/).
   
   Please take another look. I consolidated the files into one place. Sorry 
that I didn't notice that. 





Re: [PR] ci(parquet/pqarrow): integration tests for reading shredded variants DO NOT MERGE YET [arrow-go]

2025-08-14 Thread via GitHub


zeroshade commented on PR #455:
URL: https://github.com/apache/arrow-go/pull/455#issuecomment-3190245256

   Also some cases haven't been given the new notes/descriptions, such as the 
cases which test for a missing value column (the spec wording still does not 
allow for the `value` column to be omitted)





Re: [PR] ci(parquet/pqarrow): integration tests for reading shredded variants DO NOT MERGE YET [arrow-go]

2025-08-14 Thread via GitHub


zeroshade commented on PR #455:
URL: https://github.com/apache/arrow-go/pull/455#issuecomment-3190240955

   @aihuaxu It appears all of the parquet files with Decimal values that you 
regenerated are inconsistent with the test cases. The values in the parquet 
files don't match the expected values in the variant (for example, containing 
`123456789` instead of `1234567890`, plus some other rounding differences). The 
original files had the correct values but were just missing the Variant 
logical type. Can you fix them, please? 





Re: [PR] ci(parquet/pqarrow): integration tests for reading shredded variants DO NOT MERGE YET [arrow-go]

2025-08-14 Thread via GitHub


aihuaxu commented on PR #455:
URL: https://github.com/apache/arrow-go/pull/455#issuecomment-3189856354

   I'm updating the test description for the ones mentioned in 
https://github.com/apache/parquet-testing/pull/91. @alamb and @zeroshade  Can 
you take a look? 





Re: [PR] ci(parquet/pqarrow): integration tests for reading shredded variants DO NOT MERGE YET [arrow-go]

2025-08-13 Thread via GitHub


aihuaxu commented on PR #455:
URL: https://github.com/apache/arrow-go/pull/455#issuecomment-3185758715

   > > There is an ask to create the test files from the Go implementation and 
then test them out from the Parquet-Java side. Can you help generate the same 
set of test files?
   > 
   > I can do so, except for the constructions which aren't valid according to 
the spec as the Go implementation doesn't allow for it.
   > 
   > This includes the confusion over required vs optional vs present etc. at a 
minimum, possibly taking @emkornfield's suggestion 
[here](https://github.com/apache/arrow-go/pull/455#issuecomment-3140838383) to 
at least add a glossary of terms or otherwise simply clarify the language based 
on the intent.
   > 
   > Is there a script that was used to generate the test cases? Or do I need 
to parse the values in the JSON to figure out what should go in the files?
   > 
   > In addition, where should I put the files?
   
   We generated those test files from Iceberg unit tests 
(https://github.com/apache/iceberg/pull/13654), not from a script. What I can 
suggest is to expand this PR to read those files in its tests and then rewrite 
them out through the Go Variant writer logic. We can put them in the 
https://github.com/apache/parquet-testing repository. If we can use the same 
layout as https://github.com/apache/parquet-testing/pull/90, that would make 
testing from Parquet-Java much easier. 





Re: [PR] ci(parquet/pqarrow): integration tests for reading shredded variants DO NOT MERGE YET [arrow-go]

2025-08-13 Thread via GitHub


zeroshade commented on PR #455:
URL: https://github.com/apache/arrow-go/pull/455#issuecomment-3185614600

   > There is an ask to create the test files from the Go implementation and then 
test them out from the Parquet-Java side. Can you help generate the same set of 
test files?
   
   I can do so, except for the constructions which aren't valid according to 
the spec as the Go implementation doesn't allow for it. 
   
   This includes the confusion over required vs optional vs present etc. at a 
minimum, possibly taking @emkornfield's suggestion 
[here](https://github.com/apache/arrow-go/pull/455#issuecomment-3140838383) to 
at least add a glossary of terms or otherwise simply clarify the language based 
on the intent. 
   
   Is there a script that was used to generate the test cases? Or do I need to 
parse the values in the JSON to figure out what should go in the files?
   
   In addition, where should I put the files?





Re: [PR] ci(parquet/pqarrow): integration tests for reading shredded variants DO NOT MERGE YET [arrow-go]

2025-08-13 Thread via GitHub


zeroshade commented on PR #455:
URL: https://github.com/apache/arrow-go/pull/455#issuecomment-3185587736

   @alamb I'm fine with that too, I just wanted to get some confirmation one 
way or the other. 
   
   If @rdblue or @aihuaxu are willing to update the test cases (either by 
removing or using your suggestion to flag tests that are invalid 
constructions), then that's fine with me. Mostly I want to just get this 
figured out, and we shouldn't have integration tests which are testing behavior 
that doesn't exist in the spec. 





Re: [PR] ci(parquet/pqarrow): integration tests for reading shredded variants DO NOT MERGE YET [arrow-go]

2025-08-13 Thread via GitHub


aihuaxu commented on PR #455:
URL: https://github.com/apache/arrow-go/pull/455#issuecomment-3185581973

   > > @aihuaxu any luck / possibility on getting updates to the parquet 
variant spec per my and @lidavidm's comments?
   > 
   > I think at this point, where we are trying to finalize the Variant spec, the 
better path would be to update the test cases rather than re-open a discussion 
that was already had at length during the initial spec design
   > 
   > Just my $0.02
   
   That's also my understanding: those points were discussed during the design 
phase. We can update the test cases to indicate that engines can have different 
behavior. 
   
   BTW: @zeroshade There is an ask to create the test files from the Go 
implementation and then test them out from the Parquet-Java side. Can you help 
generate the same set of test files? 





Re: [PR] ci(parquet/pqarrow): integration tests for reading shredded variants DO NOT MERGE YET [arrow-go]

2025-08-13 Thread via GitHub


alamb commented on PR #455:
URL: https://github.com/apache/arrow-go/pull/455#issuecomment-3185559645

   > @aihuaxu any luck / possibility on getting updates to the parquet variant 
spec per my and @lidavidm's comments?
   
   I think at this point, where we are trying to finalize the Variant spec, the 
better path would be to update the test cases rather than re-open a discussion 
that was already had at length during the initial spec design 
   
   Just my $0.02





Re: [PR] ci(parquet/pqarrow): integration tests for reading shredded variants DO NOT MERGE YET [arrow-go]

2025-08-13 Thread via GitHub


zeroshade commented on PR #455:
URL: https://github.com/apache/arrow-go/pull/455#issuecomment-3184277268

   @aihuaxu any luck / possibility on getting updates to the parquet variant 
spec per my and @lidavidm's comments?





Re: [PR] ci(parquet/pqarrow): integration tests for reading shredded variants DO NOT MERGE YET [arrow-go]

2025-08-06 Thread via GitHub


lidavidm commented on PR #455:
URL: https://github.com/apache/arrow-go/pull/455#issuecomment-3161939276

   Just chiming in as an arrow-go (but not parquet) contributor: I find the 
idea of test cases that are invalid and yet presented as real test cases rather 
unsettling; it feels like implementation- or vendor-specific behavior snuck 
its way into what is supposed to be the official spec.
   
   > We definitely want read behavior to be predictable and have consistent 
behavior. If the desired consistent behavior isn't actually written into the 
spec, how would one know what that behavior should be?
   
   I think this is the crux of the issue: if Parquet does want to allow 
slightly out-of-spec files to be read, that needs to be clearly defined in the 
spec.





Re: [PR] ci(parquet/pqarrow): integration tests for reading shredded variants DO NOT MERGE YET [arrow-go]

2025-08-06 Thread via GitHub


zeroshade commented on PR #455:
URL: https://github.com/apache/arrow-go/pull/455#issuecomment-3161096586

   @rdblue @aihuaxu @emkornfield @wgtmac @cashmand 
   
   I've been thinking about this a bit more over the past few days and had the 
following thoughts. Please let me know what you all think.
   
   > The implementation I generated these cases from is defensive and tries to 
read if it can rather than producing errors. I'd recommend doing the same thing 
to handle outside-of-spec cases.
   
   The problem I see here is multi-faceted. While I understand the idea and 
reasons for doing this, allowing outside-of-spec things on read but not on 
write causes a few problems:
   
   * It explicitly creates scenarios that *cannot* be round-tripped. You can 
read things which don't conform to the spec, sure, but then you have to write a 
spec-conforming result. 
   * There's no incentive for bad writers to fix their issues. If the 
convention becomes that these invalid scenarios are allowed to be read, despite 
the spec not actually saying it should be allowed, then any writers which 
currently write invalid, non-spec-conforming files have no incentive to fix 
their issues. 
   * It significantly increases the complexity of implementing the spec. It 
puts the onus on implementors to deduce that they have to allow these cases 
despite the lack of language in the spec, not to mention the actual difficulty 
of any Arrow integration with Parquet to have to either allow these invalid 
constructions or perform conversions and casts to handle the required/optional 
differences and whether or not specific fields exist.
   
   > For instance, most of the time if a column is missing, most 
implementations will allow you to project a column of nulls. Extending this 
idea to Variant, it's reasonable to assume that a missing value column 
indicates a column of nulls and read accordingly instead of failing. The other 
cases are similar.
   
   This is true for compute engines, but not necessarily true for regular 
Parquet implementations. Generally, the idea of "projection" is an 
*engine-level* concept. Most implementations of parquet will simply error if 
you try to read a column that doesn't exist, letting the level above the file 
interactions decide upon projection.
   
   > The behavior in these cases was debated when we were working on the spec. 
We ultimately decided to disallow writers from producing them, but I think it 
is a best practice to ensure that the read behavior is predictable, accepts 
even slightly malformed cases, and has consistent behavior depending on the 
projected columns.
   
   I agree with this sentiment. We definitely want read behavior to be 
predictable and have consistent behavior. If the desired consistent behavior 
isn't actually written into the spec, how would one know what that behavior 
should be?
   
   > My preference would be to relax the spec for this issue. It doesn't seem 
like there's much benefit to enforcing it on the read side, and it's easy to 
imagine a writer failing to enforce it in some cases where it usually adds a 
typed_value to the schema, but not always.
   
   This touches on my ultimate point, honestly. My conclusion is that one of 
two things should happen:
   
   1. The spec needs to be updated to *explicitly* state what the behavior 
should be in all of these test cases I have identified, with the wording 
updated accordingly (fields being `present` vs `required`, expressly stating 
whether certain fields *must* be `required`, *must* be `optional`, or are 
allowed to be either, and so on).
   2. These test cases that fall outside the scope of the spec should not exist 
in `parquet-testing` as integration tests. The integration tests should only 
test what the spec *actually* states, not conventions of a particular 
implementation. If some behavior is expected to be consistent, then it should 
be written in the spec explicitly so that there's no confusion or ambiguity 
what the behavior is expected to be.





Re: [PR] ci(parquet/pqarrow): integration tests for reading shredded variants DO NOT MERGE YET [arrow-go]

2025-08-06 Thread via GitHub


alamb commented on PR #455:
URL: https://github.com/apache/arrow-go/pull/455#issuecomment-3161092757

   > > The behavior in these cases was debated when we were working on the 
spec. We ultimately decided to disallow writers from producing them, but I 
think it is a best practice to ensure that the read behavior is predictable, 
accepts even slightly malformed cases, and has consistent behavior depending on 
the projected columns.
   > 
   > @rdblue I think it would be worthwhile to add wording to the spec which 
defines the consistent behavior desired for readers in these malformed cases 
rather than it becoming just a "de facto" standard based on convention. The 
wording can specify that while it's invalid for a writer to produce these that 
in those cases the spec defines what the reader behavior _should_ be.
   
   Here is a proposal for how to make it clearer in the tests that this is an 
invalid file:
   - https://github.com/apache/parquet-testing/pull/90#discussion_r2257912290





Re: [PR] ci(parquet/pqarrow): integration tests for reading shredded variants DO NOT MERGE YET [arrow-go]

2025-08-05 Thread via GitHub


zeroshade commented on PR #455:
URL: https://github.com/apache/arrow-go/pull/455#issuecomment-3155637604

   > The behavior in these cases was debated when we were working on the spec. 
We ultimately decided to disallow writers from producing them, but I think it 
is a best practice to ensure that the read behavior is predictable, accepts 
even slightly malformed cases, and has consistent behavior depending on the 
projected columns.
   
   @rdblue I think it would be worthwhile to add wording to the spec which 
defines the consistent behavior desired for readers in these malformed cases 
rather than it becoming just a "de facto" standard based on convention. The 
wording can specify that while it's invalid for a writer to produce these that 
in those cases the spec defines what the reader behavior *should* be.





Re: [PR] ci(parquet/pqarrow): integration tests for reading shredded variants DO NOT MERGE YET [arrow-go]

2025-08-04 Thread via GitHub


rdblue commented on PR #455:
URL: https://github.com/apache/arrow-go/pull/455#issuecomment-3152748447

   @zeroshade, sorry for the confusion here. You're right about a lot of those 
test cases. They are not allowed by the spec. The implementation I generated 
these cases from is defensive and tries to read if it can rather than producing 
errors. I'd recommend doing the same thing to handle outside-of-spec cases.
   
   For instance, most of the time if a column is missing, most implementations 
will allow you to project a column of nulls. Extending this idea to Variant, 
it's reasonable to assume that a missing `value` column indicates a column of 
nulls and read accordingly instead of failing. The other cases are similar.
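
   The null-projection convention described here could be modeled, at the 
engine level, roughly as below. This is a toy sketch, not a real Parquet 
reader, and `projectColumn` is a made-up name.

```go
package main

import "fmt"

// projectColumn models the engine-level convention of projecting a missing
// column as all nulls (here, nil pointers) instead of failing the read.
// A present column is returned with its values wrapped as non-nil pointers.
func projectColumn(file map[string][]int64, name string, numRows int) []*int64 {
	out := make([]*int64, numRows)
	vals, ok := file[name]
	if !ok {
		return out // column missing: project a column of nulls
	}
	for i := range vals {
		v := vals[i]
		out[i] = &v
	}
	return out
}

func main() {
	// File has a typed_value column but no value column.
	file := map[string][]int64{"typed_value": {1, 2, 3}}
	col := projectColumn(file, "value", 3)
	fmt.Println(col[0] == nil, col[1] == nil, col[2] == nil) // true true true
}
```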
   
   The most confusing one is where there is a field in the `value` of a struct 
that is also a shredded field. The rationale here was that the shredded value 
should always take precedence because the shredded value may be read without 
the rest of the struct (if you're projecting `extract_value(var, "$['b']", 
...)`) and that the behavior should not change based on the Parquet column 
projection.
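
   The precedence rule could be sketched like so; `fieldValue` and its 
simplified types are hypothetical stand-ins for the real shredded-read logic, 
not arrow-go's API.

```go
package main

import "fmt"

// fieldValue resolves a shredded object field per the precedence rule being
// discussed: if the field was shredded into typed_value, that copy wins,
// even when an out-of-spec writer also left the field inside the binary
// value. A nil typedValue with no entry in valueFields means the field is
// missing entirely.
func fieldValue(typedValue *int64, valueFields map[string]int64, name string) (int64, bool) {
	if typedValue != nil {
		return *typedValue, true // shredded copy always takes precedence
	}
	v, ok := valueFields[name]
	return v, ok
}

func main() {
	shredded := int64(42)
	// Out-of-spec file: field "b" appears both shredded and in value.
	dup := map[string]int64{"b": 7}
	v, _ := fieldValue(&shredded, dup, "b")
	fmt.Println(v) // 42: the shredded typed_value wins
}
```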
   
   The behavior in these cases was debated when we were working on the spec. We 
ultimately decided to disallow writers from producing them, but I think it is a 
best practice to ensure that the read behavior is predictable, accepts even 
slightly malformed cases, and has consistent behavior depending on the 
projected columns.





Re: [PR] ci(parquet/pqarrow): integration tests for reading shredded variants DO NOT MERGE YET [arrow-go]

2025-08-03 Thread via GitHub


aihuaxu commented on PR #455:
URL: https://github.com/apache/arrow-go/pull/455#issuecomment-3148781059

   I made a small change to the spec to clarify 
(https://github.com/apache/parquet-format/pull/512/). The rest I feel should be 
aligned with what the spec described. 





Re: [PR] ci(parquet/pqarrow): integration tests for reading shredded variants DO NOT MERGE YET [arrow-go]

2025-08-01 Thread via GitHub


aihuaxu commented on PR #455:
URL: https://github.com/apache/arrow-go/pull/455#issuecomment-3145043151

   > `Test case 84, testShreddedObjectWithOptionalFieldStructs tests the 
scenario where the shredded fields of an object are listed as optional in the 
schema, but the spec states that they must be required. Thus, the Go 
implementation errors on this test as the spec says this is an error. 
Clarification is needed on whether this is a valid test case.`
   > 
   > @rdblue It seems reasonable to me to error out here, to enforce that the 
field is required in the schema. Do you remember why we had such a positive 
test case?
   
   Another approach is to update the wording in the spec so the reader doesn't 
need to do such a check. 





Re: [PR] ci(parquet/pqarrow): integration tests for reading shredded variants DO NOT MERGE YET [arrow-go]

2025-07-31 Thread via GitHub


emkornfield commented on PR #455:
URL: https://github.com/apache/arrow-go/pull/455#issuecomment-3140838383

   > I think we can update the wording "When typed_value is omitted, value must 
be required." => "When typed_value is omitted, value must be present.". Both 
typed_value and value fields should always be optional, but we are saying it's 
required, and that is causing the confusion.
   
   I think the spec tried to be careful with wording, but there is a lot of 
semantic overlap between required/missing/optional.  Having a glossary for 
these terms and doing an audit of the spec to make sure they are used 
consistently would help.





Re: [PR] ci(parquet/pqarrow): integration tests for reading shredded variants DO NOT MERGE YET [arrow-go]

2025-07-31 Thread via GitHub


aihuaxu commented on PR #455:
URL: https://github.com/apache/arrow-go/pull/455#issuecomment-3140553874

   > > My preference would be to relax the spec for this issue. It doesn't seem 
like there's much benefit to enforcing it on the read side, and it's easy to 
imagine a writer failing to enforce it in some cases where it usually adds a 
`typed_value` to the schema, but not always.
   > 
   > +1
   > 
   > IIRC, these wordings are deliberate, to avoid complicating the reader-side 
implementation w.r.t. reading values and consuming column stats directly from 
`typed_value`.
   
   (Screenshot attached: 
https://github.com/user-attachments/assets/4baf1afb-6856-473d-ba1d-673d4edc9d1d)
   
   I think we can update the wording "When typed_value is omitted, value must 
be required." => "When typed_value is omitted, value must be present.". Both 
`typed_value` and `value` fields should always be optional, but we are saying 
it's required, and that is causing the confusion.





Re: [PR] ci(parquet/pqarrow): integration tests for reading shredded variants DO NOT MERGE YET [arrow-go]

2025-07-30 Thread via GitHub


wgtmac commented on PR #455:
URL: https://github.com/apache/arrow-go/pull/455#issuecomment-3138388899

   > My preference would be to relax the spec for this issue. It doesn't seem 
like there's much benefit to enforcing it on the read side, and it's easy to 
imagine a writer failing to enforce it in some cases where it usually adds a 
`typed_value` to the schema, but not always.
   
   +1
   
   IIRC, these wordings are deliberate, to avoid complicating the reader-side 
implementation w.r.t. reading values and consuming column stats directly from 
`typed_value`.





Re: [PR] ci(parquet/pqarrow): integration tests for reading shredded variants DO NOT MERGE YET [arrow-go]

2025-07-30 Thread via GitHub


cashmand commented on PR #455:
URL: https://github.com/apache/arrow-go/pull/455#issuecomment-3138348439

   > Correct, the spec states that if the typed_value field is omitted, then 
the value field must be required, so Go errors if it is optional when the 
typed_value field is omitted causing this test case to fail.
   
   My preference would be to relax the spec for this issue. It doesn't seem 
like there's much benefit to enforcing it on the read side, and it's easy to 
imagine a writer failing to enforce it in some cases where it usually adds a 
`typed_value` to the schema, but not always.





Re: [PR] ci(parquet/pqarrow): integration tests for reading shredded variants DO NOT MERGE YET [arrow-go]

2025-07-30 Thread via GitHub


aihuaxu commented on PR #455:
URL: https://github.com/apache/arrow-go/pull/455#issuecomment-3138264152

   `Test case 84, testShreddedObjectWithOptionalFieldStructs tests the 
scenario where the shredded fields of an object are listed as optional in the 
schema, but the spec states that they must be required. Thus, the Go 
implementation errors on this test as the spec says this is an error. 
Clarification is needed on whether this is a valid test case.`
   
   @rdblue It seems reasonable to me to error out here, to enforce that the 
field is required in the schema. Do you remember why we had such a positive 
test case?





Re: [PR] ci(parquet/pqarrow): integration tests for reading shredded variants DO NOT MERGE YET [arrow-go]

2025-07-30 Thread via GitHub


zeroshade commented on PR #455:
URL: https://github.com/apache/arrow-go/pull/455#issuecomment-3137832489

   > Both value and typed_value are optional per spec and value can be missing 
as I understand.
   
   While the spec states that `typed_value` may be omitted, it does not say the 
same about `value`. If the intent is that either can be omitted, the spec 
should be updated with that wording. 
   
   > `The value column of a partially shredded object must never contain fields 
represented by the Parquet columns in typed_value (shredded fields). Readers 
may always assume that data is written correctly and that shredded fields in 
typed_value are not present in value.` This test case is meant to show that the 
reader will read only from `typed_value` and ignore the copy in `value`. That 
means the reader is not responsible for validating the duplicate key and will 
read from `typed_value`.
   
   The section you quoted states that the partially shredded object *must 
never* contain the fields and that a reader *may assume* that shredded fields 
aren't present in the `value` field. It also states that the reason they must 
never be written that way is that it can result in inconsistent reader 
behavior. If the intent is for a reader to *always* read from *only* the 
`typed_value` field in the case of a conflict like this, then the language in 
the spec should be updated accordingly instead of the current "may" language. 
   
   > We will generate the schema first, which will have both `value` and 
`typed_value` optional. But when a value is to be shredded, the `value` column 
may be required. Do we fail in Go because the `value` schema is optional?
   
   Correct, the spec states that if the `typed_value` field is omitted, then 
the `value` field *must* be required, so Go errors if it is optional when the 
`typed_value` field is omitted, causing this test case to fail.
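   As a concrete sketch of that rule (my reading of the spec, not an example 
taken from it), a shredded field that omits `typed_value` would have to look 
like:
   
   ```
   required group email {
 required binary value;
   }
   ```
   
   When `typed_value` is present, both fields may instead be optional, which is 
the shape the test generator currently emits.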
   
   > This is the same as test case 43. My understanding is that if the writer 
writes wrong data, the reader may read only the `typed_value`.
   
   The spec says that's a *valid* thing to do, but it also says that this *must 
never happen* and doesn't definitively state what the behavior in this case 
should be. Only that it may be inconsistent. As I said above, if the intent is 
that the data in the `typed_value` field is given precedence, the spec should 
be updated to say that.
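   For what it's worth, the two readings being debated can be sketched in Go. 
`readShreddedField` is a hypothetical helper, not part of the actual arrow-go 
API: strict mode mirrors the current Go behavior of erroring on a conflict, 
while the non-strict branch mirrors the "typed_value takes precedence" reading.
   
   ```go
   package main
   
   import (
 "errors"
 "fmt"
   )
   
   // readShreddedField resolves a single object field from a partially
   // shredded variant. value/typedValue are nil when the corresponding
   // column is null for this row.
   func readShreddedField(value, typedValue *string, strict bool) (string, error) {
 switch {
 case typedValue != nil && value != nil:
 if strict {
 // Current Go behavior: the spec says this must never be written.
 return "", errors.New("field present in both value and typed_value")
 }
 // "typed_value takes precedence" reading of the spec.
 return *typedValue, nil
 case typedValue != nil:
 return *typedValue, nil
 case value != nil:
 return *value, nil
 }
 return "", errors.New("field missing from both value and typed_value")
   }
   
   func main() {
 v, tv := "from-value", "from-typed"
 out, _ := readShreddedField(&v, &tv, false)
 fmt.Println(out)
 if _, err := readShreddedField(&v, &tv, true); err != nil {
 fmt.Println("strict mode rejects the conflict:", err)
 }
   }
   ```
   
   Either behavior is defensible today; the point is that the spec's "may" 
language leaves the choice to the implementation.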





Re: [PR] ci(parquet/pqarrow): integration tests for reading shredded variants DO NOT MERGE YET [arrow-go]

2025-07-30 Thread via GitHub


aihuaxu commented on PR #455:
URL: https://github.com/apache/arrow-go/pull/455#issuecomment-3137794519

   > Several test cases test variations on situations where the value column is 
missing. Based on my reading of the 
[spec](https://github.com/apache/parquet-format/blob/master/VariantShredding.md)
 this seems to be an invalid scenario. The specific case is that the spec 
states the typed_value field may be omitted when not shredding elements as a 
specific type, but says nothing about allowing omission of the value field. 
Currently, the Go implementation will error if this field is missing as per my 
reading of the spec, meaning those test cases fail.
   
   Both `value` and `typed_value` are optional per the spec and `value` can be 
missing, as I understand it.
   [screenshot attachment](https://github.com/user-attachments/assets/32432ac1-edea-477b-81c0-967dd9304115)
   

   > Test case 43 testPartiallyShreddedObjectMissingFieldConflict seems to have 
a conflict between what is expected and what in the spec. The b field exists 
within the value field, while also being a shredded field, the test appears to 
assume the data in the value field would be ignored, but the 
[spec](https://github.com/apache/parquet-format/blob/master/VariantShredding.md#objects)
 says that value must never contain fields represented by the shredded fields. 
This needs clarification on the desired behavior and result.
   
   `The value column of a partially shredded object must never contain fields 
represented by the Parquet columns in typed_value (shredded fields). Readers 
may always assume that data is written correctly and that shredded fields in 
typed_value are not present in value.` This test case is meant to show that the 
reader will read only from `typed_value` and ignore the copy in `value`. That 
means the reader is not responsible for validating the duplicate key and will 
read from `typed_value`.
   
   
   > Test case 84, testShreddedObjectWithOptionalFieldStructs tests the 
scenario where the shredded fields of an object are listed as optional in the 
schema, but the spec states that they must be required. Thus, the Go 
implementation errors on this test as the spec says this is an error. 
Clarification is needed on if this is a valid test case.
   
   Erroring out here seems reasonable to me, to enforce that the field is 
required in the schema.
   
   ```
   required group email {
 optional binary value;
 optional binary typed_value (STRING);
   }
   ```
   
   > Test case 38 testShreddedObjectMissingTypedValue tests the case where the 
typed_value field is missing; this is allowed by the spec, except that the spec 
states that in this scenario the value field must be required. The test case 
uses optional in this scenario, causing the Go implementation to fail. 
Clarification is needed here.
   
   We will generate the schema first, which will have both `value` and 
`typed_value` optional. But when a value is to be shredded, the `value` column 
may be required. Do we fail in Go because the `value` schema is optional?
   
   
   > Test case 125, testPartiallyShreddedObjectFieldConflict again tests the 
case of a field existing in both the value and the shredded column which the 
spec states is invalid and will lead to inconsistent results. Thus it is not 
valid to have this test case assert a specific result according to the spec 
unless the spec is amended to state that the shredded field takes precedence in 
this case.
   
   This is the same as test case 43. My understanding is that if the writer 
writes wrong data, the reader may read only the `typed_value`.
   
   





Re: [PR] ci(parquet/pqarrow): integration tests for reading shredded variants DO NOT MERGE YET [arrow-go]

2025-07-30 Thread via GitHub


zeroshade commented on PR #455:
URL: https://github.com/apache/arrow-go/pull/455#issuecomment-3137763209

   I'm away for the rest of this week, but I can retest on Tuesday. The 
regeneration of the tests wouldn't fix the rest of the issues I listed, right?





Re: [PR] ci(parquet/pqarrow): integration tests for reading shredded variants DO NOT MERGE YET [arrow-go]

2025-07-30 Thread via GitHub


aihuaxu commented on PR #455:
URL: https://github.com/apache/arrow-go/pull/455#issuecomment-3137427446

   @zeroshade  I regenerated the files with Variant logical type 
(https://github.com/apache/parquet-testing/pull/91). Can you retest it? 





Re: [PR] ci(parquet/pqarrow): integration tests for reading shredded variants DO NOT MERGE YET [arrow-go]

2025-07-28 Thread via GitHub


zeroshade commented on PR #455:
URL: https://github.com/apache/arrow-go/pull/455#issuecomment-3129829660

   CC @aihuaxu @emkornfield @rdblue

