Re: Forward & Backwards Compatibility

2022-05-29 Thread Micah Kornfield
I'd be in favor of maybe adding a flag, false by default, that allows this type
of schema.  Another option is, if we can identify the writer of these files, to
make an exception specifically for those versions?

On Wed, May 18, 2022 at 4:02 PM William Butler wrote:

> >
> > Well, why is UNKNOWN used here? This seems like a bug in the producer:
>
> It is a bug in the producer (see the JIRA), but now there are a lot of these
> files out there in the wild. Adding an explicit workaround might be
> reasonable given their pervasiveness.
>
> On Mon, May 16, 2022 at 4:57 AM Antoine Pitrou  wrote:
>
> > On Thu, 12 May 2022 09:46:57 -0700
> > William Butler 
> > wrote:
> > >
> > > From the JIRA, the converted type looks something like
> > >
> > >   required group FeatureAmounts (MAP) {
> > >     repeated group map (MAP_KEY_VALUE) {
> > >       required binary key (STRING);
> > >       required binary value (STRING);
> > >     }
> > >   }
> > >
> > >
> > > but the logical type looks like
> > >
> > >   required group FeatureAmounts (MAP) {
> > >     repeated group map (UNKNOWN) {
> > >       required binary key (STRING);
> > >       required binary value (STRING);
> > >     }
> > >   }
> > >
> > > Parquet C++ does not like that the UNKNOWN/NullLogicalType is being used
> > > in the groups and rejects the schema with an exception.
> >
> > Well, why is UNKNOWN used here? This seems like a bug in the producer:
> > if MAP_KEY_VALUE does not have an equivalent logical type, then no
> > logical type annotation should be produced, instead of the "UNKNOWN"
> > logical type annotation which means that all values are null and the
> > "real" type of the data is therefore lost.
> >
> > (I understand that this is probably due to confusion from the misnaming
> > of the "UNKNOWN" logical type, which would have been more appropriately
> > named "ALWAYS_NULL" or similar)
> >
> > > The second example involves an INT64 column with a TIMESTAMP_MILLIS
> > > converted type but a String logical type. In this example parquet-mr
> > > falls back to the timestamp converted type, whereas Parquet C++ throws
> > > an exception.
> >
> > Well, I don't know why a String logical type should be accepted for
> > integer columns with a timestamp converted type.  The fact that
> > parquet-mr accepts it sounds like a bug in parquet-mr, IMO.
> >
> > Regards
> >
> > Antoine.
> >
> >
> >
>

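The writer-detection option suggested above can be sketched roughly as follows. This is a hypothetical illustration, not Parquet C++ or parquet-mr behavior: a reader could inspect the `created_by` string recorded in the Parquet file footer and relax schema validation only for writer versions known to emit the bogus annotation. The `created_by` format shown and the affected version cutoff are assumptions for the sake of the example.

```python
import re

def needs_unknown_workaround(created_by: str) -> bool:
    """Hypothetical check: does the footer's created_by string identify a
    writer version known to emit bogus UNKNOWN annotations on map groups?
    The 'parquet-mr < 1.12.0' cutoff here is invented for illustration."""
    m = re.match(r"parquet-mr version (\d+)\.(\d+)\.(\d+)", created_by or "")
    if not m:
        return False
    return tuple(int(g) for g in m.groups()) < (1, 12, 0)

print(needs_unknown_workaround("parquet-mr version 1.10.1 (build 123abc)"))  # True
print(needs_unknown_workaround("parquet-cpp-arrow version 9.0.0"))           # False
```

Gating on the writer string keeps strict validation as the default while still reading the known-bad files already in the wild.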

[jira] [Commented] (PARQUET-1711) [parquet-protobuf] stack overflow when work with well known json type

2022-05-29 Thread Micah Kornfield (Jira)


[ https://issues.apache.org/jira/browse/PARQUET-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17543672#comment-17543672 ]

Micah Kornfield commented on PARQUET-1711:
--

The way one could handle this is to allow users to specify a recursion limit
and expand the schema to that limit (the limit would need to be stored in
metadata).  For protos that end up deeper than that limit, you either error on
write, or have the leaf levels of recursion store serialized proto bytes.
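The depth-limited expansion described above can be illustrated in a few lines. The nested-dict schema representation below is a stand-in for illustration only, not the parquet-protobuf converter's actual types:

```python
def expand(schema, limit):
    """Expand a possibly self-referential schema to a fixed depth.
    Primitive leaves are kept as-is; once the recursion budget is spent,
    the remainder collapses to an opaque BYTES leaf that would hold the
    serialized proto bytes."""
    if isinstance(schema, str):
        return schema          # primitive field: keep as-is
    if limit == 0:
        return "BYTES"         # recursion budget exhausted
    return {name: expand(child, limit - 1) for name, child in schema.items()}

# A self-referential message, shaped like google.protobuf.Value/Struct:
value = {"null_value": "NULL", "string_value": "STRING"}
value["struct_value"] = value  # cycle: Value -> Struct -> Value

expanded = expand(value, 2)
print(expanded["struct_value"]["struct_value"])  # "BYTES"
```

Because the expansion bottoms out at the limit, the cyclic type no longer drives the converter into unbounded recursion.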

> [parquet-protobuf] stack overflow when work with well known json type
> ---------------------------------------------------------------------
>
>                 Key: PARQUET-1711
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1711
>             Project: Parquet
>          Issue Type: Bug
>    Affects Versions: 1.10.1
>            Reporter: Lawrence He
>            Priority: Major
>
> Writing the following protobuf message as a parquet file is not possible: 
> {code:java}
> syntax = "proto3";
> import "google/protobuf/struct.proto";
> package test;
> option java_outer_classname = "CustomMessage";
> message TestMessage {
>     map<string, google.protobuf.ListValue> data = 1;
> } {code}
> Protobuf introduced "well known json types" such as 
> [ListValue|https://developers.google.com/protocol-buffers/docs/reference/google.protobuf#listvalue]
>  to work around json schema conversion. 
> However, writing the above message traps the parquet writer in an infinite 
> loop due to the "general type" support in protobuf. The current implementation 
> keeps referencing the 6 possible value types defined in protobuf (null, bool, 
> number, string, struct, list) and enters an infinite loop when it reaches "struct".
> {code:java}
> java.lang.StackOverflowError
>   at java.base/java.util.Arrays$ArrayItr.<init>(Arrays.java:4418)
>   at java.base/java.util.Arrays$ArrayList.iterator(Arrays.java:4410)
>   at java.base/java.util.Collections$UnmodifiableCollection$1.<init>(Collections.java:1044)
>   at java.base/java.util.Collections$UnmodifiableCollection.iterator(Collections.java:1043)
>   at org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:64)
>   at org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>   at org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
>   at org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>   at org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
>   at org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>   ... (convertFields/addField frames repeat until the stack overflows)
> {code}
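The failure mode is easy to reproduce in miniature: any converter that walks a self-referential message type without a depth or cycle guard recurses without bound. A stand-alone sketch (plain Python, not the parquet-protobuf code itself):

```python
def convert_fields(fields):
    """Naive schema walk with no cycle/depth guard, mirroring the mutual
    recursion between convertFields and addField in the stack trace above."""
    out = []
    for name, child in fields.items():
        if isinstance(child, dict):
            out.append((name, convert_fields(child)))  # unbounded recursion
        else:
            out.append((name, child))
    return out

value_type = {"string_value": "STRING"}
value_type["struct_value"] = value_type  # Struct -> Value -> Struct cycle

try:
    convert_fields(value_type)
except RecursionError:
    print("stack overflow (RecursionError)")
```

In the JVM the equivalent walk dies with the StackOverflowError shown above; either a recursion limit or a visited-type check breaks the cycle.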



--
This message was sent by Atlassian Jira
(v8.20.7#820007)