[GitHub] [incubator-hudi] umehrot2 commented on issue #1406: [HUDI-713] Fix conversion of Spark array of struct type to Avro schema
umehrot2 commented on issue #1406: [HUDI-713] Fix conversion of Spark array of struct type to Avro schema
URL: https://github.com/apache/incubator-hudi/pull/1406#issuecomment-604137304

> Sorry, did not mean to hijack this fix. Just trying to understand how it'll break compatibility while we are here. All this schema namespace business is only before writing parquet files, right? Once you are able to write parquet, it should be readable by parquet-avro for merging (which has nothing to do with apache-spark-avro or databricks-spark-avro). What causes the breakage?

All I can think of is that, since the old namespace is stored under `parquet.avro.schema` in the actual parquet file, it might conflict with the new schema, which has a different namespace. @zhedoubushishi is looking into this.

One good thing is that at least it should not affect users using `FileBaseSchemaProvider` or `SchemaRegistryProvider` with `DeltaStreamer`, in which case, from what I see, we directly use the schema that the user has passed.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: us...@infra.apache.org

With regards, Apache Git Services
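The namespace conflict suspected in the comment above can be illustrated with a minimal sketch. It is not Hudi or parquet-avro code: it only applies the Avro naming rule (a named type's full name is its namespace plus its name, and nested types inherit the enclosing namespace) to two otherwise identical schemas. The schema contents, the namespaces `old.namespace` and `new.namespace`, and the helper names are all hypothetical. Whether a full-name mismatch actually fails during merging depends on the Avro version and on how the reader resolves schemas, so treat this as an illustration of the suspected cause, not a confirmed diagnosis.

```python
def avro_full_name(schema, enclosing_ns=""):
    """Full name of a named Avro type: dotted names are already full,
    otherwise the (explicit or inherited) namespace is prepended."""
    name = schema["name"]
    if "." in name:
        return name
    ns = schema.get("namespace", enclosing_ns)
    return f"{ns}.{name}" if ns else name


def record_full_names(schema, enclosing_ns=""):
    """Collect the full names of every record nested in a schema."""
    names = []
    if isinstance(schema, list):  # union: visit each branch
        for branch in schema:
            names += record_full_names(branch, enclosing_ns)
    elif isinstance(schema, dict):
        kind = schema.get("type")
        if kind == "record":
            full = avro_full_name(schema, enclosing_ns)
            names.append(full)
            # Nested named types inherit the namespace of the enclosing record.
            inner_ns = full.rpartition(".")[0]
            for field in schema["fields"]:
                names += record_full_names(field["type"], inner_ns)
        elif kind == "array":
            names += record_full_names(schema["items"], enclosing_ns)
        elif kind == "map":
            names += record_full_names(schema["values"], enclosing_ns)
        elif isinstance(kind, (dict, list)):
            names += record_full_names(kind, enclosing_ns)
    return names


def make_schema(namespace):
    """Hypothetical table schema: a record holding an array of structs."""
    return {
        "type": "record", "name": "trip", "namespace": namespace,
        "fields": [
            {"name": "id", "type": "string"},
            {"name": "stops", "type": {"type": "array", "items": {
                "type": "record", "name": "stop",
                "fields": [{"name": "city", "type": "string"}]}}},
        ],
    }


# Schema as stored in the old file vs. schema generated by the new writer.
old_names = record_full_names(make_schema("old.namespace"))
new_names = record_full_names(make_schema("new.namespace"))
mismatched = [(o, n) for o, n in zip(old_names, new_names) if o != n]
print(mismatched)
```

Both the top-level record and the struct nested inside the array end up with different full names, even though the field layout is identical.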
[GitHub] [incubator-hudi] umehrot2 commented on issue #1406: [HUDI-713] Fix conversion of Spark array of struct type to Avro schema
umehrot2 commented on issue #1406: [HUDI-713] Fix conversion of Spark array of struct type to Avro schema
URL: https://github.com/apache/incubator-hudi/pull/1406#issuecomment-602933007

> > So anyone who has written data using databricks-avro will face issues reading.
>
> By this you mean, reading for merging data (i.e. during ingestion/writing) or querying via Spark/Hive/Presto?

Yeah, I mean writing additional data using `spark-avro` on top of an old table written with `databricks-avro`. Querying should not be affected.
[GitHub] [incubator-hudi] umehrot2 commented on issue #1406: [HUDI-713] Fix conversion of Spark array of struct type to Avro schema
umehrot2 commented on issue #1406: [HUDI-713] Fix conversion of Spark array of struct type to Avro schema
URL: https://github.com/apache/incubator-hudi/pull/1406#issuecomment-602846762

> LGTM overall.
>
> @umehrot2 @zhedoubushishi generally speaking, this schema namespace mismatch: is this a backwards incompatible change? I.e. if people have written data using 0.5.1, could they use master/0.6.0 to read and write without pain?

@vinothchandar with 0.5.1 you currently cannot even write some of these complex data types, such as arrays of structs. So this is actually a fix, and it is not backwards incompatible with 0.5.1, since 0.5.1 uses `spark-avro`. However, it will be backwards incompatible with `databricks-avro`. So anyone who has written data using `databricks-avro` will face issues reading.
[GitHub] [incubator-hudi] umehrot2 commented on issue #1406: [HUDI-713] Fix conversion of Spark array of struct type to Avro schema
umehrot2 commented on issue #1406: [HUDI-713] Fix conversion of Spark array of struct type to Avro schema
URL: https://github.com/apache/incubator-hudi/pull/1406#issuecomment-599776958

> @umehrot2 are you interested in reviewing this? :)

For sure. Either way, I have to review it internally as well :)