I think what you're asking should be doable but requires modifications in the ParquetIO code. It uses schema in 2 places: - read: to setCoder on the PCollection [1]. As long as there already is a way to set the coder in a different way that does not require the Avro Schema we're good to go there (at the time of developing ParquetIO I don't think there was). From the doc mentioned above, I suspect that SchemaCoder may be the best fit for that. - write: avro schema is used by AvroParquetWriter.builder() it explicitly requires the Avro schema. [2] I think we could accept the Beam's schema as long as there's a way to transform it to Avro Schema. I think it's doable but we would need (for example) to transform Beam's schema to json and then pass it to Avro's new Schema.Parser().parse() method to get Avro's schema for the builder.
[1] https://github.com/apache/beam/blob/641678d242fc76229329a2ac5e61072faf93aceb/sdks/java/io/parquet/src/main/java/org/apache/beam/sdk/io/parquet/ParquetIO.java#L195 [2] https://github.com/apache/beam/blob/641678d242fc76229329a2ac5e61072faf93aceb/sdks/java/io/parquet/src/main/java/org/apache/beam/sdk/io/parquet/ParquetIO.java#L285 Best regards, Łukasz wt., 17 lip 2018 o 09:52 Akanksha Sharma B <[email protected]> napisał(a): > Hi, > > > ParquetIO needs avro Schema(org.apache.avro.Schema) to read and write. > > Will it also be possible not to use any avro Schema at all or use Beams > Schema (org.apache.beam.sdk.schemas.Schema) > > Regards, > Akanksha > > ------------------------------ > *From:* Akanksha Sharma B > *Sent:* Thursday, July 12, 2018 1:13:14 PM > *To:* [email protected] > *Subject:* Re: Schema class in 2.5 ? > > > Hi, > > > Created https://issues.apache.org/jira/browse/BEAM-4771 > > > Regards, > > Akanksha > ------------------------------ > *From:* Alexey Romanenko <[email protected]> > *Sent:* Thursday, July 12, 2018 12:31:02 PM > *To:* [email protected] > *Subject:* Re: Schema class in 2.5 ? > > Good catch, Akanksha! > Yes, RowType was renamed to Schema a while ago and BeamSQL doc seems was > not updated. > Could you create a Jira issue for that? > > On 12 Jul 2018, at 11:10, Akanksha Sharma B < > [email protected]> wrote: > > Hi, > > As I see, in 2.5 BeamSQL had been changed to work with Schema. > The sample code provided in > https://beam.apache.org/documentation/dsls/sql/walkthrough/ does not > compile with Beam 2.5, and needs to be updated. > > Row > .withRowType(appType) > > The above mentioned line needs to be adapted to use schema. > Beam SQL: Walkthrough > <https://beam.apache.org/documentation/dsls/sql/walkthrough/> > beam.apache.org > Apache Beam is an open source, unified model and set of language-specific > SDKs for defining and executing data processing workflows, and also data > ingestion and integration flows, supporting Enterprise Integration Patterns > (EIPs) and Domain Specific Languages (DSLs). > Regards, > > Akanksha > ------------------------------ > *From:* Akanksha Sharma B > *Sent:* Wednesday, July 11, 2018 11:02:37 AM > *To:* [email protected] > *Subject:* Re: Schema class in 2.5 ? > > Thanks a lot!!! > ------------------------------ > *From:* Alexey Romanenko <[email protected]> > *Sent:* Wednesday, July 11, 2018 11:01:05 AM > *To:* [email protected] > *Subject:* Re: Schema class in 2.5 ? > > Hi Akanksha, > > I believe this design document can be helpful for you: > > https://docs.google.com/document/d/1tnG2DPHZYbsomvihIpXruUmQ12pHGK0QIvXS1FOTgRc > > On 11 Jul 2018, at 10:38, Akanksha Sharma B < > [email protected]> wrote: > > Hi, > > Can you please share some documentation about ongoing changes related to > Schema class. > I am looking to understand why is it being introduced and how can I use it. > I was looking for something like RDD in Beam, i.e. Beam understands schema > of data internally and thus can handle some conversions itself, e.g. to > SqlRow, ParquetFile etc. > > Regards, > Akanksha > > >
