Hi Sri,

It's exactly as Alexey says, although there are plans/ideas to improve ParquetIO in a way that would not require defining the schema manually.
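To make what Alexey describes below concrete, here is a minimal sketch of the approach: define an Avro schema by hand, convert each CSV line to a GenericRecord in a ParDo, and write with ParquetIO.sink(). The schema fields (node_name, metric_value) and the file paths are placeholders, not anything from your data. Note also that Avro field names may not contain hyphens, which is likely why Kite's CSVUtil rejected your hyphenated headers, so you will have to rename those columns when you build the schema:

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.AvroCoder;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class CsvToParquet {

  // Hand-written Avro schema; the field names here are hypothetical and
  // must match your CSV columns (with hyphens replaced, e.g. by underscores).
  private static final Schema SCHEMA = SchemaBuilder.record("CsvRow")
      .fields()
      .requiredString("node_name")
      .requiredString("metric_value")
      .endRecord();

  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    p.apply("ReadCsv", TextIO.read().from("/path/to/input/*.csv"))
     .apply("CsvToGenericRecord", ParDo.of(new DoFn<String, GenericRecord>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
          String line = c.element();
          // Crude header skip and naive split; a real pipeline should use a
          // proper CSV parser that handles quoting and escaping.
          if (line.startsWith("node_name")) {
            return;
          }
          String[] cols = line.split(",");
          GenericRecord record = new GenericData.Record(SCHEMA);
          record.put("node_name", cols[0]);
          record.put("metric_value", cols[1]);
          c.output(record);
        }
     }))
     // GenericRecord has no default coder, so set one from the schema.
     .setCoder(AvroCoder.of(GenericRecord.class, SCHEMA))
     .apply("WriteParquet", FileIO.<GenericRecord>write()
         .via(ParquetIO.sink(SCHEMA))
         .to("/path/to/output/")
         .withSuffix(".parquet"));

    p.run().waitUntilFinish();
  }
}

You will also need the beam-sdks-java-io-parquet artifact on your classpath for ParquetIO. Treat this as a rough sketch rather than a complete solution; as Alexey notes, it assumes the same schema for all your CSV files.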
Some Jiras that might be interesting on this topic but are not yet resolved (maybe you are willing to contribute?):

https://issues.apache.org/jira/browse/BEAM-4454
https://issues.apache.org/jira/browse/BEAM-4812
https://issues.apache.org/jira/browse/BEAM-6394

Thanks,
Łukasz

On Mon, 14 Jan 2019 at 19:16, Alexey Romanenko <aromanenko....@gmail.com> wrote:

> Hi Sri,
>
> AFAIK, you have to create a PCollection of GenericRecords and define
> your Avro schema manually to write your data into Parquet files.
> In this case, you will need to create a ParDo for this translation. Also,
> I expect that your schema is the same for all CSV files.
>
> A basic example of using the Parquet sink with the Java SDK can be found
> here [1].
>
> [1] https://git.io/fhcfV
>
> On 14 Jan 2019, at 02:00, Sridevi Nookala <snook...@parallelwireless.com>
> wrote:
>
> Hi,
>
> I have a bunch of CSV data files that I need to store in Parquet format. I
> looked at the basic documentation on ParquetIO, and ParquetIO.sink() can
> be used to achieve this.
> However, there is a dependency on the Avro schema.
> How do I infer/generate an Avro schema from CSV data?
> Does Beam have any API for this?
> I tried using the Kite SDK's CSVUtil / JsonUtil but had no luck generating
> an Avro schema.
> My CSV data files have headers in them, and quite a few of the header
> fields are hyphenated, which is not accepted by Kite's CSVUtil.
>
> I think it would be a redundant effort to convert the CSV documents to
> JSON documents.
> Any suggestions on how to infer an Avro schema from CSV data or a JSON
> schema would be helpful.
>
> Thanks,
> Sri