Hi Sri,

It's exactly as Alexey says, although there are plans/ideas to improve ParquetIO in a way that would not require defining the schema manually.
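To make what Alexey describes below concrete, here is a minimal sketch of the approach: define an Avro schema by hand, convert each CSV line to a GenericRecord in a ParDo, and write with ParquetIO.sink(). The schema fields (node_name, metric_value) and the file paths are placeholders, not anything from your data. Note also that Avro field names may not contain hyphens, which is likely why Kite's CSVUtil rejected your hyphenated headers, so you will have to rename those columns when you build the schema:

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.AvroCoder;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class CsvToParquet {

  // Hand-written Avro schema; the field names here are hypothetical and
  // must match your CSV columns (with hyphens replaced, e.g. by underscores).
  private static final Schema SCHEMA = SchemaBuilder.record("CsvRow")
      .fields()
      .requiredString("node_name")
      .requiredString("metric_value")
      .endRecord();

  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    p.apply("ReadCsv", TextIO.read().from("/path/to/input/*.csv"))
     .apply("CsvToGenericRecord", ParDo.of(new DoFn<String, GenericRecord>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
          String line = c.element();
          // Crude header skip and naive split; a real pipeline should use a
          // proper CSV parser that handles quoting and escaping.
          if (line.startsWith("node_name")) {
            return;
          }
          String[] cols = line.split(",");
          GenericRecord record = new GenericData.Record(SCHEMA);
          record.put("node_name", cols[0]);
          record.put("metric_value", cols[1]);
          c.output(record);
        }
     }))
     // GenericRecord has no default coder, so set one from the schema.
     .setCoder(AvroCoder.of(GenericRecord.class, SCHEMA))
     .apply("WriteParquet", FileIO.<GenericRecord>write()
         .via(ParquetIO.sink(SCHEMA))
         .to("/path/to/output/")
         .withSuffix(".parquet"));

    p.run().waitUntilFinish();
  }
}

You will also need the beam-sdks-java-io-parquet artifact on your classpath for ParquetIO. Treat this as a rough sketch rather than a complete solution; as Alexey notes, it assumes the same schema for all your CSV files.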
Some Jiras that might be interesting on this topic but are not yet resolved (maybe you are willing to contribute?):

https://issues.apache.org/jira/browse/BEAM-4454
https://issues.apache.org/jira/browse/BEAM-4812
https://issues.apache.org/jira/browse/BEAM-6394

Thanks,
Łukasz

On Mon, 14 Jan 2019 at 19:16, Alexey Romanenko <aromanenko....@gmail.com> wrote:

> Hi Sri,
>
> AFAIK, you have to create a PCollection of GenericRecords and define
> your Avro schema manually to write your data into Parquet files.
> In this case, you will need to create a ParDo for this translation. Also,
> I expect that your schema is the same for all CSV files.
>
> A basic example of using the Parquet sink with the Java SDK can be found
> here [1].
>
> [1] https://git.io/fhcfV
>
> On 14 Jan 2019, at 02:00, Sridevi Nookala <snook...@parallelwireless.com>
> wrote:
>
> Hi,
>
> I have a bunch of CSV data files that I need to store in Parquet format. I
> looked at the basic documentation on ParquetIO, and ParquetIO.sink() can
> be used to achieve this.
> However, there is a dependency on the Avro schema.
> How do I infer/generate an Avro schema from CSV data?
> Does Beam have any API for this?
> I tried using the Kite SDK's CSVUtil / JsonUtil but had no luck generating
> an Avro schema.
> My CSV data files have headers in them, and quite a few of the header
> fields are hyphenated, which is not accepted by Kite's CSVUtil.
>
> I think it would be a redundant effort to convert the CSV documents to
> JSON documents.
> Any suggestions on how to infer an Avro schema from CSV data or a JSON
> schema would be helpful.
>
> Thanks,
> Sri