Re: Plan for a Parquet new release and writing Parquet file with outputstream

2018-03-28 Thread Jean-Baptiste Onofré
Hi Ryan, sorry to have been quiet, but I was busy traveling recently :) Just a quick update about this one: - I asked a guy from my team to work with me on the Beam ParquetIO. We're also seeing several users expecting this new IO. - I will update my current PR to use Parquet SNAPSHOT and verify t

Re: Plan for a Parquet new release and writing Parquet file with outputstream

2018-02-14 Thread Jean-Baptiste Onofré
Great! In the meantime, I started a PoC directly around parquet-common to see if I can implement a BeamParquetReader and a BeamParquetWriter. I might also propose some PRs. I will continue on that tomorrow. Thanks again! Regards JB On 02/14/2018 08:04 PM, Ryan Blue wrote: > Additions

Re: Plan for a Parquet new release and writing Parquet file with outputstream

2018-02-14 Thread Ryan Blue
Additions to the builders are easy enough that we can get that in. There's a PR out there that needs to be fixed: https://github.com/apache/parquet-mr/pull/446 I've asked the author for just the builder changes. If we don't hear back, we can add another PR but I'd like to give the author some time

Re: Plan for a Parquet new release and writing Parquet file with outputstream

2018-02-13 Thread Jean-Baptiste Onofré
Hi Ryan, Thanks for the update. Ideally for Beam, it would be great to have the AvroParquetReader and AvroParquetWriter using the InputFile/OutputFile interfaces. It would allow me to directly leverage Beam FileIO. Do you have a rough date for the Parquet release that includes this? Thanks Regards JB

Re: Plan for a Parquet new release and writing Parquet file with outputstream

2018-02-13 Thread Ryan Blue
Jean-Baptiste, We're planning a release that will include the new OutputFile class, which I think you should be able to use. Is there anything you'd change to make this work more easily with Beam? rb On Tue, Feb 13, 2018 at 12:31 PM, Jean-Baptiste Onofré wrote: > Hi guys, > > I'm working on th

Plan for a Parquet new release and writing Parquet file with outputstream

2018-02-13 Thread Jean-Baptiste Onofré
Hi guys, I'm working on the Apache Beam ParquetIO: https://github.com/apache/beam/pull/1851 In Beam, thanks to FileIO, we support several filesystems (HDFS, S3, ...). While I was able to implement the read part using AvroParquetReader on top of Beam FileIO, I'm struggling with the write part. I