Hi Michael,
Sorry about the inconvenience here; AvroWrapperCoder was indeed removed
recently from the Hadoop/HDFS IO.

I think the best approach would be to use HDFSFileSource; this is the only
approach I can recommend today.
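For concreteness, here's a rough (untested) sketch of what that read could
look like. I'm going from memory on the exact HDFSFileSource.from(...)
signature after the BEAM-1497 refactoring, so please double-check it against
the code; the path is a placeholder:

  import org.apache.avro.generic.GenericRecord;
  import org.apache.avro.mapred.AvroKey;
  import org.apache.avro.mapreduce.AvroKeyInputFormat;
  import org.apache.beam.sdk.Pipeline;
  import org.apache.beam.sdk.io.Read;
  import org.apache.beam.sdk.io.hdfs.HDFSFileSource;
  import org.apache.beam.sdk.options.PipelineOptionsFactory;
  import org.apache.beam.sdk.values.KV;
  import org.apache.beam.sdk.values.PCollection;
  import org.apache.hadoop.io.NullWritable;

  Pipeline p = Pipeline.create(PipelineOptionsFactory.create());

  // AvroKeyInputFormat produces (AvroKey<GenericRecord>, NullWritable)
  // pairs; the double cast is only there because AvroKey.class is raw.
  @SuppressWarnings("unchecked")
  Class<AvroKey<GenericRecord>> keyClass =
      (Class<AvroKey<GenericRecord>>) (Class<?>) AvroKey.class;

  PCollection<KV<AvroKey<GenericRecord>, NullWritable>> records =
      p.apply(Read.from(HDFSFileSource.from(
          "hdfs://namenode/path/to/records/*.avro",  // placeholder path
          AvroKeyInputFormat.class,
          keyClass,
          NullWritable.class)));

The refactored source may also let you map the raw KV pairs to an element
type with an ordinary coder, which is how it avoids needing
AvroWrapperCoder; worth checking before you settle on the element type.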

Going forward, we are working on being able to read Avro files via AvroIO,
regardless of which file system the files are stored on. So, you'd do
something like AvroIO.Read.from("hdfs://..."), just as you can today do
AvroIO.Read.from("gs://...").

Hope this helps!

Davor

On Tue, Feb 28, 2017 at 4:24 PM, Michael Luckey <[email protected]> wrote:

> Hi all,
>
> we are currently using Beam over Spark, reading and writing Avro files to
> HDFS.
>
> Until now we have used HDFSFileSource for reading and HadoopIO for writing,
> essentially reading and writing PCollection<AvroKey<GenericRecord>>.
>
> With the changes introduced by
> https://issues.apache.org/jira/browse/BEAM-1497 this no longer seems to be
> directly supported by Beam, as the required AvroWrapperCoder has been
> deleted.
>
> So, since we have to change our code anyway, we are wondering what the
> recommended approach would be for reading/writing Avro files from/to HDFS
> with Beam on Spark:
>
> - use the new implementation of HDFSFileSource/HDFSFileSink
> - use the Spark-provided HadoopIO (and probably reimplement AvroWrapperCoder
> ourselves?)
>
> What are the trade-offs here, possibly also considering already planned
> changes to the IOs? Do we gain anything by using the Spark HadoopIO, given
> that our underlying engine is currently Spark, or will it eventually be
> deprecated and exist only for ‘historical’ reasons?
>
> Any thoughts and advice here?
>
> Regards,
>
> michel
>
