Going forward, how do you plan on handling partitioning in HDFS, both for reading and for writing output?
I’m thinking about how to implement dynamic partitioning (I don’t know the partitions up front; I’d need to examine the input data before I know which partitions will be affected by the data).

Thanks

From: Davor Bonaci <[email protected]>
Sent: Monday, March 06, 2017 10:39 PM
To: [email protected]
Subject: Re: Recommendations on reading/writing avro to hdfs

Hi Michael,

Sorry about the inconvenience here; AvroWrapperCoder was indeed recently removed from the Hadoop/HDFS IO.

I think the best approach would be to use HDFSFileSource; this is the only approach I can recommend today.

Going forward, we are working on being able to read Avro files via AvroIO, regardless of which file system the files may be stored on. So, you'd do something like AvroIO.Read.from("hdfs://..."), just as you can today do AvroIO.Read.from("gs://...").

Hope this helps!

Davor

On Tue, Feb 28, 2017 at 4:24 PM, Michael Luckey <[email protected]> wrote:

Hi all,

we are currently using Beam over Spark, reading and writing Avro files to HDFS. Until now we have used HDFSFileSource for reading and HadoopIO for writing, essentially reading and writing PCollection<AvroKey<GenericRecord>>.

With the changes introduced by https://issues.apache.org/jira/browse/BEAM-1497, this no longer seems to be directly supported by Beam, as the required AvroWrapperCoder has been deleted. Since we have to change our code anyway, we are wondering what the recommended approach is to read/write Avro files from/to HDFS with Beam on Spark:

- use the new implementation of HDFSFileSource/HDFSFileSink
- use the Spark-provided HadoopIO (and probably reimplement AvroWrapperCoder ourselves?)

What are the trade-offs here, possibly also considering already planned changes to IO? Do we gain any advantage from using the Spark HadoopIO, given that our underlying engine is currently Spark, or will this eventually be deprecated and exist only for ‘historical’ reasons?

Any thoughts and advice here?

Regards,

michel
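The AvroIO-over-any-filesystem approach Davor describes did land in later Beam releases. A minimal sketch of reading and writing Avro on HDFS through AvroIO against the Beam 2.x API; the namenode host, the paths, and the toy schema are assumptions for illustration:

    import java.util.Collections;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.AvroIO;
    import org.apache.beam.sdk.io.hdfs.HadoopFileSystemOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.hadoop.conf.Configuration;

    public class AvroOverHdfs {
      public static void main(String[] args) {
        // Register HDFS with Beam's FileSystems layer so hdfs:// paths resolve.
        HadoopFileSystemOptions options =
            PipelineOptionsFactory.fromArgs(args).as(HadoopFileSystemOptions.class);
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical cluster
        options.setHdfsConfiguration(Collections.singletonList(conf));

        Pipeline p = Pipeline.create(options);

        // Toy schema for illustration; substitute your real record schema.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Rec\",\"fields\":"
                + "[{\"name\":\"id\",\"type\":\"string\"}]}");

        // Read and write Avro exactly as you would with gs:// or local paths.
        PCollection<GenericRecord> records =
            p.apply(AvroIO.readGenericRecords(schema).from("hdfs://namenode:8020/in/*.avro"));
        records.apply(AvroIO.writeGenericRecords(schema).to("hdfs://namenode:8020/out/part"));

        p.run().waitUntilFinish();
      }
    }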
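On the dynamic-partitioning question at the top of the thread: later Beam releases (2.2+) added FileIO.writeDynamic, which derives each element's destination from the element itself, so the set of output partitions does not need to be known before the data is examined. A sketch under that assumption; the record field "event_date" and the output path are hypothetical:

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.io.AvroIO;
    import org.apache.beam.sdk.io.FileIO;
    import org.apache.beam.sdk.values.PCollection;

    public class DynamicAvroWrite {
      static void writePartitioned(PCollection<GenericRecord> records, Schema schema) {
        records.apply(
            FileIO.<String, GenericRecord>writeDynamic()
                // The partition key is computed per element at run time,
                // so partitions need not be known up front.
                .by(r -> r.get("event_date").toString()) // hypothetical field
                .via(AvroIO.sink(schema))
                .to("hdfs://namenode:8020/out")
                .withDestinationCoder(StringUtf8Coder.of())
                .withNaming(date ->
                    FileIO.Write.defaultNaming("event_date=" + date + "/part", ".avro")));
      }
    }

On the read side, a glob such as hdfs://namenode:8020/out/event_date=*/part-*.avro picks up whatever partitions exist, since AvroIO filepatterns accept wildcards.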
