Going forward, how do you plan on handling partitioning in HDFS, both for reading and for writing output?
I’m thinking about how to implement dynamic partitioning (I don’t know the partitions up front; I’d need to examine the input data before I know which partitions will be affected by the data).

Thanks

From: Davor Bonaci <[email protected]>
Sent: Monday, March 06, 2017 10:39 PM
To: [email protected]
Subject: Re: Recommendations on reading/writing avro to hdfs

Hi Michael,

Sorry about the inconvenience here; AvroWrapperCoder was indeed recently removed from the Hadoop/HDFS IO.

I think the best approach would be to use HDFSFileSource; this is the only approach I can recommend today.

Going forward, we are working on being able to read Avro files via AvroIO, regardless of which file system the files may be stored on. So, you'd do something like AvroIO.Read.from("hdfs://..."), just as you can today do AvroIO.Read.from("gs://...").

Hope this helps!

Davor

On Tue, Feb 28, 2017 at 4:24 PM, Michael Luckey <[email protected]> wrote:

Hi all,

we are currently using Beam over Spark, reading and writing Avro files to HDFS. Until now we have used HDFSFileSource for reading and HadoopIO for writing, essentially reading and writing PCollection<AvroKey<GenericRecord>>.

With the changes introduced by https://issues.apache.org/jira/browse/BEAM-1497, this no longer seems to be directly supported by Beam, as the required AvroWrapperCoder has been deleted. Since we have to change our code anyway, we are wondering what the recommended approach is to read/write Avro files from/to HDFS with Beam on Spark:

- use the new implementation of HDFSFileSource/HDFSFileSink
- use the Spark-provided HadoopIO (and probably reimplement AvroWrapperCoder ourselves?)

What are the trade-offs here, possibly also considering already planned changes to IO? Do we gain any advantage from using the Spark HadoopIO, given that our underlying engine is currently Spark, or will this eventually be deprecated and exist only for ‘historical’ reasons?

Any thoughts and advice here?

Regards,

michel
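The AvroIO-over-any-filesystem approach Davor describes did land in later Beam releases. A minimal sketch of reading and writing Avro on HDFS through AvroIO against the Beam 2.x API; the namenode host, the paths, and the toy schema are assumptions for illustration:

    import java.util.Collections;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.AvroIO;
    import org.apache.beam.sdk.io.hdfs.HadoopFileSystemOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.hadoop.conf.Configuration;

    public class AvroOverHdfs {
      public static void main(String[] args) {
        // Register HDFS with Beam's FileSystems layer so hdfs:// paths resolve.
        HadoopFileSystemOptions options =
            PipelineOptionsFactory.fromArgs(args).as(HadoopFileSystemOptions.class);
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical cluster
        options.setHdfsConfiguration(Collections.singletonList(conf));

        Pipeline p = Pipeline.create(options);

        // Toy schema for illustration; substitute your real record schema.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Rec\",\"fields\":"
                + "[{\"name\":\"id\",\"type\":\"string\"}]}");

        // Read and write Avro exactly as you would with gs:// or local paths.
        PCollection<GenericRecord> records =
            p.apply(AvroIO.readGenericRecords(schema).from("hdfs://namenode:8020/in/*.avro"));
        records.apply(AvroIO.writeGenericRecords(schema).to("hdfs://namenode:8020/out/part"));

        p.run().waitUntilFinish();
      }
    }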
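On the dynamic-partitioning question at the top of the thread: later Beam releases (2.2+) added FileIO.writeDynamic, which derives each element's destination from the element itself, so the set of output partitions does not need to be known before the data is examined. A sketch under that assumption; the record field "event_date" and the output path are hypothetical:

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.io.AvroIO;
    import org.apache.beam.sdk.io.FileIO;
    import org.apache.beam.sdk.values.PCollection;

    public class DynamicAvroWrite {
      static void writePartitioned(PCollection<GenericRecord> records, Schema schema) {
        records.apply(
            FileIO.<String, GenericRecord>writeDynamic()
                // The partition key is computed per element at run time,
                // so partitions need not be known up front.
                .by(r -> r.get("event_date").toString()) // hypothetical field
                .via(AvroIO.sink(schema))
                .to("hdfs://namenode:8020/out")
                .withDestinationCoder(StringUtf8Coder.of())
                .withNaming(date ->
                    FileIO.Write.defaultNaming("event_date=" + date + "/part", ".avro")));
      }
    }

On the read side, a glob such as hdfs://namenode:8020/out/event_date=*/part-*.avro picks up whatever partitions exist, since AvroIO filepatterns accept wildcards.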
