Re:RE: Re:Re: File Naming Pattern from HadoopOutputFormat

Haibo Sun Wed, 03 Jul 2019 18:37:44 -0700

Hi, Andreas


I'm glad you found a solution. If you're interested in option 2 that I 
mentioned, you can follow the progress of the issue that Yitzchak pointed out 
(https://issues.apache.org/jira/browse/FLINK-12573) by watching it.


Best,
Haibo

At 2019-07-03 21:11:44, "Hailu, Andreas" <andreas.ha...@gs.com> wrote:


Hi Haibo, Yitzchak, thanks for getting back to me.

 

The pattern I chose to use which worked was to extend the HadoopOutputFormat 
class, override the open() method, and modify the “mapreduce.output.basename” 
configuration property to match my desired file naming structure.
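
The approach described above can be sketched roughly as follows. This is a 
minimal, plain-Java illustration and not the code actually used in this 
thread: the class name, the helper names, and the "tmp" prefix are 
hypothetical. The commented-out open() override assumes that Flink's 
HadoopOutputFormat exposes its Hadoop Configuration through 
getConfiguration(), and the file name pattern is an approximation of what 
Hadoop's FileOutputFormat derives from the base name.

```java
import java.util.Locale;
import java.util.UUID;

final class BasenameSketch {
    // Hadoop property that FileOutputFormat reads when naming output files.
    static final String BASENAME_KEY = "mapreduce.output.basename";

    /** Builds a base name such as "tmp-<uuid>". The prefix is a hypothetical choice. */
    static String uuidBasename(String prefix, UUID uuid) {
        return prefix + "-" + uuid;
    }

    /**
     * Approximates the resulting file name: <basename>-r-<5-digit partition><extension>.
     * This models the naming scheme only; it does not call Hadoop.
     */
    static String expectedFileName(String basename, int partition, String extension) {
        return String.format(Locale.ROOT, "%s-r-%05d%s", basename, partition, extension);
    }

    // In a subclass of Flink's HadoopOutputFormat, the override might look like
    // this (not compiled here; assumes getConfiguration() returns the Hadoop
    // Configuration held by the output format):
    //
    //   @Override
    //   public void open(int taskNumber, int numTasks) throws IOException {
    //       getConfiguration().set(BASENAME_KEY, uuidBasename("tmp", UUID.randomUUID()));
    //       super.open(taskNumber, numTasks);
    //   }

    public static void main(String[] args) {
        String basename = uuidBasename("tmp", UUID.randomUUID());
        System.out.println(BASENAME_KEY + " = " + basename);
        System.out.println(expectedFileName(basename, 1, ".snappy.parquet"));
    }
}
```

Setting the property before calling super.open() matters in this sketch, 
because the record writer is created (and the file named) during open().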

 

// ah

 

From: Haibo Sun <sunhaib...@163.com>
Sent: Tuesday, July 2, 2019 5:57 AM
To: Yitzchak Lieberman <yitzch...@sentinelone.com>
Cc: Hailu, Andreas [Tech] <andreas.ha...@ny.email.gs.com>; user@flink.apache.org
Subject: Re:Re: File Naming Pattern from HadoopOutputFormat

 


Hi, Andreas 

 

You are right. To meet this requirement, Flink would need to expose an 
interface that allows customizing the file name.

 

Best,

Haibo


At 2019-07-02 16:33:44, "Yitzchak Lieberman" <yitzch...@sentinelone.com> wrote:



Regarding option 2 for Parquet:

Implementing a BucketAssigner won't set the file name, because getBucketId() 
defines the directory for the files when the data is partitioned. For example:

<root dir>/day=20190101/part-1-1

There is an open issue for that: 
https://issues.apache.org/jira/browse/FLINK-12573
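
To make the layout above concrete, here is a small plain-Java sketch that 
models (it does not call) how the part file path appears to be assembled: 
the bucket ID becomes a directory under the root, while the file name itself 
is built from a fixed "part" prefix, the subtask index, and a counter. The 
class name, method name, and the "/data/out" root are hypothetical.

```java
final class PartPathSketch {
    /**
     * Models the layout <root dir>/<bucketId>/part-<subtaskIndex>-<partCounter>.
     * The bucket ID only selects the directory; it never reaches the file name.
     */
    static String partPath(String rootDir, String bucketId, int subtaskIndex, long partCounter) {
        return rootDir + "/" + bucketId + "/part-" + subtaskIndex + "-" + partCounter;
    }

    public static void main(String[] args) {
        // A bucket ID such as "day=20190101" is what getBucketId() would return.
        System.out.println(partPath("/data/out", "day=20190101", 1, 1));
        // prints "/data/out/day=20190101/part-1-1"
    }
}
```

Under this model, returning a different value from getBucketId() changes 
only the directory component, which is why a custom file name needs the 
change tracked in the issue above.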

 

On Tue, Jul 2, 2019 at 6:18 AM Haibo Sun <sunhaib...@163.com> wrote:

Hi, Andreas

 

I think the following things may be what you want.

 

1. For writing Avro, I think you can extend AvroOutputFormat and override the  
getDirectoryFileName() method to customize a file name, as shown below.

The javadoc of AvroOutputFormat: 
https://ci.apache.org/projects/flink/flink-docs-release-1.8/api/java/org/apache/flink/formats/avro/AvroOutputFormat.html

 

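          // Needs: java.io.IOException, org.apache.flink.core.fs.Path,
          // org.apache.flink.formats.avro.AvroOutputFormat
          public static class CustomAvroOutputFormat<E> extends AvroOutputFormat<E> {
              public CustomAvroOutputFormat(Path filePath, Class<E> type) {
                  super(filePath, type);
              }

              public CustomAvroOutputFormat(Class<E> type) {
                  super(type);
              }

              @Override
              public void open(int taskNumber, int numTasks) throws IOException {
                  // Always write into a directory so that getDirectoryFileName()
                  // is used, even when the parallelism is 1.
                  this.setOutputDirectoryMode(OutputDirectoryMode.ALWAYS);
                  super.open(taskNumber, numTasks);
              }

              @Override
              protected String getDirectoryFileName(int taskNumber) {
                  // Build and return the custom file name here. This stub
                  // returns null and must be replaced before use.
                  return null;
              }
          }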

 

2. For writing Parquet, you can refer to ParquetStreamingFileSinkITCase, 
StreamingFileSink#forBulkFormat and DateTimeBucketAssigner. You can create a 
class that implements the BucketAssigner interface and return a custom file 
name in the getBucketId() method (the value returned by getBucketId() will be 
treated as the file name).

 

ParquetStreamingFileSinkITCase:  
https://github.com/apache/flink/blob/master/flink-formats/flink-parquet/src/test/java/org/apache/flink/formats/parquet/avro/ParquetStreamingFileSinkITCase.java

 

StreamingFileSink#forBulkFormat: 
https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/functions/sink/filesystem/StreamingFileSink.java

 

DateTimeBucketAssigner: 
https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/functions/sink/filesystem/bucketassigners/DateTimeBucketAssigner.java

 

 

Best,

Haibo


At 2019-07-02 04:15:07, "Hailu, Andreas" <andreas.ha...@gs.com> wrote:



Hello Flink team,

 

I’m writing Avro and Parquet files to HDFS, and I would like to include a 
UUID as part of the file name.

 

Our files in HDFS currently follow this pattern:

 

tmp-r-00001.snappy.parquet

tmp-r-00002.snappy.parquet

...

 

I’m using a custom output format which extends a RichOutputFormat - is this 
something which is natively supported? If so, could you please recommend how 
this could be done, or share the relevant document?

 

Best,

Andreas

 


Your Personal Data: We may collect and process information about you that may 
be subject to data protection laws. For more information about how we use and 
disclose your personal data, how we protect your information, our legal basis 
to use your information, your rights and who you can contact, please refer to: 
www.gs.com/privacy-notices




