[jira] [Commented] (FLINK-8814) Control over the extension of part files created by BucketingSink
[ https://issues.apache.org/jira/browse/FLINK-8814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382187#comment-16382187 ]

ASF GitHub Bot commented on FLINK-8814:
---

Github user jelmerk closed the pull request at:

    https://github.com/apache/flink/pull/5603

> Control over the extension of part files created by BucketingSink
> -
>
> Key: FLINK-8814
> URL: https://issues.apache.org/jira/browse/FLINK-8814
> Project: Flink
> Issue Type: Improvement
> Components: Streaming Connectors
> Affects Versions: 1.4.0
> Reporter: Jelmer Kuperus
> Priority: Major
> Fix For: 1.5.0
>
> BucketingSink creates files with the following pattern
> {noformat}
> partPrefix + "-" + subtaskIndex + "-" + bucketState.partCounter{noformat}
> When using checkpointing you have no control over the extension of the final
> files generated. This is inconvenient when you are, for instance, writing files
> in the Avro format because
> # [Hue|http://gethue.com/] will not be able to render the files as Avro. See
> this
> [file|https://github.com/cloudera/hue/blob/master/apps/filebrowser/src/filebrowser/views.py#L730]
> # [Spark avro|https://github.com/databricks/spark-avro/] will not be able to
> read the files unless you set a special property. See [this
> ticket|https://github.com/databricks/spark-avro/issues/203]
> It would be good if we had the ability to customize the extension of the
> files created.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-8814) Control over the extension of part files created by BucketingSink
[ https://issues.apache.org/jira/browse/FLINK-8814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382188#comment-16382188 ]

ASF GitHub Bot commented on FLINK-8814:
---

Github user jelmerk commented on the issue:

    https://github.com/apache/flink/pull/5603

    Thanks!
[jira] [Commented] (FLINK-8814) Control over the extension of part files created by BucketingSink
[ https://issues.apache.org/jira/browse/FLINK-8814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382089#comment-16382089 ]

ASF GitHub Bot commented on FLINK-8814:
---

Github user aljoscha commented on the issue:

    https://github.com/apache/flink/pull/5603

    Merged. Could you please close the PR?
[jira] [Commented] (FLINK-8814) Control over the extension of part files created by BucketingSink
[ https://issues.apache.org/jira/browse/FLINK-8814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381974#comment-16381974 ]

ASF GitHub Bot commented on FLINK-8814:
---

Github user aljoscha commented on a diff in the pull request:

    https://github.com/apache/flink/pull/5603#discussion_r171553866

    --- Diff: flink-connectors/flink-connector-filesystem/src/main/java/org/apache/flink/streaming/connectors/fs/bucketing/BucketingSink.java ---
    @@ -986,6 +996,14 @@ private void handlePendingFilesForPreviousCheckpoints(Mappe
     		return this;
     	}

    +	/**
    +	 * Sets the prefix of part files. The default is no suffix.
    --- End diff --

    `prefix` -> `suffix`
[jira] [Commented] (FLINK-8814) Control over the extension of part files created by BucketingSink
[ https://issues.apache.org/jira/browse/FLINK-8814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381072#comment-16381072 ]

ASF GitHub Bot commented on FLINK-8814:
---

GitHub user jelmerk opened a pull request:

    https://github.com/apache/flink/pull/5603

    [FLINK-8814] Control over the extension of part files created by BucketingSink

    ## What is the purpose of the change

    Popular tools like Hue and the Avro connector for Spark require files stored on HDFS to have the .avro extension. This patch makes it possible to configure a part file suffix.

    ## Brief change log

    - adds support for partSuffix in BucketingSink

    ## Verifying this change

    The basic functionality of BucketingSink is verified by BucketingSinkTest. The structure of this test makes it awkward to test this change in isolation.

    ## Does this pull request potentially affect one of the following parts:

    - Dependencies (does it add or upgrade a dependency): no
    - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: no
    - The serializers: no
    - The runtime per-record code paths (performance sensitive): no
    - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: no
    - The S3 file system connector: no

    ## Documentation

    - Does this pull request introduce a new feature? yes
    - If yes, how is the feature documented? JavaDocs

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jelmerk/flink FLINK_8814

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/5603.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5603

commit 6487f0fb03870a74b948105ec685462e7b00cbc2
Author: Jelmer Kuperus
Date:   2018-02-28T20:34:08Z

    [FLINK-8814] [file system sinks] Control over the extension of part files created by BucketingSink.
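For readers following the thread, here is a rough sketch of what a configurable part suffix does to the file name pattern quoted in the issue. It is not the code from the pull request: `PartFileNamer`, its methods, and the default prefix `"part"` are hypothetical stand-ins chosen for illustration, and only the `partPrefix + "-" + subtaskIndex + "-" + partCounter` pattern and the "default is no suffix" behaviour come from the messages above.

```java
// Illustrative sketch only; a hypothetical stand-in, not Flink's BucketingSink.
final class PartFileNamer {
    // Assumed default prefix for this sketch.
    private String partPrefix = "part";
    // No suffix by default, matching the Javadoc discussed in the review.
    private String partSuffix = null;

    /** Sets the prefix of part files. */
    PartFileNamer setPartPrefix(String partPrefix) {
        this.partPrefix = partPrefix;
        return this;
    }

    /** Sets the suffix of part files. The default is no suffix. */
    PartFileNamer setPartSuffix(String partSuffix) {
        this.partSuffix = partSuffix;
        return this;
    }

    /** Builds the name using the pattern from the issue, then appends the suffix if one is set. */
    String partFileName(int subtaskIndex, long partCounter) {
        String name = partPrefix + "-" + subtaskIndex + "-" + partCounter;
        return partSuffix == null ? name : name + partSuffix;
    }

    public static void main(String[] args) {
        // Without a suffix the file has no extension.
        System.out.println(new PartFileNamer().partFileName(3, 7)); // prints "part-3-7"
        // With a suffix, tools that key on the extension can recognise the file.
        System.out.println(new PartFileNamer().setPartSuffix(".avro").partFileName(3, 7)); // prints "part-3-7.avro"
    }
}
```

The setters return `this`, mirroring the `return this;` visible in the reviewed diff, so the suffix can be configured in the same fluent chain as the other sink options.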