[jira] [Commented] (FLINK-8814) Control over the extension of part files created by BucketingSink

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-8814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382187#comment-16382187
 ] 

ASF GitHub Bot commented on FLINK-8814:
---

Github user jelmerk closed the pull request at:

https://github.com/apache/flink/pull/5603


> Control over the extension of part files created by BucketingSink
> -
>
> Key: FLINK-8814
> URL: https://issues.apache.org/jira/browse/FLINK-8814
> Project: Flink
>  Issue Type: Improvement
>  Components: Streaming Connectors
>Affects Versions: 1.4.0
>Reporter: Jelmer Kuperus
>Priority: Major
> Fix For: 1.5.0
>
>
> BucketingSink creates files with the following pattern:
> {noformat}
> partPrefix + "-" + subtaskIndex + "-" + bucketState.partCounter{noformat}
> When using checkpointing you have no control over the extension of the final
> files generated. This is inconvenient when you are, for instance, writing files
> in the Avro format, because:
>  # [Hue|http://gethue.com/] will not be able to render the files as Avro. See
> this
> [file|https://github.com/cloudera/hue/blob/master/apps/filebrowser/src/filebrowser/views.py#L730]
>  # [Spark Avro|https://github.com/databricks/spark-avro/] will not be able to
> read the files unless you set a special property. See [this
> ticket|https://github.com/databricks/spark-avro/issues/203]
> It would be good if we had the ability to customize the extension of the
> files created.
>  
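
To make the request concrete, here is a minimal usage sketch assuming the suffix setter proposed in the pull request below (a method such as {{setPartSuffix}}); {{setBucketer}} and {{setPartPrefix}} already exist on {{BucketingSink}}, while the suffix setter name is an assumption.

{code:java}
import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink;
import org.apache.flink.streaming.connectors.fs.bucketing.DateTimeBucketer;

public class AvroPartFilesExample {
	public static BucketingSink<String> buildSink() {
		// Sketch only: setPartSuffix is the setter proposed for this issue, not part of the 1.4.0 API.
		BucketingSink<String> sink = new BucketingSink<>("hdfs:///data/events");
		sink.setBucketer(new DateTimeBucketer<>("yyyy-MM-dd--HH"));
		sink.setPartPrefix("part");   // existing setter; files are named part-<subtask>-<counter>
		sink.setPartSuffix(".avro");  // proposed setter; appends an extension, e.g. part-3-7.avro
		return sink;
	}
}
{code}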





[jira] [Commented] (FLINK-8814) Control over the extension of part files created by BucketingSink

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-8814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382188#comment-16382188
 ] 

ASF GitHub Bot commented on FLINK-8814:
---

Github user jelmerk commented on the issue:

https://github.com/apache/flink/pull/5603
  
Thanks!




[jira] [Commented] (FLINK-8814) Control over the extension of part files created by BucketingSink

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-8814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382089#comment-16382089
 ] 

ASF GitHub Bot commented on FLINK-8814:
---

Github user aljoscha commented on the issue:

https://github.com/apache/flink/pull/5603
  
Merged.  Could you please close the PR?




[jira] [Commented] (FLINK-8814) Control over the extension of part files created by BucketingSink

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-8814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381974#comment-16381974
 ] 

ASF GitHub Bot commented on FLINK-8814:
---

Github user aljoscha commented on a diff in the pull request:

https://github.com/apache/flink/pull/5603#discussion_r171553866
  
--- Diff: 
flink-connectors/flink-connector-filesystem/src/main/java/org/apache/flink/streaming/connectors/fs/bucketing/BucketingSink.java
 ---
@@ -986,6 +996,14 @@ private void 
handlePendingFilesForPreviousCheckpoints(Map pe
return this;
}
 
+   /**
+* Sets the prefix of part files.  The default is no suffix.
--- End diff --

`prefix` -> `suffix`
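
In other words, the Javadoc on the new setter should talk about the suffix, not the prefix. A corrected sketch (the method and field names below are assumed from the pull request; only the prefix/suffix wording is the reviewer's point):

{code:java}
/**
 * Sets the suffix of part files. The default is no suffix.
 */
public BucketingSink<T> setPartSuffix(String partSuffix) { // assumed setter name
	this.partSuffix = partSuffix;
	return this;
}
{code}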




[jira] [Commented] (FLINK-8814) Control over the extension of part files created by BucketingSink

2018-02-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-8814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381072#comment-16381072
 ] 

ASF GitHub Bot commented on FLINK-8814:
---

GitHub user jelmerk opened a pull request:

https://github.com/apache/flink/pull/5603

[FLINK-8814] Control over the extension of part files created by 
BucketingSink

## What is the purpose of the change

Popular tools like Hue and the Avro connector for Spark require files 
stored on HDFS to have the .avro extension. This patch makes it possible to 
configure a part file suffix.
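
As a quick illustration of the intended effect (a sketch only; it mirrors the naming pattern quoted in the issue, not the sink's actual code path):

{code:java}
public class PartFileNameSketch {
	public static void main(String[] args) {
		String partPrefix = "part";
		int subtaskIndex = 3;
		long partCounter = 7;
		String partSuffix = ".avro"; // the newly configurable piece

		String fileName = partPrefix + "-" + subtaskIndex + "-" + partCounter
				+ (partSuffix != null ? partSuffix : "");
		System.out.println(fileName); // prints "part-3-7.avro" instead of "part-3-7"
	}
}
{code}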

## Brief change log

- Adds support for a configurable partSuffix in BucketingSink

## Verifying this change

The basic functionality of BucketingSink is verified by BucketingSinkTest. 
The structure of that test makes it awkward to verify this change in isolation.

## Does this pull request potentially affect one of the following parts:

  - Dependencies (does it add or upgrade a dependency): no
  - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: no
  - The serializers: no
  - The runtime per-record code paths (performance sensitive): no
  - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Yarn/Mesos, ZooKeeper: no
  - The S3 file system connector: no

## Documentation

  - Does this pull request introduce a new feature? yes
  - If yes, how is the feature documented? JavaDocs


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jelmerk/flink FLINK_8814

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/5603.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #5603


commit 6487f0fb03870a74b948105ec685462e7b00cbc2
Author: Jelmer Kuperus 
Date:   2018-02-28T20:34:08Z

[FLINK-8814] [file system sinks] Control over the extension of part files 
created by BucketingSink.







--
This message was sent by Atlassian JIRA
(v7.6.3#76005)