[jira] [Commented] (FLINK-1981) Add GZip support
[ https://issues.apache.org/jira/browse/FLINK-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14576760#comment-14576760 ]
ASF GitHub Bot commented on FLINK-1981:
---
Github user mxm commented on the pull request: https://github.com/apache/flink/pull/762#issuecomment-109912571
Thank you for your contribution.

Add GZip support
Key: FLINK-1981
URL: https://issues.apache.org/jira/browse/FLINK-1981
Project: Flink
Issue Type: New Feature
Components: Core
Reporter: Sebastian Kruse
Assignee: Sebastian Kruse
Priority: Minor

GZip, as a commonly used compression format, should be supported in the same way as the already supported deflate files. This allows GZip files to be used with any subclass of FileInputFormat.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1981) Add GZip support
[ https://issues.apache.org/jira/browse/FLINK-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14576759#comment-14576759 ]
ASF GitHub Bot commented on FLINK-1981:
---
Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/762
[jira] [Commented] (FLINK-1981) Add GZip support
[ https://issues.apache.org/jira/browse/FLINK-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14576746#comment-14576746 ]
ASF GitHub Bot commented on FLINK-1981:
---
Github user mxm commented on the pull request: https://github.com/apache/flink/pull/762#issuecomment-109907044
Thanks for the documentation. Could you open a JIRA to account for the necessary changes in terms of extensibility?
[jira] [Commented] (FLINK-1981) Add GZip support
[ https://issues.apache.org/jira/browse/FLINK-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577248#comment-14577248 ]
ASF GitHub Bot commented on FLINK-1981:
---
Github user sekruse commented on the pull request: https://github.com/apache/flink/pull/762#issuecomment-110013765
Okay, will do that.

Add GZip support
Key: FLINK-1981
URL: https://issues.apache.org/jira/browse/FLINK-1981
Project: Flink
Issue Type: New Feature
Components: Core
Reporter: Sebastian Kruse
Assignee: Sebastian Kruse
Priority: Minor
Fix For: 0.9
[jira] [Commented] (FLINK-1981) Add GZip support
[ https://issues.apache.org/jira/browse/FLINK-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14572471#comment-14572471 ]
ASF GitHub Bot commented on FLINK-1981:
---
Github user mxm commented on the pull request: https://github.com/apache/flink/pull/762#issuecomment-108812916
:+1: This has been requested multiple times now. I would merge your pull request. Can you add some documentation?
[jira] [Commented] (FLINK-1981) Add GZip support
[ https://issues.apache.org/jira/browse/FLINK-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14572545#comment-14572545 ]
ASF GitHub Bot commented on FLINK-1981:
---
Github user sekruse commented on the pull request: https://github.com/apache/flink/pull/762#issuecomment-108844255
Sure, I can do that. Do you mean user documentation or more Javadoc? If the former, where should I preferably put that documentation?
[jira] [Commented] (FLINK-1981) Add GZip support
[ https://issues.apache.org/jira/browse/FLINK-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14572550#comment-14572550 ]
ASF GitHub Bot commented on FLINK-1981:
---
Github user mxm commented on the pull request: https://github.com/apache/flink/pull/762#issuecomment-108845395
I'm talking about the user documentation. You could mention support for gzip and add an example here: http://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#data-sources
[jira] [Commented] (FLINK-1981) Add GZip support
[ https://issues.apache.org/jira/browse/FLINK-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14572553#comment-14572553 ]
ASF GitHub Bot commented on FLINK-1981:
---
Github user mxm commented on the pull request: https://github.com/apache/flink/pull/762#issuecomment-108845535
You can modify the documentation in the `docs/apis/programming_guide.md` file.
[jira] [Commented] (FLINK-1981) Add GZip support
[ https://issues.apache.org/jira/browse/FLINK-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14570882#comment-14570882 ]
ASF GitHub Bot commented on FLINK-1981:
---
Github user sekruse commented on the pull request: https://github.com/apache/flink/pull/762#issuecomment-108443527
I replaced Validate with Preconditions in that part.
[jira] [Commented] (FLINK-1981) Add GZip support
[ https://issues.apache.org/jira/browse/FLINK-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569601#comment-14569601 ]
ASF GitHub Bot commented on FLINK-1981:
---
Github user rmetzger commented on a diff in the pull request: https://github.com/apache/flink/pull/762#discussion_r31560285
--- Diff: flink-core/src/main/java/org/apache/flink/api/common/io/FileInputFormat.java ---
    @@ -628,9 +692,10 @@ public void open(FileInputSplit fileSplit) throws IOException {
          * @see org.apache.flink.api.common.io.InputStreamFSInputWrapper
          */
         protected FSDataInputStream decorateInputStream(FSDataInputStream inputStream, FileInputSplit fileSplit) throws Throwable {
    -        // Wrap stream in a extracting (decompressing) stream if file ends with .deflate.
    -        if (fileSplit.getPath().getName().endsWith(DEFLATE_SUFFIX)) {
    -            return new InflaterInputStreamFSInputWrapper(stream);
    +        // Wrap stream in a extracting (decompressing) stream if file ends with a known compression file extension.
    +        InflaterInputStreamFactory<?> inflaterInputStreamFactory = getInflaterInputStreamFactory(fileSplit.getPath());
    +        if (inflaterInputStreamFactory != null) {
    +            return new InputStreamFSInputWrapper(inflaterInputStreamFactory.create(stream));
--- End diff --
So if there is no inflater input stream available, it will just fall back to the compressed data stream? Wouldn't it be better to at least log something or fail?
[jira] [Commented] (FLINK-1981) Add GZip support
[ https://issues.apache.org/jira/browse/FLINK-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569625#comment-14569625 ]
ASF GitHub Bot commented on FLINK-1981:
---
Github user sekruse commented on a diff in the pull request: https://github.com/apache/flink/pull/762#discussion_r31562256
--- Diff: flink-core/src/main/java/org/apache/flink/api/common/io/FileInputFormat.java ---
    @@ -628,9 +692,10 @@ public void open(FileInputSplit fileSplit) throws IOException {
          * @see org.apache.flink.api.common.io.InputStreamFSInputWrapper
          */
         protected FSDataInputStream decorateInputStream(FSDataInputStream inputStream, FileInputSplit fileSplit) throws Throwable {
    -        // Wrap stream in a extracting (decompressing) stream if file ends with .deflate.
    -        if (fileSplit.getPath().getName().endsWith(DEFLATE_SUFFIX)) {
    -            return new InflaterInputStreamFSInputWrapper(stream);
    +        // Wrap stream in a extracting (decompressing) stream if file ends with a known compression file extension.
    +        InflaterInputStreamFactory<?> inflaterInputStreamFactory = getInflaterInputStreamFactory(fileSplit.getPath());
    +        if (inflaterInputStreamFactory != null) {
    +            return new InputStreamFSInputWrapper(inflaterInputStreamFactory.create(stream));
--- End diff --
It might also be the case that the stream was not compressed at all. It would of course be nice to react appropriately to a missing codec, but how would we know if the current input split belongs to an uncompressed file or a compressed file with an unknown codec?
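The question raised in this comment is hard to answer from the file name alone. One conceivable heuristic, which the pull request does not implement, is to peek at the first bytes of a file: GZIP data starts with the magic bytes 0x1f 0x8b. The sketch below uses only the JDK; the class name `GzipMagicSketch` and its methods are hypothetical, and the check is only meaningful at the start of a file, not for a split that begins mid-file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class GzipMagicSketch {

    /**
     * Returns true if the stream starts with the GZIP magic bytes 0x1f 0x8b.
     * The bytes that were read are pushed back, so the caller can still
     * consume the stream from its first byte. The PushbackInputStream must
     * have been created with a pushback buffer of at least 2 bytes.
     */
    public static boolean looksLikeGzip(PushbackInputStream in) throws IOException {
        byte[] header = new byte[2];
        int read = 0;
        while (read < header.length) {
            int n = in.read(header, read, header.length - read);
            if (n < 0) {
                break; // stream ended before two bytes were available
            }
            read += n;
        }
        if (read > 0) {
            in.unread(header, 0, read);
        }
        return read == 2
                && (header[0] & 0xff) == 0x1f
                && (header[1] & 0xff) == 0x8b;
    }

    /** Helper for the demo: compresses a string into GZIP bytes. */
    public static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] compressed = gzip("a,b\n1,2\n");
        byte[] plain = "a,b\n1,2\n".getBytes(StandardCharsets.UTF_8);

        PushbackInputStream first = new PushbackInputStream(new ByteArrayInputStream(compressed), 2);
        PushbackInputStream second = new PushbackInputStream(new ByteArrayInputStream(plain), 2);

        System.out.println("compressed input detected: " + looksLikeGzip(first));
        System.out.println("plain input detected: " + looksLikeGzip(second));
    }
}
```

Such a check covers a single known format only; a file compressed with a codec that has no registered decompressor would still be indistinguishable from arbitrary binary data.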
[jira] [Commented] (FLINK-1981) Add GZip support
[ https://issues.apache.org/jira/browse/FLINK-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569638#comment-14569638 ]
ASF GitHub Bot commented on FLINK-1981:
---
Github user rmetzger commented on a diff in the pull request: https://github.com/apache/flink/pull/762#discussion_r31562955
--- Diff: flink-core/src/main/java/org/apache/flink/api/common/io/FileInputFormat.java ---
    @@ -628,9 +692,10 @@ public void open(FileInputSplit fileSplit) throws IOException {
          * @see org.apache.flink.api.common.io.InputStreamFSInputWrapper
          */
         protected FSDataInputStream decorateInputStream(FSDataInputStream inputStream, FileInputSplit fileSplit) throws Throwable {
    -        // Wrap stream in a extracting (decompressing) stream if file ends with .deflate.
    -        if (fileSplit.getPath().getName().endsWith(DEFLATE_SUFFIX)) {
    -            return new InflaterInputStreamFSInputWrapper(stream);
    +        // Wrap stream in a extracting (decompressing) stream if file ends with a known compression file extension.
    +        InflaterInputStreamFactory<?> inflaterInputStreamFactory = getInflaterInputStreamFactory(fileSplit.getPath());
    +        if (inflaterInputStreamFactory != null) {
    +            return new InputStreamFSInputWrapper(inflaterInputStreamFactory.create(stream));
--- End diff --
Ah, okay, I see. I didn't read the code closely enough.
[jira] [Commented] (FLINK-1981) Add GZip support
[ https://issues.apache.org/jira/browse/FLINK-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569589#comment-14569589 ]
ASF GitHub Bot commented on FLINK-1981:
---
Github user rmetzger commented on a diff in the pull request: https://github.com/apache/flink/pull/762#discussion_r31559688
--- Diff: flink-core/src/main/java/org/apache/flink/api/common/io/FileInputFormat.java ---
    @@ -21,10 +21,16 @@
     import java.io.IOException;
     import java.util.ArrayList;
     import java.util.Arrays;
    +import java.util.HashMap;
     import java.util.HashSet;
     import java.util.List;
    +import java.util.Map;
     import java.util.Set;
    +import org.apache.commons.lang3.Validate;
--- End diff --
I'm really sorry that you ran into this, but the community recently decided to use Guava's Preconditions.check() instead of commons lang. Can you replace that?
[jira] [Commented] (FLINK-1981) Add GZip support
[ https://issues.apache.org/jira/browse/FLINK-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569425#comment-14569425 ]
ASF GitHub Bot commented on FLINK-1981:
---
GitHub user sekruse opened a pull request: https://github.com/apache/flink/pull/762
[FLINK-1981] add support for GZIP files
* register decompression algorithms with file extensions for extensibility
* fit deflate decompression into this scheme
* add support for GZIP files
* test support for deflate and GZIP files with the CsvInputFormat

You can merge this pull request into a Git repository by running:
    $ git pull https://github.com/sekruse/flink FLINK-1981
Alternatively you can review and apply these changes as the patch at:
    https://github.com/apache/flink/pull/762.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
    This closes #762

commit 6acae7faa4e27837ce3c9272d4310ec6c46895ab
Author: Sebastian Kruse sebastian.kr...@hpi.de
Date: 2015-06-02T16:58:35Z
    [FLINK-1981] add support for GZIP files
    * register decompression algorithms with file extensions for extensibility
    * fit deflate decompression into this scheme
    * add support for GZIP files
    * test support for deflate and GZIP files with the CsvInputFormat
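The scheme listed in the pull request description (decompression algorithms registered per file extension, with deflate and GZIP as two entries) could look roughly like the sketch below. It is not the code from the pull request: the class name `DecompressionRegistrySketch`, the `StreamFactory` interface, and the methods `register`, `lookup` and `open` are hypothetical, and only `java.util.zip` classes from the JDK are used.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.InflaterInputStream;

public class DecompressionRegistrySketch {

    /** Hypothetical factory: wraps a raw stream in a decompressing stream. */
    public interface StreamFactory {
        InputStream create(InputStream raw) throws IOException;
    }

    // Maps a lower-case file extension (without the dot) to its factory.
    private static final Map<String, StreamFactory> FACTORIES = new HashMap<>();

    static {
        // Deflate and GZIP are registered through the same mechanism.
        register("deflate", InflaterInputStream::new);
        register("gz", GZIPInputStream::new);
    }

    /** Registers a decompression factory for a file extension. */
    public static synchronized void register(String extension, StreamFactory factory) {
        FACTORIES.put(extension.toLowerCase(Locale.ROOT), factory);
    }

    /** Returns the factory for the file's extension, or null if none is registered. */
    public static synchronized StreamFactory lookup(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return null;
        }
        return FACTORIES.get(fileName.substring(dot + 1).toLowerCase(Locale.ROOT));
    }

    /** Wraps the stream if the extension is known; otherwise returns it unchanged. */
    public static InputStream open(String fileName, InputStream raw) throws IOException {
        StreamFactory factory = lookup(fileName);
        return factory == null ? raw : factory.create(raw);
    }

    /** Helper for the demo: compresses a string into GZIP bytes. */
    public static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buffer.toByteArray();
    }

    /** Helper for the demo: reads a stream fully as UTF-8 text. */
    public static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        byte[] compressed = gzip("a,b\n1,2\n");
        InputStream in = open("data.csv.gz", new ByteArrayInputStream(compressed));
        System.out.print(readAll(in));
    }
}
```

Keeping the lookup in one map means that supporting another format only requires one more `register` call, which is the extensibility the first bullet point refers to. A file whose extension is not registered is passed through unchanged.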