[jira] [Commented] (FLINK-3655) Allow comma-separated or multiple directories to be specified for FileInputFormat
[ https://issues.apache.org/jira/browse/FLINK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439934#comment-16439934 ]

ASF GitHub Bot commented on FLINK-3655:
---

Github user asfgit closed the pull request at:

    https://github.com/apache/flink/pull/1990

> Allow comma-separated or multiple directories to be specified for FileInputFormat
> -
>
>                 Key: FLINK-3655
>                 URL: https://issues.apache.org/jira/browse/FLINK-3655
>             Project: Flink
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 1.0.0
>            Reporter: Gna Phetsarath
>            Assignee: Fabian Hueske
>            Priority: Major
>              Labels: starter
>             Fix For: 1.5.0
>
>
> Allow comma-separated or multiple directories to be specified for
> FileInputFormat so that a DataSource will process the directories
> sequentially.
>
> env.readFile("/data/2016/01/01/*/*,/data/2016/01/02/*/*,/data/2016/01/03/*/*")
>
> in Scala
>    env.readFile(paths: Seq[String])
> or
>    env.readFile(path: String, otherPaths: String*)
>
> Wildcard support would be a bonus.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (FLINK-3655) Allow comma-separated or multiple directories to be specified for FileInputFormat
[ https://issues.apache.org/jira/browse/FLINK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16366868#comment-16366868 ]

ASF GitHub Bot commented on FLINK-3655:
---

Github user fhueske commented on the issue:

    https://github.com/apache/flink/pull/1990

    Hi @gna-phetsarath,

    I rebased and merged your commit as part of PR #5415. Could you please close this PR?

    Thank you, Fabian
[jira] [Commented] (FLINK-3655) Allow comma-separated or multiple directories to be specified for FileInputFormat
[ https://issues.apache.org/jira/browse/FLINK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16366861#comment-16366861 ]

ASF GitHub Bot commented on FLINK-3655:
---

Github user fhueske closed the pull request at:

    https://github.com/apache/flink/pull/5415
[jira] [Commented] (FLINK-3655) Allow comma-separated or multiple directories to be specified for FileInputFormat
[ https://issues.apache.org/jira/browse/FLINK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16366685#comment-16366685 ]

ASF GitHub Bot commented on FLINK-3655:
---

Github user fhueske commented on the issue:

    https://github.com/apache/flink/pull/5415

    Will merge this.
[jira] [Commented] (FLINK-3655) Allow comma-separated or multiple directories to be specified for FileInputFormat
[ https://issues.apache.org/jira/browse/FLINK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16363950#comment-16363950 ]

Fabian Hueske commented on FLINK-3655:
---

Hi [~sjwiesman],

I've reworked the PR and opened a new one: https://github.com/apache/flink/pull/5415

Unfortunately, we cannot magically enable this feature for all input formats that are based on {{FileInputFormat}}, because it is a {{@Public}} interface. With the changes that I proposed in the PR, we enable multipath support for the CsvInputFormats, AvroInputFormat, OrcRowInputFormat, and TextInputFormat. All other classes would have to override the {{supportsMultiPaths()}} method.

Can you check if the changes in the PR would address your use case? It would be great if you could provide feedback soon because the feature freeze for Flink 1.5.0 will happen in a few days.

Thank you, Fabian
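The opt-in described in the comment above can be sketched in isolation. This is a minimal, self-contained illustration and not Flink's actual code: BaseFileFormat, LegacyFormat, and MultiPathTextFormat are hypothetical stand-ins, and only the method names supportsMultiPaths() and setFilePaths(...) are taken from the thread. It assumes the base class rejects more than one path unless the concrete format has opted in, which is one way to keep existing subclasses of a public class unaffected.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class MultiPathOptIn {

    /** Hypothetical stand-in for a public base class; not Flink's FileInputFormat. */
    abstract static class BaseFileFormat {
        private List<String> filePaths = Collections.emptyList();

        /** Defaults to false, so existing subclasses keep their single-path behavior. */
        public boolean supportsMultiPaths() {
            return false;
        }

        /** Accepts several paths only if the concrete format has opted in. */
        public void setFilePaths(String... paths) {
            if (paths.length == 0) {
                throw new IllegalArgumentException("At least one path is required.");
            }
            if (paths.length > 1 && !supportsMultiPaths()) {
                throw new UnsupportedOperationException(
                    getClass().getSimpleName() + " does not support multiple paths.");
            }
            this.filePaths = Arrays.asList(paths.clone());
        }

        public List<String> getFilePaths() {
            return filePaths;
        }
    }

    /** A format written before the feature existed: it inherits the default. */
    static class LegacyFormat extends BaseFileFormat { }

    /** A format that has been checked for multi-path use and opts in explicitly. */
    static class MultiPathTextFormat extends BaseFileFormat {
        @Override
        public boolean supportsMultiPaths() {
            return true;
        }
    }

    public static void main(String[] args) {
        BaseFileFormat multi = new MultiPathTextFormat();
        multi.setFilePaths("/data/2016/01/01", "/data/2016/01/02");
        System.out.println(multi.getFilePaths());

        try {
            new LegacyFormat().setFilePaths("/data/2016/01/01", "/data/2016/01/02");
        } catch (UnsupportedOperationException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The design choice illustrated here is that the default answer is "no": a subclass that reads the single path field directly never sees a list of paths it was not written to handle.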
[jira] [Commented] (FLINK-3655) Allow comma-separated or multiple directories to be specified for FileInputFormat
[ https://issues.apache.org/jira/browse/FLINK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16363907#comment-16363907 ]

ASF GitHub Bot commented on FLINK-3655:
---

Github user fhueske commented on the issue:

    https://github.com/apache/flink/pull/5415

    Thanks for the review @zentol. I've addressed your feedback, improved the backwards compatibility as discussed offline, and added multi-path support to additional input formats.
[jira] [Commented] (FLINK-3655) Allow comma-separated or multiple directories to be specified for FileInputFormat
[ https://issues.apache.org/jira/browse/FLINK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362276#comment-16362276 ]

ASF GitHub Bot commented on FLINK-3655:
---

Github user zentol commented on a diff in the pull request:

    https://github.com/apache/flink/pull/5415#discussion_r167849176

    --- Diff: flink-core/src/test/java/org/apache/flink/api/common/io/BinaryInputFormatTest.java ---
    @@ -40,8 +41,12 @@ protected Record deserialize(Record record, DataInputView dataInput) {
                 return record;
             }
    -    }
    +        @Override
    +        public boolean supportsMultiPaths() {
    --- End diff --

    So this is what users would have to do to enable this feature?
[jira] [Commented] (FLINK-3655) Allow comma-separated or multiple directories to be specified for FileInputFormat
[ https://issues.apache.org/jira/browse/FLINK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362278#comment-16362278 ]

ASF GitHub Bot commented on FLINK-3655:
---

Github user zentol commented on a diff in the pull request:

    https://github.com/apache/flink/pull/5415#discussion_r167850774

    --- Diff: flink-core/src/test/java/org/apache/flink/api/common/io/DelimitedInputFormatTest.java ---
    @@ -428,6 +431,78 @@ public void testDelimiterOnBufferBoundary() throws IOException {
             format.close();
         }
    +    // -- Statistics --//
    +
    +    @Test
    +    public void testGetStatistics() throws IOException {
    +        final String myString = "my mocked line 1\nmy mocked line 2\n";
    +        final long size = myString.length();
    +        final Path filePath = createTempFilePath(myString);
    +
    +        final String myString2 = "my mocked line 1\nmy mocked line 2\nanother mocked line3\n";
    +        final long size2 = myString2.length();
    +        final Path filePath2 = createTempFilePath(myString2);
    +
    +        final long totalSize = size + size2;
    +
    +        DelimitedInputFormat format = new MyTextInputFormat();
    +        format.setFilePaths(filePath.toUri().toString(), filePath2.toUri().toString());
    +
    +        FileInputFormat.FileBaseStatistics stats = format.getStatistics(null);
    +        assertNotNull(stats);
    +        assertEquals("The file size from the statistics is wrong.", totalSize, stats.getTotalInputSize());
    +    }
    +
    +    @Test
    +    public void testGetStatisticsFileDoesNotExist() throws IOException {
    +        DelimitedInputFormat format = new MyTextInputFormat();
    +        format.setFilePaths("file:///path/does/not/really/exist", "file:///another/path/that/does/not/exist");
    +
    +        FileBaseStatistics stats = format.getStatistics(null);
    +        assertNull("The file statistics should be null.", stats);
    +    }
    +
    +    @Test
    +    public void testGetStatisticsSingleFileWithCachedVersion() throws IOException {
    +        final String myString = "my mocked line 1\nmy mocked line 2\n";
    +        final Path tempFile = createTempFilePath(myString);
    +        final long size = myString.length();
    +        final long cachedSize = 10065;
    +
    +        DelimitedInputFormat format = new MyTextInputFormat();
    +        format.setFilePath(tempFile);
    +        format.configure(new Configuration());
    +
    +        FileBaseStatistics stats = format.getStatistics(null);
    +        assertNotNull(stats);
    +        assertEquals("The file size from the statistics is wrong.", size, stats.getTotalInputSize());
    +
    +        format = new MyTextInputFormat();
    +        format.setFilePath(tempFile);
    +        format.configure(new Configuration());
    +
    +        FileBaseStatistics newStats = format.getStatistics(stats);
    +        assertEquals("Statistics object was changed.", newStats, stats);
    +
    +        // insert fake stats with the correct modification time. the call should return the fake stats
    +        format = new MyTextInputFormat();
    +        format.setFilePath(tempFile);
    +        format.configure(new Configuration());
    +
    +        FileBaseStatistics fakeStats = new FileBaseStatistics(stats.getLastModificationTime(), cachedSize, BaseStatistics.AVG_RECORD_BYTES_UNKNOWN);
    +        BaseStatistics latest = format.getStatistics(fakeStats);
    +        assertEquals("The file size from the statistics is wrong.", cachedSize, latest.getTotalInputSize());
    +
    +        // insert fake stats with the expired modification time. the call should return new accurate stats
    +        format = new MyTextInputFormat();
    +        format.setFilePath(tempFile);
    +        format.configure(new Configuration());
    +
    +        FileBaseStatistics outDatedFakeStats = new FileBaseStatistics(stats.getLastModificationTime()-1, cachedSize, BaseStatistics.AVG_RECORD_BYTES_UNKNOWN);
    --- End diff --

    missing spaces around `-`
[jira] [Commented] (FLINK-3655) Allow comma-separated or multiple directories to be specified for FileInputFormat
[ https://issues.apache.org/jira/browse/FLINK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362277#comment-16362277 ]

ASF GitHub Bot commented on FLINK-3655:
---

Github user zentol commented on a diff in the pull request:

    https://github.com/apache/flink/pull/5415#discussion_r167850694

    --- Diff: flink-core/src/test/java/org/apache/flink/api/common/io/DelimitedInputFormatTest.java ---
    @@ -428,6 +431,78 @@ public void testDelimiterOnBufferBoundary() throws IOException {
             format.close();
         }
    +    // -- Statistics --//
    +
    +    @Test
    +    public void testGetStatistics() throws IOException {
    +        final String myString = "my mocked line 1\nmy mocked line 2\n";
    +        final long size = myString.length();
    +        final Path filePath = createTempFilePath(myString);
    +
    +        final String myString2 = "my mocked line 1\nmy mocked line 2\nanother mocked line3\n";
    +        final long size2 = myString2.length();
    +        final Path filePath2 = createTempFilePath(myString2);
    +
    +        final long totalSize = size + size2;
    +
    +        DelimitedInputFormat format = new MyTextInputFormat();
    +        format.setFilePaths(filePath.toUri().toString(), filePath2.toUri().toString());
    +
    +        FileInputFormat.FileBaseStatistics stats = format.getStatistics(null);
    +        assertNotNull(stats);
    +        assertEquals("The file size from the statistics is wrong.", totalSize, stats.getTotalInputSize());
    +    }
    +
    +    @Test
    +    public void testGetStatisticsFileDoesNotExist() throws IOException {
    +        DelimitedInputFormat format = new MyTextInputFormat();
    +        format.setFilePaths("file:///path/does/not/really/exist", "file:///another/path/that/does/not/exist");
    +
    +        FileBaseStatistics stats = format.getStatistics(null);
    +        assertNull("The file statistics should be null.", stats);
    +    }
    +
    +    @Test
    +    public void testGetStatisticsSingleFileWithCachedVersion() throws IOException {
    +        final String myString = "my mocked line 1\nmy mocked line 2\n";
    +        final Path tempFile = createTempFilePath(myString);
    +        final long size = myString.length();
    +        final long cachedSize = 10065;
    --- End diff --

    can we rename this to `fakeSize`?
[jira] [Commented] (FLINK-3655) Allow comma-separated or multiple directories to be specified for FileInputFormat
[ https://issues.apache.org/jira/browse/FLINK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362279#comment-16362279 ]

ASF GitHub Bot commented on FLINK-3655:
---

Github user zentol commented on a diff in the pull request:

    https://github.com/apache/flink/pull/5415#discussion_r167853159

    --- Diff: flink-core/src/test/java/org/apache/flink/api/common/io/FileInputFormatTest.java ---
    @@ -203,7 +390,104 @@ public void testGetStatisticsMultipleFilesWithCachedVersion() {
                 Assert.fail(ex.getMessage());
             }
         }
    -
    +
    +    // -- Multiple Files -- //
    +
    +    @Test
    +    public void testGetStatisticsMultipleNonExistingFile() throws IOException {
    +        final MultiDummyFileInputFormat format = new MultiDummyFileInputFormat();
    +        format.setFilePaths("file:///some/none/existing/directory/","file:///another/non/existing/directory/");
    +        format.configure(new Configuration());
    +
    +        BaseStatistics stats = format.getStatistics(null);
    +        Assert.assertNull("The file statistics should be null.", stats);
    +    }
    +
    +    @Test
    +    public void testGetStatisticsMultipleOneFileNoCachedVersion() throws IOException {
    +        final long size1 = 1024 * 500;
    +        String tempFile = TestFileUtils.createTempFile(size1);
    +
    +        final long size2 = 1024 * 505;
    +        String tempFile2 = TestFileUtils.createTempFile(size2);
    +
    +        final long totalSize = size1 + size2;
    +
    +        final MultiDummyFileInputFormat format = new MultiDummyFileInputFormat();
    +        format.setFilePaths(tempFile, tempFile2);
    +        format.configure(new Configuration());
    +
    +        BaseStatistics stats = format.getStatistics(null);
    +        Assert.assertEquals("The file size from the statistics is wrong.", totalSize, stats.getTotalInputSize());
    +    }
    +
    +    @Test
    +    public void testGetStatisticsMultipleFilesMultiplePathsNoCachedVersion() throws IOException {
    +        final long size1 = 2077;
    +        final long size2 = 31909;
    +        final long size3 = 10;
    +        final long totalSize123 = size1 + size2 + size3;
    +
    +        String tempDir = TestFileUtils.createTempFileDir(size1, size2, size3);
    +
    +        final long size4 = 2051;
    +        final long size5 = 31902;
    +        final long size6 = 15;
    +        final long totalSize456 = size4 + size5 + size6;
    +        String tempDir2 = TestFileUtils.createTempFileDir(size4, size5, size6);
    +
    +        final MultiDummyFileInputFormat format = new MultiDummyFileInputFormat();
    +        format.setFilePaths(tempDir, tempDir2);
    +        format.configure(new Configuration());
    +
    +        BaseStatistics stats = format.getStatistics(null);
    +        Assert.assertEquals("The file size from the statistics is wrong.", totalSize123 + totalSize456, stats.getTotalInputSize());
    +    }
    +
    +    @Test
    +    public void testGetStatisticsMultipleOneFileWithCachedVersion() throws IOException {
    +        final long size1 = 50873;
    +        final long cachedSize = 10065;
    --- End diff --

    rename to `fakeSize`
[jira] [Commented] (FLINK-3655) Allow comma-separated or multiple directories to be specified for FileInputFormat
[ https://issues.apache.org/jira/browse/FLINK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362064#comment-16362064 ]

ASF GitHub Bot commented on FLINK-3655:
---

Github user fhueske commented on the issue:

    https://github.com/apache/flink/pull/5415

    Thanks for the question @zentol. That might be a cleaner solution, but I don't think we could move much into the shared super class. Everything that's not `private` would need to be duplicated to maintain binary compatibility. However, most methods are `protected` or `private`.
[jira] [Commented] (FLINK-3655) Allow comma-separated or multiple directories to be specified for FileInputFormat
[ https://issues.apache.org/jira/browse/FLINK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16360838#comment-16360838 ]

ASF GitHub Bot commented on FLINK-3655:
---

Github user zentol commented on the issue:

    https://github.com/apache/flink/pull/5415

    What would speak against creating a new FileInputFormat that supports multiple paths instead? Common code could be moved into a shared super class (I _think_ that would be allowed).
[jira] [Commented] (FLINK-3655) Allow comma-separated or multiple directories to be specified for FileInputFormat
[ https://issues.apache.org/jira/browse/FLINK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16306527#comment-16306527 ]

Fabian Hueske commented on FLINK-3655:
---

Hi [~sjwiesman],

sorry for the delay. I just had a look at the PR. The changes look mostly good, but they break the public API in some places. IMO, it could go into Flink 1.5.0 after some adjustments.

Because it has been more than 1.5 years since the PR was opened, I'll rework it myself and open a new PR with my changes on top of the initial PR. It would be great if you could give it a try once I've opened my PR.

Thanks, Fabian
[jira] [Commented] (FLINK-3655) Allow comma-separated or multiple directories to be specified for FileInputFormat
[ https://issues.apache.org/jira/browse/FLINK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16305745#comment-16305745 ]

Seth Wiesman commented on FLINK-3655:
---

[~fhueske] I have been looking for this feature. Is there anything I can do to help get the PR merged?
[jira] [Commented] (FLINK-3655) Allow comma-separated or multiple directories to be specified for FileInputFormat
[ https://issues.apache.org/jira/browse/FLINK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16196030#comment-16196030 ]

Fabian Hueske commented on FLINK-3655:
---

I don't think there's a particular reason why the PR hasn't been merged. Nobody picked it up and it disappeared from the radar in the list of stale PRs :-/

I'll try to have a look in the next days.
[jira] [Commented] (FLINK-3655) Allow comma-separated or multiple directories to be specified for FileInputFormat
[ https://issues.apache.org/jira/browse/FLINK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16195159#comment-16195159 ]

Vishnu Viswanath commented on FLINK-3655:
---

was looking for this feature. why wasn't this ever merged?
[jira] [Commented] (FLINK-3655) Allow comma-separated or multiple directories to be specified for FileInputFormat
[ https://issues.apache.org/jira/browse/FLINK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15302164#comment-15302164 ]

ASF GitHub Bot commented on FLINK-3655:
---

Github user gna-phetsarath commented on the pull request:

    https://github.com/apache/flink/pull/1990#issuecomment-221890183

    You are correct, the majority of the changes were in the "generate splits" method and "statistics" methods which included changes to subclasses that used the file path directly. Not as extensive as it appears.

    Also, additional tests were added.
[jira] [Commented] (FLINK-3655) Allow comma-separated or multiple directories to be specified for FileInputFormat
[ https://issues.apache.org/jira/browse/FLINK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15302050#comment-15302050 ]

ASF GitHub Bot commented on FLINK-3655:
---

Github user StephanEwen commented on the pull request:

    https://github.com/apache/flink/pull/1990#issuecomment-221866397

    Thanks for opening that contribution. Can you sum up the changes you made? That would make the review easier.

    The changes look quite extensive. My gut feeling would be that it should not require so many changes, ideally only an additional loop in the "generate splits" method, and possibly in the "statistics" method.
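The "additional loop" idea for the statistics side can be illustrated with plain java.nio, outside of Flink. The sketch below is an assumption-laden illustration, not the code from either pull request: MultiPathSize, totalInputSize, and tempFileOfSize are hypothetical names, directories are only scanned one level deep, and -1 is used as a stand-in for "no statistics available" when any path is missing.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class MultiPathSize {

    /**
     * Sums the sizes of all inputs by looping over the given paths.
     * Returns -1 if any path does not exist (an assumed sentinel for "unknown").
     */
    static long totalInputSize(Path... paths) {
        long total = 0L;
        for (Path path : paths) {
            if (!Files.exists(path)) {
                return -1L;
            }
            total += sizeOf(path);
        }
        return total;
    }

    /** Size of a regular file, or of the regular files directly inside a directory. */
    private static long sizeOf(Path path) {
        try {
            if (Files.isRegularFile(path)) {
                return Files.size(path);
            }
            if (!Files.isDirectory(path)) {
                return 0L;
            }
            long sum = 0L;
            try (DirectoryStream<Path> children = Files.newDirectoryStream(path)) {
                for (Path child : children) {
                    if (Files.isRegularFile(child)) {
                        sum += Files.size(child);
                    }
                }
            }
            return sum;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Hypothetical helper for the demo: creates a temporary file with the given number of bytes. */
    static Path tempFileOfSize(int bytes) {
        try {
            Path file = Files.createTempFile("multipath", ".bin");
            Files.write(file, new byte[bytes]);
            file.toFile().deleteOnExit();
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path a = tempFileOfSize(10);
        Path b = tempFileOfSize(32);
        System.out.println(totalInputSize(a, b));
        System.out.println(totalInputSize(a, Paths.get("/path/does/not/really/exist")));
    }
}
```

The single-path case is simply the loop with one iteration, which is why the change can stay small when nothing else reads the path field directly.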
[jira] [Commented] (FLINK-3655) Allow comma-separated or multiple directories to be specified for FileInputFormat
[ https://issues.apache.org/jira/browse/FLINK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15289131#comment-15289131 ]

Gna Phetsarath commented on FLINK-3655:
---

There's a pull request for this: https://github.com/apache/flink/pull/1990
[jira] [Commented] (FLINK-3655) Allow comma-separated or multiple directories to be specified for FileInputFormat
[ https://issues.apache.org/jira/browse/FLINK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15271314#comment-15271314 ]

Gna Phetsarath commented on FLINK-3655:
---

What's the progress on this ticket, [~tianli]?
[jira] [Commented] (FLINK-3655) Allow comma-separated or multiple directories to be specified for FileInputFormat
[ https://issues.apache.org/jira/browse/FLINK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15227573#comment-15227573 ]

Tian, Li commented on FLINK-3655:
---

Thanks, I will do the path list first and use "readFile(FileInputFormat inputFormat, String... filePaths)".
[jira] [Commented] (FLINK-3655) Allow comma-separated or multiple directories to be specified for FileInputFormat
[ https://issues.apache.org/jira/browse/FLINK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15227545#comment-15227545 ]

Tian, Li commented on FLINK-3655:
---

Will support wildcards
[jira] [Commented] (FLINK-3655) Allow comma-separated or multiple directories to be specified for FileInputFormat
[ https://issues.apache.org/jira/browse/FLINK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15226676#comment-15226676 ]

Gna Phetsarath commented on FLINK-3655:
---

Will we be doing wildcards as well, or should that be put in another ticket?
[jira] [Commented] (FLINK-3655) Allow comma-separated or multiple directories to be specified for FileInputFormat
[ https://issues.apache.org/jira/browse/FLINK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15226474#comment-15226474 ]

Maximilian Michels commented on FLINK-3655:
---

Sounds good. It is important to maintain backwards compatibility. I'm not sure about the "comma-separated Path string". File names may contain commas. So we might skip that for now and do the path list first. I think we could also use {{readFile(FileInputFormat inputFormat, String... filePaths)}}, which will pass the file paths to the method as a {{String[] filePaths}} array.
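The concern about commas in file names can be shown with a small, self-contained sketch. The class and method names (`PathArgsSketch`, `fromCommaSeparated`, `fromVarargs`) are hypothetical and not part of Flink; the example only contrasts splitting one string on commas with accepting the paths as a Java varargs parameter.

```java
import java.util.Arrays;
import java.util.List;

final class PathArgsSketch {

    /** Splitting on commas cannot tell a separator from a comma inside a file name. */
    static List<String> fromCommaSeparated(String paths) {
        return Arrays.asList(paths.split(","));
    }

    /** With varargs, each argument is exactly one path, so embedded commas survive. */
    static List<String> fromVarargs(String... filePaths) {
        return Arrays.asList(filePaths);
    }
}
```

For instance, a file named "a,b.csv" under /data is cut into two bogus paths by `fromCommaSeparated("/data/a,b.csv,/data/c.csv")`, while `fromVarargs("/data/a,b.csv", "/data/c.csv")` keeps it as a single path.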
[jira] [Commented] (FLINK-3655) Allow comma-separated or multiple directories to be specified for FileInputFormat
[ https://issues.apache.org/jira/browse/FLINK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15224354#comment-15224354 ]

Tian, Li commented on FLINK-3655:
---

I think we may need to use a list of paths instead of a single Path in FileInputFormat. In this way, we should also:
1. modify current implementations to support multiple input paths
2. add functions like setFilePaths and getFilePaths to FileInputFormat, and support a comma-separated Path string in ExecutionEnvironment
3. for backward compatibility, let FileInputFormat.setFilePath set the inputPaths to a one-element list
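The proposal above could look roughly like the following sketch. It is an illustration under assumptions, not the change that was eventually merged: `MultiPathFormatSketch` is a hypothetical class, it stores plain strings rather than Flink's Path objects, and only the method names setFilePath, setFilePaths, and getFilePaths are taken from the comment.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical input format that holds a list of paths instead of a single path. */
final class MultiPathFormatSketch {
    private List<String> filePaths = Collections.emptyList();

    /** Single-path setter kept for backward compatibility: stores a one-element list. */
    void setFilePath(String filePath) {
        setFilePaths(filePath);
    }

    /** New setter accepting one or more paths. */
    void setFilePaths(String... paths) {
        if (paths.length == 0) {
            throw new IllegalArgumentException("At least one path is required");
        }
        List<String> copy = new ArrayList<>();
        for (String path : paths) {
            if (path == null || path.isEmpty()) {
                throw new IllegalArgumentException("Paths must not be null or empty");
            }
            copy.add(path);
        }
        // Store an unmodifiable copy so callers cannot change the list afterwards.
        filePaths = Collections.unmodifiableList(copy);
    }

    /** Returns all configured paths; empty until one of the setters has been called. */
    List<String> getFilePaths() {
        return filePaths;
    }
}
```

Existing callers that only know setFilePath keep working, while code that generates splits can always iterate over getFilePaths.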
[jira] [Commented] (FLINK-3655) Allow comma-separated or multiple directories to be specified for FileInputFormat
[ https://issues.apache.org/jira/browse/FLINK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15223797#comment-15223797 ]

Maximilian Michels commented on FLINK-3655:
---

Hi! Great. Feel free to open a PR. The PR should include tests. Also, could you briefly describe how you want to integrate the feature into the existing code?
[jira] [Commented] (FLINK-3655) Allow comma-separated or multiple directories to be specified for FileInputFormat
[ https://issues.apache.org/jira/browse/FLINK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15222930#comment-15222930 ]

Tian Li commented on FLINK-3655:
---

Hi. I would like to contribute to this issue. Thanks.
[jira] [Commented] (FLINK-3655) Allow comma-separated or multiple directories to be specified for FileInputFormat
[ https://issues.apache.org/jira/browse/FLINK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210081#comment-15210081 ]

Robert Metzger commented on FLINK-3655:
---

Thank you for opening a JIRA for this feature request. I think it's a good idea and it shouldn't be too difficult to implement.