[jira] [Commented] (FLINK-3655) Allow comma-separated or multiple directories to be specified for FileInputFormat

2018-04-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439934#comment-16439934
 ] 

ASF GitHub Bot commented on FLINK-3655:
---

Github user asfgit closed the pull request at:

https://github.com/apache/flink/pull/1990


> Allow comma-separated or multiple directories to be specified for 
> FileInputFormat
> -
>
> Key: FLINK-3655
> URL: https://issues.apache.org/jira/browse/FLINK-3655
> Project: Flink
>  Issue Type: Improvement
>  Components: Core
>Affects Versions: 1.0.0
>Reporter: Gna Phetsarath
>Assignee: Fabian Hueske
>Priority: Major
>  Labels: starter
> Fix For: 1.5.0
>
>
> Allow comma-separated or multiple directories to be specified for 
> FileInputFormat so that a DataSource will process the directories 
> sequentially.
>
> env.readFile("/data/2016/01/01/*/*,/data/2016/01/02/*/*,/data/2016/01/03/*/*")
> in Scala
>env.readFile(paths: Seq[String])
> or 
>   env.readFile(path: String, otherPaths: String*)
> Wildcard support would be a bonus.
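
The comma-separated form requested above could be handled by splitting the string into individual paths before they reach the input format. The following is a minimal sketch under stated assumptions, not Flink code: `PathListSketch` and `parsePaths` are hypothetical names, wildcards are passed through untouched, and paths that themselves contain commas are not supported.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Flink): splits a comma-separated path list. */
final class PathListSketch {

    private PathListSketch() {}

    /** Splits on commas, trims surrounding whitespace, and skips empty entries. */
    static List<String> parsePaths(String commaSeparated) {
        List<String> paths = new ArrayList<>();
        for (String part : commaSeparated.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                paths.add(trimmed);
            }
        }
        return paths;
    }

    public static void main(String[] args) {
        // Wildcards are kept verbatim; expanding them is out of scope for this sketch.
        String input = "/data/2016/01/01/*/*,/data/2016/01/02/*/*,/data/2016/01/03/*/*";
        for (String path : parsePaths(input)) {
            System.out.println(path);
        }
    }
}
```

The varargs variant proposed for Scala avoids the parsing step entirely, which sidesteps the question of how to escape commas inside a path.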



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-3655) Allow comma-separated or multiple directories to be specified for FileInputFormat

2018-02-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16366868#comment-16366868
 ] 

ASF GitHub Bot commented on FLINK-3655:
---

Github user fhueske commented on the issue:

https://github.com/apache/flink/pull/1990
  
Hi @gna-phetsarath, I rebased and merged your commit as part of PR #5415.
Could you please close this PR?

Thank you, Fabian




[jira] [Commented] (FLINK-3655) Allow comma-separated or multiple directories to be specified for FileInputFormat

2018-02-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16366861#comment-16366861
 ] 

ASF GitHub Bot commented on FLINK-3655:
---

Github user fhueske closed the pull request at:

https://github.com/apache/flink/pull/5415




[jira] [Commented] (FLINK-3655) Allow comma-separated or multiple directories to be specified for FileInputFormat

2018-02-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16366685#comment-16366685
 ] 

ASF GitHub Bot commented on FLINK-3655:
---

Github user fhueske commented on the issue:

https://github.com/apache/flink/pull/5415
  
Will merge this.




[jira] [Commented] (FLINK-3655) Allow comma-separated or multiple directories to be specified for FileInputFormat

2018-02-14 Thread Fabian Hueske (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16363950#comment-16363950
 ] 

Fabian Hueske commented on FLINK-3655:
--

Hi [~sjwiesman],

I've reworked the PR and opened a new one: 
https://github.com/apache/flink/pull/5415

Unfortunately, we cannot magically enable this feature for all input formats 
that are based on {{FileInputFormat}}, because it is a {{@Public}} interface.
With the changes that I proposed in the PR, we enable multipath support for the 
CsvInputFormats, AvroInputFormat, OrcRowInputFormat, and TextInputFormat. All 
other classes would have to override the {{supportsMultiPaths()}} method.

Can you check if the changes in the PR would address your use case?
It would be great if you could provide feedback soon because the feature freeze 
for Flink 1.5.0 will happen in a few days.

Thank you, Fabian
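
The opt-in approach described in this comment can be illustrated with a small self-contained sketch. The classes below are stand-ins written for this example and are not Flink's `FileInputFormat`; only the method names `supportsMultiPaths()` and `setFilePaths(...)` come from the PR discussion. The default of `false` and the exception thrown for unsupported formats are assumptions made for illustration.

```java
import java.util.Arrays;

/** Stand-in for a file-based input format base class; NOT Flink's FileInputFormat. */
abstract class FileFormatSketch {
    private String[] filePaths = new String[0];

    /** Opt-in flag. Assumed to default to false so existing subclasses keep single-path behavior. */
    public boolean supportsMultiPaths() {
        return false;
    }

    /** Accepts several paths only if the concrete format has opted in. */
    public void setFilePaths(String... paths) {
        if (paths.length == 0) {
            throw new IllegalArgumentException("At least one path is required.");
        }
        if (paths.length > 1 && !supportsMultiPaths()) {
            throw new UnsupportedOperationException(
                getClass().getSimpleName() + " does not support multiple paths.");
        }
        this.filePaths = paths.clone();
    }

    public String[] getFilePaths() {
        return filePaths.clone();
    }
}

/** A format written before the feature existed: it inherits the default and stays single-path. */
final class LegacyFormatSketch extends FileFormatSketch {}

/** A format that opts in by overriding the flag. */
final class MultiPathFormatSketch extends FileFormatSketch {
    @Override
    public boolean supportsMultiPaths() {
        return true;
    }
}

final class OptInDemo {
    public static void main(String[] args) {
        MultiPathFormatSketch multi = new MultiPathFormatSketch();
        multi.setFilePaths("/data/2016/01/01", "/data/2016/01/02");
        System.out.println(Arrays.toString(multi.getFilePaths()));

        try {
            new LegacyFormatSketch().setFilePaths("/data/2016/01/01", "/data/2016/01/02");
        } catch (UnsupportedOperationException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Keeping the flag off by default means a subclass that reads a single file path directly cannot be handed several paths by accident, which is one plausible reading of the compatibility concern raised here.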



[jira] [Commented] (FLINK-3655) Allow comma-separated or multiple directories to be specified for FileInputFormat

2018-02-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16363907#comment-16363907
 ] 

ASF GitHub Bot commented on FLINK-3655:
---

Github user fhueske commented on the issue:

https://github.com/apache/flink/pull/5415
  
Thanks for the review @zentol. 

I've addressed your feedback, improved the backwards compatibility as 
discussed offline, and added multi-path support to additional input formats.




[jira] [Commented] (FLINK-3655) Allow comma-separated or multiple directories to be specified for FileInputFormat

2018-02-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362276#comment-16362276
 ] 

ASF GitHub Bot commented on FLINK-3655:
---

Github user zentol commented on a diff in the pull request:

https://github.com/apache/flink/pull/5415#discussion_r167849176
  
--- Diff: flink-core/src/test/java/org/apache/flink/api/common/io/BinaryInputFormatTest.java ---
@@ -40,8 +41,12 @@
    protected Record deserialize(Record record, DataInputView dataInput) {
        return record;
    }
-   }
 
+   @Override
+   public boolean supportsMultiPaths() {
--- End diff --

So this is what users would have to do to enable this feature?




[jira] [Commented] (FLINK-3655) Allow comma-separated or multiple directories to be specified for FileInputFormat

2018-02-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362278#comment-16362278
 ] 

ASF GitHub Bot commented on FLINK-3655:
---

Github user zentol commented on a diff in the pull request:

https://github.com/apache/flink/pull/5415#discussion_r167850774
  
--- Diff: flink-core/src/test/java/org/apache/flink/api/common/io/DelimitedInputFormatTest.java ---
@@ -428,6 +431,78 @@ public void testDelimiterOnBufferBoundary() throws IOException {
        format.close();
    }
 
+   // -- Statistics --//
+
+   @Test
+   public void testGetStatistics() throws IOException {
+       final String myString = "my mocked line 1\nmy mocked line 2\n";
+       final long size = myString.length();
+       final Path filePath = createTempFilePath(myString);
+
+       final String myString2 = "my mocked line 1\nmy mocked line 2\nanother mocked line3\n";
+       final long size2 = myString2.length();
+       final Path filePath2 = createTempFilePath(myString2);
+
+       final long totalSize = size + size2;
+
+       DelimitedInputFormat format = new MyTextInputFormat();
+       format.setFilePaths(filePath.toUri().toString(), filePath2.toUri().toString());
+
+       FileInputFormat.FileBaseStatistics stats = format.getStatistics(null);
+       assertNotNull(stats);
+       assertEquals("The file size from the statistics is wrong.", totalSize, stats.getTotalInputSize());
+   }
+
+   @Test
+   public void testGetStatisticsFileDoesNotExist() throws IOException {
+       DelimitedInputFormat format = new MyTextInputFormat();
+       format.setFilePaths("file:///path/does/not/really/exist", "file:///another/path/that/does/not/exist");
+
+       FileBaseStatistics stats = format.getStatistics(null);
+       assertNull("The file statistics should be null.", stats);
+   }
+
+   @Test
+   public void testGetStatisticsSingleFileWithCachedVersion() throws IOException {
+       final String myString = "my mocked line 1\nmy mocked line 2\n";
+       final Path tempFile = createTempFilePath(myString);
+       final long size = myString.length();
+       final long cachedSize = 10065;
+
+       DelimitedInputFormat format = new MyTextInputFormat();
+       format.setFilePath(tempFile);
+       format.configure(new Configuration());
+
+       FileBaseStatistics stats = format.getStatistics(null);
+       assertNotNull(stats);
+       assertEquals("The file size from the statistics is wrong.", size, stats.getTotalInputSize());
+
+       format = new MyTextInputFormat();
+       format.setFilePath(tempFile);
+       format.configure(new Configuration());
+
+       FileBaseStatistics newStats = format.getStatistics(stats);
+       assertEquals("Statistics object was changed.", newStats, stats);
+
+       // insert fake stats with the correct modification time. the call should return the fake stats
+       format = new MyTextInputFormat();
+       format.setFilePath(tempFile);
+       format.configure(new Configuration());
+
+       FileBaseStatistics fakeStats = new FileBaseStatistics(stats.getLastModificationTime(), cachedSize, BaseStatistics.AVG_RECORD_BYTES_UNKNOWN);
+       BaseStatistics latest = format.getStatistics(fakeStats);
+       assertEquals("The file size from the statistics is wrong.", cachedSize, latest.getTotalInputSize());
+
+       // insert fake stats with the expired modification time. the call should return new accurate stats
+       format = new MyTextInputFormat();
+       format.setFilePath(tempFile);
+       format.configure(new Configuration());
+
+       FileBaseStatistics outDatedFakeStats = new FileBaseStatistics(stats.getLastModificationTime()-1, cachedSize, BaseStatistics.AVG_RECORD_BYTES_UNKNOWN);
--- End diff --

missing spaces around `-`



[jira] [Commented] (FLINK-3655) Allow comma-separated or multiple directories to be specified for FileInputFormat

2018-02-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362277#comment-16362277
 ] 

ASF GitHub Bot commented on FLINK-3655:
---

Github user zentol commented on a diff in the pull request:

https://github.com/apache/flink/pull/5415#discussion_r167850694
  
--- Diff: flink-core/src/test/java/org/apache/flink/api/common/io/DelimitedInputFormatTest.java ---
@@ -428,6 +431,78 @@ public void testDelimiterOnBufferBoundary() throws IOException {
        format.close();
    }
 
+   // -- Statistics --//
+
+   @Test
+   public void testGetStatistics() throws IOException {
+       final String myString = "my mocked line 1\nmy mocked line 2\n";
+       final long size = myString.length();
+       final Path filePath = createTempFilePath(myString);
+
+       final String myString2 = "my mocked line 1\nmy mocked line 2\nanother mocked line3\n";
+       final long size2 = myString2.length();
+       final Path filePath2 = createTempFilePath(myString2);
+
+       final long totalSize = size + size2;
+
+       DelimitedInputFormat format = new MyTextInputFormat();
+       format.setFilePaths(filePath.toUri().toString(), filePath2.toUri().toString());
+
+       FileInputFormat.FileBaseStatistics stats = format.getStatistics(null);
+       assertNotNull(stats);
+       assertEquals("The file size from the statistics is wrong.", totalSize, stats.getTotalInputSize());
+   }
+
+   @Test
+   public void testGetStatisticsFileDoesNotExist() throws IOException {
+       DelimitedInputFormat format = new MyTextInputFormat();
+       format.setFilePaths("file:///path/does/not/really/exist", "file:///another/path/that/does/not/exist");
+
+       FileBaseStatistics stats = format.getStatistics(null);
+       assertNull("The file statistics should be null.", stats);
+   }
+
+   @Test
+   public void testGetStatisticsSingleFileWithCachedVersion() throws IOException {
+       final String myString = "my mocked line 1\nmy mocked line 2\n";
+       final Path tempFile = createTempFilePath(myString);
+       final long size = myString.length();
+       final long cachedSize = 10065;
--- End diff --

can we rename this to `fakeSize`?




[jira] [Commented] (FLINK-3655) Allow comma-separated or multiple directories to be specified for FileInputFormat

2018-02-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362279#comment-16362279
 ] 

ASF GitHub Bot commented on FLINK-3655:
---

Github user zentol commented on a diff in the pull request:

https://github.com/apache/flink/pull/5415#discussion_r167853159
  
--- Diff: flink-core/src/test/java/org/apache/flink/api/common/io/FileInputFormatTest.java ---
@@ -203,7 +390,104 @@ public void testGetStatisticsMultipleFilesWithCachedVersion() {
            Assert.fail(ex.getMessage());
        }
    }
-
+
+   // -- Multiple Files -- //
+
+   @Test
+   public void testGetStatisticsMultipleNonExistingFile() throws IOException {
+       final MultiDummyFileInputFormat format = new MultiDummyFileInputFormat();
+       format.setFilePaths("file:///some/none/existing/directory/","file:///another/non/existing/directory/");
+       format.configure(new Configuration());
+
+       BaseStatistics stats = format.getStatistics(null);
+       Assert.assertNull("The file statistics should be null.", stats);
+   }
+
+   @Test
+   public void testGetStatisticsMultipleOneFileNoCachedVersion() throws IOException {
+       final long size1 = 1024 * 500;
+       String tempFile = TestFileUtils.createTempFile(size1);
+
+       final long size2 = 1024 * 505;
+       String tempFile2 = TestFileUtils.createTempFile(size2);
+
+       final long totalSize = size1 + size2;
+
+       final MultiDummyFileInputFormat format = new MultiDummyFileInputFormat();
+       format.setFilePaths(tempFile, tempFile2);
+       format.configure(new Configuration());
+
+       BaseStatistics stats = format.getStatistics(null);
+       Assert.assertEquals("The file size from the statistics is wrong.", totalSize, stats.getTotalInputSize());
+   }
+
+   @Test
+   public void testGetStatisticsMultipleFilesMultiplePathsNoCachedVersion() throws IOException {
+       final long size1 = 2077;
+       final long size2 = 31909;
+       final long size3 = 10;
+       final long totalSize123 = size1 + size2 + size3;
+
+       String tempDir = TestFileUtils.createTempFileDir(size1, size2, size3);
+
+       final long size4 = 2051;
+       final long size5 = 31902;
+       final long size6 = 15;
+       final long totalSize456 = size4 + size5 + size6;
+       String tempDir2 = TestFileUtils.createTempFileDir(size4, size5, size6);
+
+       final MultiDummyFileInputFormat format = new MultiDummyFileInputFormat();
+       format.setFilePaths(tempDir, tempDir2);
+       format.configure(new Configuration());
+
+       BaseStatistics stats = format.getStatistics(null);
+       Assert.assertEquals("The file size from the statistics is wrong.", totalSize123 + totalSize456, stats.getTotalInputSize());
+   }
+
+   @Test
+   public void testGetStatisticsMultipleOneFileWithCachedVersion() throws IOException {
+       final long size1 = 50873;
+       final long cachedSize = 10065;
--- End diff --

rename to `fakeSize`




[jira] [Commented] (FLINK-3655) Allow comma-separated or multiple directories to be specified for FileInputFormat

2018-02-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362064#comment-16362064
 ] 

ASF GitHub Bot commented on FLINK-3655:
---

Github user fhueske commented on the issue:

https://github.com/apache/flink/pull/5415
  
Thanks for the question @zentol. That might be a cleaner solution, but I 
don't think we could move much into the shared super class. Everything that's 
not `private` would need to be duplicated to maintain binary compatibility. 
However, most methods are `protected` or `private`. 




[jira] [Commented] (FLINK-3655) Allow comma-separated or multiple directories to be specified for FileInputFormat

2018-02-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16360838#comment-16360838
 ] 

ASF GitHub Bot commented on FLINK-3655:
---

Github user zentol commented on the issue:

https://github.com/apache/flink/pull/5415
  
What would speak against creating a new FileInputFormat that supports 
multiple paths instead? Common code could be moved into a shared super class (I 
_think_ that would be allowed).




[jira] [Commented] (FLINK-3655) Allow comma-separated or multiple directories to be specified for FileInputFormat

2017-12-29 Thread Fabian Hueske (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16306527#comment-16306527
 ] 

Fabian Hueske commented on FLINK-3655:
--

Hi [~sjwiesman], sorry for the delay. I just had a look at the PR. 
The changes look mostly good but it breaks the public API in some places. 
IMO, it could go into Flink 1.5.0 after some adjustments.

Because it has been more than 1.5 years since the PR was opened, I'll rework it 
myself and open a new PR with my changes on top of the initial PR.
It would be great if you could give it a try once I've opened my PR.

Thanks, Fabian

> Allow comma-separated or multiple directories to be specified for 
> FileInputFormat
> -
>
> Key: FLINK-3655
> URL: https://issues.apache.org/jira/browse/FLINK-3655
> Project: Flink
>  Issue Type: Improvement
>  Components: Core
>Affects Versions: 1.0.0
>Reporter: Gna Phetsarath
>Priority: Minor
>  Labels: starter
> Fix For: 1.5.0
>
>
> Allow comma-separated or multiple directories to be specified for 
> FileInputFormat so that a DataSource will process the directories 
> sequentially.
>
> env.readFile("/data/2016/01/01/*/*,/data/2016/01/02/*/*,/data/2016/01/03/*/*")
> in Scala
>env.readFile(paths: Seq[String])
> or 
>   env.readFile(path: String, otherPaths: String*)
> Wildcard support would be a bonus.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-3655) Allow comma-separated or multiple directories to be specified for FileInputFormat

2017-12-28 Thread Seth Wiesman (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16305745#comment-16305745
 ] 

Seth Wiesman commented on FLINK-3655:
-

[~fhueske] I have been looking for this feature. Is there anything I can do to 
help get the PR merged?

> Allow comma-separated or multiple directories to be specified for 
> FileInputFormat
> -
>
> Key: FLINK-3655
> URL: https://issues.apache.org/jira/browse/FLINK-3655
> Project: Flink
>  Issue Type: Improvement
>  Components: Core
>Affects Versions: 1.0.0
>Reporter: Gna Phetsarath
>Priority: Minor
>  Labels: starter
>
> Allow comma-separated or multiple directories to be specified for 
> FileInputFormat so that a DataSource will process the directories 
> sequentially.
>
> env.readFile("/data/2016/01/01/*/*,/data/2016/01/02/*/*,/data/2016/01/03/*/*")
> in Scala
>env.readFile(paths: Seq[String])
> or 
>   env.readFile(path: String, otherPaths: String*)
> Wildcard support would be a bonus.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-3655) Allow comma-separated or multiple directories to be specified for FileInputFormat

2017-10-08 Thread Fabian Hueske (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16196030#comment-16196030
 ] 

Fabian Hueske commented on FLINK-3655:
--

I don't think there's a particular reason why the PR hasn't been merged.
Nobody picked it up and it disappeared from the radar in the list of stale PRs 
:-/

I'll try to have a look in the next few days.



[jira] [Commented] (FLINK-3655) Allow comma-separated or multiple directories to be specified for FileInputFormat

2017-10-06 Thread Vishnu Viswanath (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16195159#comment-16195159
 ] 

Vishnu Viswanath commented on FLINK-3655:
-

I was looking for this feature. Why wasn't this ever merged?



[jira] [Commented] (FLINK-3655) Allow comma-separated or multiple directories to be specified for FileInputFormat

2016-05-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15302164#comment-15302164
 ] 

ASF GitHub Bot commented on FLINK-3655:
---

Github user gna-phetsarath commented on the pull request:

https://github.com/apache/flink/pull/1990#issuecomment-221890183
  
You are correct. The majority of the changes were in the "generate splits" 
and "statistics" methods, including changes to subclasses that used 
the file path directly. The change is not as extensive as it appears.

Also, additional tests were added.


> Allow comma-separated or multiple directories to be specified for 
> FileInputFormat
> -
>
> Key: FLINK-3655
> URL: https://issues.apache.org/jira/browse/FLINK-3655
> Project: Flink
>  Issue Type: Improvement
>  Components: Core
>Affects Versions: 1.0.0
>Reporter: Gna Phetsarath
>Priority: Minor
>  Labels: starter
>
> Allow comma-separated or multiple directories to be specified for 
> FileInputFormat so that a DataSource will process the directories 
> sequentially.
>
> env.readFile("/data/2016/01/01/*/*,/data/2016/01/02/*/*,/data/2016/01/03/*/*")
> in Scala
>env.readFile(paths: Seq[String])
> or 
>   env.readFile(path: String, otherPaths: String*)
> Wildcard support would be a bonus.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3655) Allow comma-separated or multiple directories to be specified for FileInputFormat

2016-05-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15302050#comment-15302050
 ] 

ASF GitHub Bot commented on FLINK-3655:
---

Github user StephanEwen commented on the pull request:

https://github.com/apache/flink/pull/1990#issuecomment-221866397
  
Thanks for opening that contribution.

Can you sum up the changes you made? That would make the review easier.
The changes look quite extensive. My gut feeling would be that it should 
not require so many changes, ideally only an additional loop in the "generate 
splits" method, and possibly in the "statistics" method.
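
The "additional loop" suggested here for the statistics method can be sketched as follows. This is an illustration under stated assumptions and not the code from the pull request: `MultiPathStatsSketch`, `totalInputSize`, and `tempFileOfSize` are hypothetical names, the sketch uses `java.nio.file` instead of Flink's file system abstraction, it counts only files directly inside a directory, and it returns -1 for a missing path where the tests in the later PR expect `null` statistics.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Stream;

/** Hypothetical sketch (not Flink code): sums the input size over several paths. */
final class MultiPathStatsSketch {

    private MultiPathStatsSketch() {}

    /** Returns the total size in bytes over all paths, or -1 if any path does not exist. */
    static long totalInputSize(List<Path> paths) {
        long total = 0L;
        for (Path path : paths) { // the additional loop over all configured paths
            if (!Files.exists(path)) {
                return -1L;
            }
            total += sizeOf(path);
        }
        return total;
    }

    /** Size of a regular file, or the summed size of the regular files directly inside a directory. */
    private static long sizeOf(Path path) {
        try {
            if (Files.isRegularFile(path)) {
                return Files.size(path);
            }
            long sum = 0L;
            try (Stream<Path> children = Files.list(path)) {
                for (Path child : (Iterable<Path>) children::iterator) {
                    if (Files.isRegularFile(child)) {
                        sum += Files.size(child);
                    }
                }
            }
            return sum;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Test helper: creates a temporary file holding the given number of zero bytes. */
    static Path tempFileOfSize(int bytes) {
        try {
            Path file = Files.createTempFile("multi-path-stats", ".bin");
            Files.write(file, new byte[bytes]);
            file.toFile().deleteOnExit();
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path first = tempFileOfSize(1024);
        Path second = tempFileOfSize(2048);
        System.out.println(totalInputSize(Arrays.asList(first, second)));
    }
}
```

The same loop shape would apply to split generation: iterate over the configured paths and collect the splits each one produces.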




[jira] [Commented] (FLINK-3655) Allow comma-separated or multiple directories to be specified for FileInputFormat

2016-05-18 Thread Gna Phetsarath (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15289131#comment-15289131
 ] 

Gna Phetsarath commented on FLINK-3655:
---

There's a pull request for this: https://github.com/apache/flink/pull/1990



[jira] [Commented] (FLINK-3655) Allow comma-separated or multiple directories to be specified for FileInputFormat

2016-05-04 Thread Gna Phetsarath (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15271314#comment-15271314
 ] 

Gna Phetsarath commented on FLINK-3655:
---

What's the progress on this ticket, [~tianli]?



[jira] [Commented] (FLINK-3655) Allow comma-separated or multiple directories to be specified for FileInputFormat

2016-04-05 Thread Tian, Li (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15227573#comment-15227573
 ] 

Tian, Li commented on FLINK-3655:
-

Thanks, I will do the path list first and use "readFile(FileInputFormat 
inputFormat, String... filePaths)". 



[jira] [Commented] (FLINK-3655) Allow comma-separated or multiple directories to be specified for FileInputFormat

2016-04-05 Thread Tian, Li (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15227545#comment-15227545
 ] 

Tian, Li commented on FLINK-3655:
-

Will support wildcards



[jira] [Commented] (FLINK-3655) Allow comma-separated or multiple directories to be specified for FileInputFormat

2016-04-05 Thread Gna Phetsarath (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15226676#comment-15226676
 ] 

Gna Phetsarath commented on FLINK-3655:
---

Will we be doing wildcards as well, or should we put that in another ticket?




[jira] [Commented] (FLINK-3655) Allow comma-separated or multiple directories to be specified for FileInputFormat

2016-04-05 Thread Maximilian Michels (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15226474#comment-15226474
 ] 

Maximilian Michels commented on FLINK-3655:
---

Sounds good. It is important to maintain backwards compatibility. 

I'm not sure about the "comma-separated Path string". File names may contain 
commas. So we might skip that for now and do the path list first.

I think we could also use {{readFile(FileInputFormat inputFormat, String... 
filePaths)}}, which makes the file paths available as a {{String[] filePaths}} array. 
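
To illustrate the concern about commas, here is a small sketch contrasting the two ways of passing several paths. The class PathArgsSketch and both method names are hypothetical and not part of Flink; the point is only that varargs need no parsing, while a naive split on "," cannot tell a separator from a comma inside a file name.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not a Flink API.
final class PathArgsSketch {

    // Varargs: the compiler packs the arguments into a String[], so no parsing is needed
    // and a comma inside a file name is preserved.
    static List<String> fromVarargs(String... filePaths) {
        return Arrays.asList(filePaths);
    }

    // Comma-separated string: a naive split treats every comma as a separator,
    // including commas that belong to a file name.
    static List<String> fromCommaSeparated(String filePaths) {
        return Arrays.asList(filePaths.split(","));
    }
}
```

With the two paths "/data/a,b.csv" and "/data/c.csv", the varargs variant yields two entries, whereas splitting the joined string "/data/a,b.csv,/data/c.csv" yields three.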



[jira] [Commented] (FLINK-3655) Allow comma-separated or multiple directories to be specified for FileInputFormat

2016-04-04 Thread Tian, Li (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15224354#comment-15224354
 ] 

Tian, Li commented on FLINK-3655:
-

I think we may need to use a List instead of a single Path in 
FileInputFormat.
To do that, we should also:
1. modify the current implementations to support multiple input paths
2. add functions like setFilePaths and getFilePaths to FileInputFormat, and 
support a comma-separated Path string in ExecutionEnvironment
3. for backward compatibility, let FileInputFormat.setFilePath set the 
inputPaths to a one-element list 
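
The three points above might look roughly like this. It is a minimal sketch under assumptions: MultiPathFormatSketch is a hypothetical stand-in for FileInputFormat, paths are plain Strings instead of Flink's Path type to keep the example self-contained, and the exact signatures are guesses rather than what was eventually merged.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for FileInputFormat; signatures are assumptions.
class MultiPathFormatSketch {

    // A list replaces the former single path field.
    private List<String> inputPaths = new ArrayList<>();

    // New setter accepting any number of paths.
    void setFilePaths(String... filePaths) {
        if (filePaths.length == 0) {
            throw new IllegalArgumentException("At least one file path is required.");
        }
        inputPaths = new ArrayList<>(Arrays.asList(filePaths));
    }

    // New getter; returns a read-only view so callers cannot change the configuration.
    List<String> getFilePaths() {
        return Collections.unmodifiableList(inputPaths);
    }

    // Existing single-path setter kept for backward compatibility:
    // it stores the path as a one-element list.
    void setFilePath(String filePath) {
        inputPaths = new ArrayList<>(Collections.singletonList(filePath));
    }
}
```

Existing callers of setFilePath keep working unchanged, while code that reads the paths only ever has to handle the list form.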



[jira] [Commented] (FLINK-3655) Allow comma-separated or multiple directories to be specified for FileInputFormat

2016-04-04 Thread Maximilian Michels (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15223797#comment-15223797
 ] 

Maximilian Michels commented on FLINK-3655:
---

Hi! Great. Feel free to open a PR. The PR should include tests. Also, could you 
briefly describe how you want to integrate the feature into the existing code?



[jira] [Commented] (FLINK-3655) Allow comma-separated or multiple directories to be specified for FileInputFormat

2016-04-02 Thread Tian Li (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15222930#comment-15222930
 ] 

Tian Li commented on FLINK-3655:


Hi. I would like to contribute to this issue. Thanks.



[jira] [Commented] (FLINK-3655) Allow comma-separated or multiple directories to be specified for FileInputFormat

2016-03-24 Thread Robert Metzger (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210081#comment-15210081
 ] 

Robert Metzger commented on FLINK-3655:
---

Thank you for opening a JIRA for this feature request.
I think it's a good idea and it shouldn't be too difficult to implement.
