Hi Gwenhael,

That is not possible right now. As a workaround, you could create three DataSets, each reading recursively from one directory, and union them later. Alternatively, moving/linking the directories to a different location would also work.
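A rough sketch of that workaround (class and method names below are made up for illustration, not from the thread): the last N hourly directories can be computed with plain java.time, and each resulting path would then back one DataSet that is unioned with the others.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

public class LastHoursPaths {
    // Builds the hourly directory paths for the N most recent hours,
    // counting back from the given reference hour. The /data/yyyy_MM_dd_HH
    // layout mirrors the tree described later in the thread.
    static List<String> lastHours(LocalDateTime referenceHour, int n) {
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyy_MM_dd_HH");
        List<String> paths = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            paths.add("/data/" + referenceHour.minusHours(i).format(fmt));
        }
        return paths;
    }

    public static void main(String[] args) {
        // Reference point taken from the example in the thread: 2016-03-21, 13:00.
        for (String p : lastHours(LocalDateTime.of(2016, 3, 21, 13, 0), 3)) {
            System.out.println(p); // /data/2016_03_21_13, _12, _11
        }
        // Each path would then become one input, e.g. in a Flink program:
        //   DataSet<String> part = env.readTextFile("hdfs://" + path);
        // and the parts combined with part1.union(part2).union(part3).
    }
}
```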
I agree that it would be nice to specify a pattern of files to include/exclude. I've filed a JIRA: https://issues.apache.org/jira/browse/FLINK-3677

Cheers,
Max

On Mon, Mar 21, 2016 at 1:51 PM, Gwenhael Pasquiers <gwenhael.pasqui...@ericsson.com> wrote:
> Hi, and thanks. I'm not sure that recursive traversal is what I need.
>
> Let's say I have the following dir tree:
>
> /data/2016_03_21_13/<files>.gz
> /data/2016_03_21_12/<files>.gz
> /data/2016_03_21_11/<files>.gz
> /data/2016_03_21_10/<files>.gz
> /data/2016_03_21_09/<files>.gz
> /data/2016_03_21_08/<files>.gz
> /data/2016_03_21_07/<files>.gz
>
> I want my DataSet to include (and nothing else):
>
> /data/2016_03_21_13/*.gz
> /data/2016_03_21_12/*.gz
> /data/2016_03_21_11/*.gz
>
> And I do not want to include any of the other folders (and their files).
>
> Can I create a DataSet that would only contain those folders?
>
> -----Original Message-----
> From: Ufuk Celebi [mailto:u...@apache.org]
> Sent: Monday, March 21, 2016 13:39
> To: user@flink.apache.org
> Subject: Re: Read a given list of HDFS folder
>
> Hey Gwenhaël,
>
> See here for recursive traversal of input paths:
>
> https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/index.html#recursive-traversal-of-the-input-path-directory
>
> Regarding the phases: the best way to exchange data between batch jobs is via files. You can then execute two programs one after the other; the first one produces the files, which the second job uses as input.
>
> – Ufuk
>
> On Mon, Mar 21, 2016 at 12:14 PM, Gwenhael Pasquiers <gwenhael.pasqui...@ericsson.com> wrote:
> > Hello,
> >
> > Sorry if this has been asked already or is in the docs; I did not find the answer:
> >
> > Is there a way to read a given set of folders in Flink batch? Let's say we have one folder per hour of data, written by Flume, and we'd like to read only the N last hours (or any other pattern or arbitrary list of folders).
> >
> > And while I'm at it, I have another question:
> >
> > Let's say that in my batch task I need to sequence two "phases", and that the second phase needs the final result from the first one.
> > - Do I have to create, in the TaskManager, one ExecutionEnvironment per task and execute them one after the other?
> > - Can my TaskManagers send back some data (other than counters) to the JobManager, or do I have to use a file to store the result from phase one and use it in phase two?
> >
> > Thanks in advance for your answers,
> >
> > Gwenhaël
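For reference, the recursive-traversal option Ufuk links to is a configuration fragment along these lines (adapted from the linked Flink docs; the input path here is a placeholder). Note it reads everything under the given path, which is why it does not by itself answer the "only the last N folders" question.

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.configuration.Configuration;

public class RecursiveRead {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Input-format parameter from the Flink docs: when true, files in
        // nested directories under the input path are enumerated as well.
        Configuration parameters = new Configuration();
        parameters.setBoolean("recursive.file.enumeration", true);

        // Placeholder path; every file under it would be read.
        DataSet<String> logs = env.readTextFile("hdfs:///data")
                                  .withParameters(parameters);
    }
}
```

On the second question in the quoted mail: as Ufuk says, the usual pattern is for the first program to write its result (e.g. with `writeAsText(...)` followed by `env.execute()`), and for the second program to read that path back in as its input.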