Hi Gwenhael,

That is not possible right now. As a workaround, you could create three DataSets, each reading recursively from one directory, and union them later. Alternatively, moving/linking the directories to a different location would also work.
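A rough sketch of that workaround (class and method names below are made up for illustration, not from the thread): the last N hourly directories can be computed with plain java.time, and each resulting path would then back one DataSet that is unioned with the others.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

public class LastHoursPaths {
    // Builds the hourly directory paths for the N most recent hours,
    // counting back from the given reference hour. The /data/yyyy_MM_dd_HH
    // layout mirrors the tree described later in the thread.
    static List<String> lastHours(LocalDateTime referenceHour, int n) {
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyy_MM_dd_HH");
        List<String> paths = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            paths.add("/data/" + referenceHour.minusHours(i).format(fmt));
        }
        return paths;
    }

    public static void main(String[] args) {
        // Reference point taken from the example in the thread: 2016-03-21, 13:00.
        for (String p : lastHours(LocalDateTime.of(2016, 3, 21, 13, 0), 3)) {
            System.out.println(p); // /data/2016_03_21_13, _12, _11
        }
        // Each path would then become one input, e.g. in a Flink program:
        //   DataSet<String> part = env.readTextFile("hdfs://" + path);
        // and the parts combined with part1.union(part2).union(part3).
    }
}
```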
I agree that it would be nice to specify a pattern of files to include/exclude. I've filed a JIRA: https://issues.apache.org/jira/browse/FLINK-3677

Cheers,
Max

On Mon, Mar 21, 2016 at 1:51 PM, Gwenhael Pasquiers <gwenhael.pasqui...@ericsson.com> wrote:
> Hi, and thanks. I'm not sure that recursive traversal is what I need.
>
> Let's say I have the following dir tree:
>
> /data/2016_03_21_13/<files>.gz
> /data/2016_03_21_12/<files>.gz
> /data/2016_03_21_11/<files>.gz
> /data/2016_03_21_10/<files>.gz
> /data/2016_03_21_09/<files>.gz
> /data/2016_03_21_08/<files>.gz
> /data/2016_03_21_07/<files>.gz
>
> I want my DataSet to include (and nothing else):
>
> /data/2016_03_21_13/*.gz
> /data/2016_03_21_12/*.gz
> /data/2016_03_21_11/*.gz
>
> And I do not want to include any of the other folders (and their files).
>
> Can I create a DataSet that would only contain those folders?
>
> -----Original Message-----
> From: Ufuk Celebi [mailto:u...@apache.org]
> Sent: Monday, March 21, 2016 13:39
> To: user@flink.apache.org
> Subject: Re: Read a given list of HDFS folder
>
> Hey Gwenhaël,
>
> See here for recursive traversal of input paths:
>
> https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/index.html#recursive-traversal-of-the-input-path-directory
>
> Regarding the phases: the best way to exchange data between batch jobs is via files. You can then execute two programs one after the other; the first one produces the files, which the second job uses as input.
>
> – Ufuk
>
> On Mon, Mar 21, 2016 at 12:14 PM, Gwenhael Pasquiers <gwenhael.pasqui...@ericsson.com> wrote:
> > Hello,
> >
> > Sorry if this has been asked already or is in the docs; I did not find the answer:
> >
> > Is there a way to read a given set of folders in Flink batch? Let's say we have one folder per hour of data, written by Flume, and we'd like to read only the N last hours (or any other pattern or arbitrary list of folders).
> >
> > And while I'm at it, I have another question:
> >
> > Let's say that in my batch task I need to sequence two "phases", and that the second phase needs the final result from the first one.
> > - Do I have to create, in the TaskManager, one ExecutionEnvironment per task and execute them one after the other?
> > - Can my TaskManagers send back some data (other than counters) to the JobManager, or do I have to use a file to store the result from phase one and use it in phase two?
> >
> > Thanks in advance for your answers,
> >
> > Gwenhaël
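For reference, the recursive-traversal option Ufuk links to is a configuration fragment along these lines (adapted from the linked Flink docs; the input path here is a placeholder). Note it reads everything under the given path, which is why it does not by itself answer the "only the last N folders" question.

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.configuration.Configuration;

public class RecursiveRead {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Input-format parameter from the Flink docs: when true, files in
        // nested directories under the input path are enumerated as well.
        Configuration parameters = new Configuration();
        parameters.setBoolean("recursive.file.enumeration", true);

        // Placeholder path; every file under it would be read.
        DataSet<String> logs = env.readTextFile("hdfs:///data")
                                  .withParameters(parameters);
    }
}
```

On the second question in the quoted mail: as Ufuk says, the usual pattern is for the first program to write its result (e.g. with `writeAsText(...)` followed by `env.execute()`), and for the second program to read that path back in as its input.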