Absolutely correct and a good idea! Why not call it a "digester", taking the term from chemistry/medicine:
Chemistry: A vessel ("pipeline") in which substances are softened or decomposed, usually for further processing.

All the best!

___________________________________
Dr.-Ing. Plamen L. Simeonov
Department 1: Geodäsie und Fernerkundung
Sektion 1.5: Geoinformatik
Tel.: +49 (0)331/288-1587
Fax: +49 (0)331/288-1732
email: [email protected]
http://www.gfz-potsdam.de/
___________________________________
Helmholtz-Zentrum Potsdam
Deutsches GeoForschungsZentrum - GFZ
Stiftung des öff. Rechts Land Brandenburg
Telegrafenberg A 20, 14473 Potsdam
**************************************************

> On 21 Jan 2015, at 20:21, Stephan Ewen <[email protected]> wrote:
>
> There is a common misunderstanding between the "compile" phase of the
> Java/Scala compiler (which does not generate the Flink plan) and the Flink
> "compile/optimize" phase (which happens when calling env.execute()).
>
> The Flink compile/optimize phase is not a compile phase in the sense that
> source code is parsed and translated to byte code. It is only a set of
> transformations on the program's data flow.
>
> We should probably stop calling the Flink phase "compile" and simply call it
> "pre-flight", "optimize", or "prepare". Otherwise, it creates frequent
> confusion...
>
> On Wed, Jan 21, 2015 at 6:05 AM, Flavio Pompermaier <[email protected]> wrote:
>
> Thanks Fabian, that makes a lot of sense :)
>
> Best,
> Flavio
>
> On Wed, Jan 21, 2015 at 2:41 PM, Fabian Hueske <[email protected]> wrote:
>
> The program is compiled when the ExecutionEnvironment.execute() method is
> called. At that moment, the ExecutionEnvironment collects all data sources
> that were previously created and traverses them towards connected data sinks.
> All sinks that are found this way are remembered and treated as execution
> targets. The sinks and all connected operators and data sources are given to
> the optimizer, which analyses the plan, compiles an execution plan, and
> submits the plan to the execution system which the ExecutionEnvironment
> refers to (local, remote, ...).
>
> Therefore, your code can build arbitrary data flows with as many sources as
> you like. Once you call ExecutionEnvironment.execute(), all data sources and
> operators which are required to compute the result of all data sinks are
> executed.
>
> 2015-01-21 14:26 GMT+01:00 Flavio Pompermaier <[email protected]>:
>
> Great! Could you explain to me a little bit the internals of how and when Flink
> generates the plan and how the execution environment is involved in this
> phase? Just to better understand this step!
>
> Thanks again,
> Flavio
>
> On Wed, Jan 21, 2015 at 2:14 PM, Till Rohrmann <[email protected]> wrote:
>
> Yes, this will also work. You only have to make sure that the list of data
> sets is processed properly later on in your code.
>
> On Wed, Jan 21, 2015 at 2:09 PM, Flavio Pompermaier <[email protected]> wrote:
>
> Hi Till,
> thanks for the reply. However, my problem is that I'll have something like:
>
> List<DataSet<ElementType>> getInput(String[] args, ExecutionEnvironment env) { ... }
>
> So I don't know in advance how many of them I'll have at runtime. Does it
> still work?
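A minimal sketch of the pattern Till confirms above: build a list of data sets whose size is only known at runtime, then attach sinks to them before calling execute(). The getInputs helper, the output path, and the union step are invented for illustration and are not from the thread:

import java.util.ArrayList;
import java.util.List;

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class RuntimeSourcesJob {

    // Hypothetical helper: one DataSet per input path passed on the command line.
    // The number of sources is only known at runtime.
    static List<DataSet<String>> getInputs(String[] args, ExecutionEnvironment env) {
        List<DataSet<String>> inputs = new ArrayList<>();
        for (String path : args) {
            inputs.add(env.readTextFile(path));
        }
        return inputs;
    }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        List<DataSet<String>> inputs = getInputs(args, env);

        // "Process the list properly later on": here we simply union all inputs
        // into one DataSet and attach a sink to it.
        DataSet<String> all = inputs.get(0);
        for (int i = 1; i < inputs.size(); i++) {
            all = all.union(inputs.get(i));
        }
        all.writeAsText("/tmp/union-output");

        // Only now is the plan assembled and submitted.
        env.execute("runtime-determined sources");
    }
}

Only the data sets that end up connected to a sink (here, all of them via the union) are executed when execute() is called.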
> On Wed, Jan 21, 2015 at 1:55 PM, Till Rohrmann <[email protected]> wrote:
>
> Hi Flavio,
>
> if your question was whether you can write a Flink job which can read input
> from different sources depending on the user input, then the answer is yes.
> The Flink job plans are actually generated at runtime, so you can easily
> write a method which generates a user-dependent input/data set.
>
> You could do something like this:
>
> DataSet<ElementType> getInput(String[] args, ExecutionEnvironment env) {
>     if (args[0].equals("csv")) {
>         return env.readCsvFile(...);
>     } else {
>         return env.createInput(new AvroInputFormat<ElementType>(...));
>     }
> }
>
> as long as the element type of the data set is the same for all possible
> data sources. I hope that I understood your problem correctly.
>
> Greets,
>
> Till
>
> On Wed, Jan 21, 2015 at 11:45 AM, Flavio Pompermaier <[email protected]> wrote:
>
> Hi guys,
>
> I have a big question for you about how Flink handles a job's plan generation:
> let's suppose that I want to write a job that takes as input a description of
> a set of datasets that I want to work on (for example a CSV file and its
> path, 2 HBase tables, 1 Parquet directory and its path, etc.).
> From what I know, Flink generates the job's plan at compile time, so I was
> wondering whether this is possible right now or not.
>
> Thanks in advance,
> Flavio
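To make the "pre-flight" point from earlier in the thread concrete, here is a small sketch (class name, output path, and job name are made up): defining sources and transformations only declares the data flow, and the plan is assembled and submitted to the execution system when env.execute() is called. A DataSet that is not connected to any sink never becomes part of the plan:

import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class PreFlightExample {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Defining sources and transformations builds the data flow only;
        // nothing is read or computed at this point.
        DataSet<Long> used = env.generateSequence(1, 1000)
                .filter(new FilterFunction<Long>() {
                    @Override
                    public boolean filter(Long value) {
                        return value % 2 == 0;
                    }
                });

        DataSet<Long> unused = env.generateSequence(1, 1000)
                .map(new MapFunction<Long, Long>() {
                    @Override
                    public Long map(Long value) {
                        return value * 10;
                    }
                });

        // Only 'used' is connected to a sink.
        used.writeAsText("/tmp/even-numbers");

        // execute() triggers the pre-flight/optimize phase: the environment
        // traverses from the registered sink back to its sources, builds and
        // optimizes the execution plan, and submits it. 'unused' has no sink,
        // so it is not part of the plan and never runs.
        env.execute("pre-flight example");
    }
}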
