But it does not only optimize the data flow. It also translates it into a different representation.
On Thu, Jan 22, 2015 at 3:34 PM, Robert Metzger <[email protected]> wrote:

> How about renaming "flink-compiler" to "flink-optimizer"?
>
> On Wed, Jan 21, 2015 at 8:21 PM, Stephan Ewen <[email protected]> wrote:
>
>> There is a common misunderstanding between the "compile" phase of the
>> Java/Scala compiler (which does not generate the Flink plan) and the
>> Flink "compile/optimize" phase (which happens when calling
>> env.execute()).
>>
>> The Flink compile/optimize phase is not a compile phase in the sense
>> that source code is parsed and translated to byte code. It is only a set
>> of transformations on the program's data flow.
>>
>> We should probably stop calling the Flink phase "compile" and simply say
>> "pre-flight", "optimize", or "prepare". Otherwise, it creates frequent
>> confusion...
>>
>> On Wed, Jan 21, 2015 at 6:05 AM, Flavio Pompermaier <[email protected]> wrote:
>>
>>> Thanks Fabian, that makes a lot of sense :)
>>>
>>> Best,
>>> Flavio
>>>
>>> On Wed, Jan 21, 2015 at 2:41 PM, Fabian Hueske <[email protected]> wrote:
>>>
>>>> The program is compiled when the ExecutionEnvironment.execute() method
>>>> is called. At that moment, the ExecutionEnvironment collects all data
>>>> sources that were previously created and traverses them towards
>>>> connected data sinks. All sinks that are found this way are remembered
>>>> and treated as execution targets. The sinks and all connected
>>>> operators and data sources are given to the optimizer, which analyzes
>>>> the plan, compiles an execution plan, and submits it to the execution
>>>> system that the ExecutionEnvironment refers to (local, remote, ...).
>>>>
>>>> Therefore, your code can build arbitrary data flows with as many
>>>> sources as you like. Once you call ExecutionEnvironment.execute(), all
>>>> data sources and operators that are required to compute the result of
>>>> all data sinks are executed.
>>>>
>>>> 2015-01-21 14:26 GMT+01:00 Flavio Pompermaier <[email protected]>:
>>>>
>>>>> Great! Could you explain to me a little bit the internals of how and
>>>>> when Flink generates the plan and how the execution environment is
>>>>> involved in this phase? Just to better understand this step!
>>>>>
>>>>> Thanks again,
>>>>> Flavio
>>>>>
>>>>> On Wed, Jan 21, 2015 at 2:14 PM, Till Rohrmann <[email protected]> wrote:
>>>>>
>>>>>> Yes, this will also work. You only have to make sure that the list
>>>>>> of data sets is processed properly later on in your code.
>>>>>>
>>>>>> On Wed, Jan 21, 2015 at 2:09 PM, Flavio Pompermaier <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Till,
>>>>>>> thanks for the reply. However, my problem is that I'll have
>>>>>>> something like:
>>>>>>>
>>>>>>> List<DataSet<ElementType>> getInput(String[] args,
>>>>>>>     ExecutionEnvironment env) {....}
>>>>>>>
>>>>>>> So I don't know in advance how many of them I'll have at runtime.
>>>>>>> Does it still work?
>>>>>>>
>>>>>>> On Wed, Jan 21, 2015 at 1:55 PM, Till Rohrmann <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Flavio,
>>>>>>>>
>>>>>>>> if your question was whether you can write a Flink job which can
>>>>>>>> read input from different sources depending on the user input,
>>>>>>>> then the answer is yes. The Flink job plans are actually generated
>>>>>>>> at runtime, so you can easily write a method which generates a
>>>>>>>> user-dependent input/data set.
>>>>>>>>
>>>>>>>> You could do something like this:
>>>>>>>>
>>>>>>>> DataSet<ElementType> getInput(String[] args, ExecutionEnvironment env) {
>>>>>>>>     if (args[0].equals("csv")) {
>>>>>>>>         return env.readCsvFile(...);
>>>>>>>>     } else {
>>>>>>>>         return env.createInput(new AvroInputFormat<ElementType>(...));
>>>>>>>>     }
>>>>>>>> }
>>>>>>>>
>>>>>>>> as long as the element type of the data set is the same for all
>>>>>>>> possible data sources.
>>>>>>>> I hope that I understood your problem correctly.
>>>>>>>>
>>>>>>>> Greets,
>>>>>>>>
>>>>>>>> Till
>>>>>>>>
>>>>>>>> On Wed, Jan 21, 2015 at 11:45 AM, Flavio Pompermaier <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi guys,
>>>>>>>>>
>>>>>>>>> I have a big question for you about how Flink handles a job's
>>>>>>>>> plan generation: let's suppose that I want to write a job that
>>>>>>>>> takes as input a description of a set of datasets that I want to
>>>>>>>>> work on (for example a CSV file and its path, 2 HBase tables,
>>>>>>>>> 1 Parquet directory and its path, etc.).
>>>>>>>>> From what I know, Flink generates the job's plan at compile time,
>>>>>>>>> so I was wondering whether this is possible right now or not..
>>>>>>>>>
>>>>>>>>> Thanks in advance,
>>>>>>>>> Flavio
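[Editor's note] Till's pattern above, picking the input source at runtime and returning a variable number of data sets, can be sketched with plain Java collections standing in for Flink's DataSet. This is a hypothetical simplification, not Flink code: in the real job each branch would call env.readCsvFile(...) or env.createInput(new AvroInputFormat<>(...)) instead of building a List.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: plain Java Lists stand in for Flink's
// DataSet<ElementType>; the string "records" are placeholder data.
public class RuntimeInputs {

    // One "dataset" per user-supplied format argument; the count is
    // only known at runtime, which is exactly Flavio's situation.
    static List<List<String>> getInputs(String[] args) {
        List<List<String>> datasets = new ArrayList<>();
        for (String format : args) {
            if (format.equals("csv")) {
                // Real job: datasets.add(env.readCsvFile(...));
                datasets.add(Arrays.asList("csv-row-1", "csv-row-2"));
            } else {
                // Real job: datasets.add(env.createInput(...));
                datasets.add(Arrays.asList(format + "-record-1"));
            }
        }
        return datasets;
    }

    public static void main(String[] args) {
        // Later code must handle however many datasets came back,
        // e.g. by folding them into one union before executing.
        List<List<String>> inputs =
            getInputs(new String[]{"csv", "avro", "parquet"});
        System.out.println(inputs.size());
    }
}
```

As Till notes, the only constraint is that all branches produce the same element type, so the caller can process the list uniformly.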

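[Editor's note] Fabian's description of what happens inside execute(), collecting the registered sinks and walking backwards to every connected operator and source, amounts to a graph reachability traversal. A minimal sketch with hypothetical Node classes (not Flink's actual internal types):

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the "pre-flight" traversal Fabian describes:
// start from the sinks and collect everything needed to compute them.
public class PlanTraversal {

    static class Node {
        final String name;
        final List<Node> inputs;
        Node(String name, Node... inputs) {
            this.name = name;
            this.inputs = Arrays.asList(inputs);
        }
    }

    // Collect every operator/source reachable from the given sinks.
    static Set<String> reachableFromSinks(List<Node> sinks) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<Node> stack = new ArrayDeque<>(sinks);
        while (!stack.isEmpty()) {
            Node n = stack.pop();
            if (seen.add(n.name)) {
                for (Node input : n.inputs) stack.push(input);
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        Node source1 = new Node("source1");
        Node source2 = new Node("source2"); // not connected to any sink
        Node map = new Node("map", source1);
        Node sink = new Node("sink", map);
        // Only operators feeding a sink become part of the execution plan;
        // source2 is never visited and thus never executed.
        System.out.println(reachableFromSinks(Arrays.asList(sink)));
    }
}
```

This mirrors the point in the thread: a source that is not (transitively) connected to any sink is simply not part of the plan handed to the optimizer.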