How about renaming "flink-compiler" to "flink-optimizer"?
On Wed, Jan 21, 2015 at 8:21 PM, Stephan Ewen <[email protected]> wrote:

> There is a common misunderstanding between the "compile" phase of the
> Java/Scala compiler (which does not generate the Flink plan) and the Flink
> "compile/optimize" phase (which happens when env.execute() is called).
>
> The Flink compile/optimize phase is not a compile phase in the sense that
> source code is parsed and translated to byte code. It is only a set of
> transformations on the program's data flow.
>
> We should probably stop calling the Flink phase "compile" and simply say
> "pre-flight", "optimize", or "prepare". Otherwise it creates frequent
> confusion...
>
> On Wed, Jan 21, 2015 at 6:05 AM, Flavio Pompermaier <[email protected]> wrote:
>
>> Thanks Fabian, that makes a lot of sense :)
>>
>> Best,
>> Flavio
>>
>> On Wed, Jan 21, 2015 at 2:41 PM, Fabian Hueske <[email protected]> wrote:
>>
>>> The program is compiled when the ExecutionEnvironment.execute() method
>>> is called. At that moment, the ExecutionEnvironment collects all data
>>> sources that were previously created and traverses them towards connected
>>> data sinks. All sinks that are found this way are remembered and treated as
>>> execution targets. The sinks and all connected operators and data sources
>>> are given to the optimizer, which analyzes the plan, compiles an execution
>>> plan, and submits it to the execution system that the
>>> ExecutionEnvironment refers to (local, remote, ...).
>>>
>>> Therefore, your code can build arbitrary data flows with as many sources
>>> as you like. Once you call ExecutionEnvironment.execute(), all data sources
>>> and operators that are required to compute the result of all data sinks
>>> are executed.
>>>
>>> 2015-01-21 14:26 GMT+01:00 Flavio Pompermaier <[email protected]>:
>>>
>>>> Great! Could you explain to me a little bit the internals of how and when
>>>> Flink generates the plan, and how the execution environment is involved
>>>> in this phase?
>>>> Just to better understand this step!
>>>>
>>>> Thanks again,
>>>> Flavio
>>>>
>>>> On Wed, Jan 21, 2015 at 2:14 PM, Till Rohrmann <[email protected]> wrote:
>>>>
>>>>> Yes, this will also work. You only have to make sure that the list of
>>>>> data sets is processed properly later on in your code.
>>>>>
>>>>> On Wed, Jan 21, 2015 at 2:09 PM, Flavio Pompermaier <[email protected]> wrote:
>>>>>
>>>>>> Hi Till,
>>>>>> thanks for the reply. However, my problem is that I'll have something
>>>>>> like:
>>>>>>
>>>>>> List<DataSet<ElementType>> getInput(String[] args,
>>>>>>     ExecutionEnvironment env) { ... }
>>>>>>
>>>>>> So I don't know in advance how many of them I'll have at runtime.
>>>>>> Does it still work?
>>>>>>
>>>>>> On Wed, Jan 21, 2015 at 1:55 PM, Till Rohrmann <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Flavio,
>>>>>>>
>>>>>>> if your question was whether you can write a Flink job which can
>>>>>>> read input from different sources depending on the user input, then the
>>>>>>> answer is yes. The Flink job plans are actually generated at runtime, so
>>>>>>> you can easily write a method which generates a user-dependent
>>>>>>> input/data set.
>>>>>>>
>>>>>>> You could do something like this:
>>>>>>>
>>>>>>> DataSet<ElementType> getInput(String[] args, ExecutionEnvironment env) {
>>>>>>>     if ("csv".equals(args[0])) {
>>>>>>>         return env.readCsvFile(...);
>>>>>>>     } else {
>>>>>>>         return env.createInput(new AvroInputFormat<ElementType>(...));
>>>>>>>     }
>>>>>>> }
>>>>>>>
>>>>>>> as long as the element type of the data set is the same for all
>>>>>>> possible data sources. I hope that I understood your problem correctly.
>>>>>>>
>>>>>>> Greets,
>>>>>>>
>>>>>>> Till
>>>>>>>
>>>>>>> On Wed, Jan 21, 2015 at 11:45 AM, Flavio Pompermaier <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi guys,
>>>>>>>>
>>>>>>>> I have a big question for you about how Flink handles job plan
>>>>>>>> generation:
>>>>>>>> let's suppose that I want to write a job that takes as input a
>>>>>>>> description of a set of datasets that I want to work on (for example a
>>>>>>>> CSV file and its path, 2 HBase tables, 1 Parquet directory and its path,
>>>>>>>> etc.).
>>>>>>>> From what I know, Flink generates the job's plan at compile time, so
>>>>>>>> I was wondering whether this is possible right now or not.
>>>>>>>>
>>>>>>>> Thanks in advance,
>>>>>>>> Flavio
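The lazy, sink-driven plan construction Fabian describes above can be modeled with a small self-contained sketch. This is plain Java with no Flink dependency; `MiniEnvironment`, `Node`, and all other names here are hypothetical stand-ins, not Flink's actual API. The point is only the mechanism: operators merely record a dataflow graph, and nothing runs until execute() traverses the graph backwards from the registered sinks, which is also why the number of sources can be decided freely at runtime.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of lazy plan construction: operator calls only record
// the dataflow graph; execute() walks it from the sinks back to the sources.
class LazyPlanSketch {

    // A node in the dataflow graph (source, operator, or sink).
    static class Node {
        final String name;
        final Node input; // null for sources
        Node(String name, Node input) { this.name = name; this.input = input; }
    }

    static class MiniEnvironment {
        private final List<Node> sinks = new ArrayList<>();

        Node createSource(String name)        { return new Node(name, null); }
        Node map(Node input, String name)     { return new Node(name, input); }
        void addSink(Node input, String name) { sinks.add(new Node(name, input)); }

        // "Compile" and run only at this point: traverse from every sink
        // towards its sources and record the execution order.
        List<String> execute() {
            List<String> executed = new ArrayList<>();
            for (Node sink : sinks) {
                collect(sink, executed);
            }
            return executed;
        }

        private void collect(Node node, List<String> out) {
            if (node.input != null) collect(node.input, out); // sources first
            out.add(node.name);
        }
    }

    public static void main(String[] args) {
        MiniEnvironment env = new MiniEnvironment();

        // The number of sources is decided at runtime, as in the
        // List<DataSet<ElementType>> scenario from the thread.
        int numSources = 2;
        List<Node> inputs = new ArrayList<>();
        for (int i = 0; i < numSources; i++) {
            inputs.add(env.createSource("source-" + i));
        }

        for (Node in : inputs) {
            env.addSink(env.map(in, "map"), "sink");
        }

        // Only nodes reachable from a sink take part in the execution.
        System.out.println(env.execute());
        // prints [source-0, map, sink, source-1, map, sink]
    }
}
```

A source that is created but never connected to a sink simply never appears in the traversal, mirroring the behavior described in the thread.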
