Thanks Fabian, that makes a lot of sense :)

Best,
Flavio
On Wed, Jan 21, 2015 at 2:41 PM, Fabian Hueske <[email protected]> wrote:

> The program is compiled when the ExecutionEnvironment.execute() method is
> called. At that moment, the ExecutionEnvironment collects all data sources
> that were previously created and traverses them towards connected data
> sinks. All sinks that are found this way are remembered and treated as
> execution targets. The sinks and all connected operators and data sources
> are given to the optimizer, which analyses the plan, compiles an execution
> plan, and submits it to the execution system which the
> ExecutionEnvironment refers to (local, remote, ...).
>
> Therefore, your code can build arbitrary data flows with as many sources as
> you like. Once you call ExecutionEnvironment.execute(), all data sources and
> operators which are required to compute the result of all data sinks are
> executed.
>
>
> 2015-01-21 14:26 GMT+01:00 Flavio Pompermaier <[email protected]>:
>
>> Great! Could you explain to me a little bit the internals of how and when
>> Flink will generate the plan, and how the execution environment is involved
>> in this phase? Just to better understand this step!
>>
>> Thanks again,
>> Flavio
>>
>>
>> On Wed, Jan 21, 2015 at 2:14 PM, Till Rohrmann <[email protected]> wrote:
>>
>>> Yes, this will also work. You only have to make sure that the list of
>>> data sets is processed properly later on in your code.
>>>
>>> On Wed, Jan 21, 2015 at 2:09 PM, Flavio Pompermaier <[email protected]> wrote:
>>>
>>>> Hi Till,
>>>> thanks for the reply. However, my problem is that I'll have something
>>>> like:
>>>>
>>>> List<DataSet<ElementType>> getInput(String[] args, ExecutionEnvironment env) { .... }
>>>>
>>>> So I don't know in advance how many of them I'll have at runtime. Does
>>>> it still work?
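[Editorial aside: Fabian's description of lazy plan construction can be illustrated with a small plain-Java sketch. This is not the real Flink API or its internals, just the idea: creating sources and operators only records a graph, and execute() traverses it from the sinks back through the connected operators, so unconnected sources never run.]

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of lazy plan building: creating sources/operators records a
// graph; nothing runs until execute() walks it back from the sinks.
public class LazyPlanSketch {
    static class Node {
        final String name;
        final Node input; // upstream operator, null for sources
        Node(String name, Node input) { this.name = name; this.input = input; }
    }

    final List<Node> sinks = new ArrayList<>();

    Node source(String name)        { return new Node(name, null); }
    Node map(Node in, String name)  { return new Node(name, in); }
    void sink(Node in, String name) { sinks.add(new Node(name, in)); }

    // "execute": collect every operator reachable from a sink, upstream first.
    List<String> execute() {
        List<String> plan = new ArrayList<>();
        for (Node sink : sinks) {
            collect(sink, plan);
        }
        return plan;
    }

    private void collect(Node n, List<String> plan) {
        if (n.input != null) collect(n.input, plan);
        plan.add(n.name);
    }
}
```

Under this model, Flavio's variable-length `List` of data sets is unproblematic: any number of sources can be created in a loop, and only those connected to a sink end up in the executed plan.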
>>>> On Wed, Jan 21, 2015 at 1:55 PM, Till Rohrmann <[email protected]> wrote:
>>>>
>>>>> Hi Flavio,
>>>>>
>>>>> if your question was whether you can write a Flink job which can read
>>>>> input from different sources, depending on the user input, then the answer
>>>>> is yes. The Flink job plans are actually generated at runtime, so you
>>>>> can easily write a method which generates a user-dependent input/data set.
>>>>>
>>>>> You could do something like this:
>>>>>
>>>>> DataSet<ElementType> getInput(String[] args, ExecutionEnvironment env) {
>>>>>     if (args[0].equals("csv")) {
>>>>>         return env.readCsvFile(...);
>>>>>     } else {
>>>>>         return env.createInput(new AvroInputFormat<ElementType>(...));
>>>>>     }
>>>>> }
>>>>>
>>>>> as long as the element type of the data set is the same for all
>>>>> possible data sources. I hope that I understood your problem correctly.
>>>>>
>>>>> Greets,
>>>>>
>>>>> Till
>>>>>
>>>>> On Wed, Jan 21, 2015 at 11:45 AM, Flavio Pompermaier <[email protected]> wrote:
>>>>>
>>>>>> Hi guys,
>>>>>>
>>>>>> I have a big question for you about how Flink handles a job's plan
>>>>>> generation: let's suppose that I want to write a job that takes as input a
>>>>>> description of a set of datasets that I want to work on (for example a
>>>>>> csv file and its path, 2 hbase tables, 1 parquet directory and its path,
>>>>>> etc). From what I know, Flink generates the job's plan at compile time, so I
>>>>>> was wondering whether this is possible right now or not..
>>>>>>
>>>>>> Thanks in advance,
>>>>>> Flavio
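[Editorial aside: note that Till's original snippet compared the argument with `args[0] == csv`; in Java, `==` compares object references, so string dispatch should use `equals()` or a `switch` on the string. A minimal self-contained sketch of the dispatch pattern, with hypothetical format names standing in for the real input formats:]

```java
// Dispatching on a runtime argument (format names are illustrative).
// A switch on String compares by value (it uses equals() internally),
// which is what an input-selection method like getInput() needs.
public class InputDispatch {
    static String chooseFormat(String arg) {
        switch (arg) {
            case "csv":  return "CsvInputFormat";
            case "avro": return "AvroInputFormat";
            default:     return "unknown";
        }
    }
}
```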
