Re: Stream conversion

2016-02-05 Thread Ufuk Celebi

> On 05 Feb 2016, at 08:56, Jeyhun Karimov wrote:
> 
> For example, I will do aggregate operations with other windows (n-window
> aggregations) that have already been output.
> I tried your suggestion and used the filesystem sink to output to HDFS.
> I got k files in the HDFS directory, where k is the parallelism (I used a
> single machine).
> These files get bigger (new records are appended) as the stream continues.
> Because the output files are not closed and their size changes regularly,
> would this cause problems when processing the data with the DataSet API,
> Hadoop, or another library?

I think you have used the plain file sink, whereas Robert was referring to
the rolling HDFS file sink [1]. This will bucket your data into different
directories like this: /base/path/{date-time}/part-{parallel-task}-{count}
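
For illustration, a minimal sketch of wiring up the rolling sink (this
assumes the flink-connector-filesystem dependency; class and method names
follow the 1.0-era docs in [1] and may differ in other versions):

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.fs.DateTimeBucketer;
    import org.apache.flink.streaming.connectors.fs.RollingSink;

    public class RollingSinkExample {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            // Hypothetical source; replace with your actual stream.
            DataStream<String> events = env.socketTextStream("localhost", 9999);

            // Buckets output like /base/path/2016-02-05--0856/part-0-0
            RollingSink<String> sink = new RollingSink<>("hdfs:///base/path");
            sink.setBucketer(new DateTimeBucketer("yyyy-MM-dd--HHmm"));
            sink.setBatchSize(1024 * 1024 * 128); // roll part files at ~128 MB

            events.addSink(sink);
            env.execute("Rolling sink example");
        }
    }

Unlike the plain file sink, finished part files are closed and renamed,
and files still being written carry an in-progress marker, so a batch job
can restrict itself to completed files.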

– Ufuk

[1] 
https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/connectors/hdfs.html



Re: Stream conversion

2016-02-04 Thread Stephan Ewen
Hi!

If I understand you correctly, what you are looking for is a kind of
periodic batch job, where the input data for each batch is a large window.

We have actually thought about this kind of application before. It is not
on the short-term roadmap that we shared a few weeks ago, but I think it
will come to Flink in the mid term (that would be in some months or so);
it is asked for quite frequently.

Implementing this as a core feature is a bit of effort. A mock that writes
out the windows and triggers a batch job actually does not sound too
difficult.
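
To sketch the mock idea (purely illustrative: the types are assumptions,
and the batch-trigger mechanism is not an existing Flink feature):

    import org.apache.flink.api.java.tuple.Tuple;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
    import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
    import org.apache.flink.util.Collector;

    // Tags every element with the end timestamp of its window so that a
    // downstream rolling sink can bucket the output per window. A separate
    // scheduler can then launch a DataSet job per completed bucket.
    public class WindowDumper
            implements WindowFunction<String, Tuple2<Long, String>, Tuple, TimeWindow> {

        @Override
        public void apply(Tuple key, TimeWindow window, Iterable<String> values,
                          Collector<Tuple2<Long, String>> out) {
            for (String value : values) {
                out.collect(new Tuple2<>(window.getEnd(), value));
            }
            // A real implementation would also signal "window complete" here,
            // e.g. by writing a marker file that the batch scheduler watches.
        }
    }

This would be wired in via something like
stream.keyBy(...).timeWindow(Time.minutes(10)).apply(new WindowDumper())
in front of the file sink.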

Greetings,
Stephan


On Thu, Feb 4, 2016 at 10:30 AM, Sane Lee wrote:

> I also have a similar scenario. Any suggestion would be appreciated.


Re: Stream conversion

2016-02-04 Thread Jeyhun Karimov
For example, I will do aggregate operations with other windows (n-window
aggregations) that have already been output.
I tried your suggestion and used the filesystem sink to output to HDFS.
I got k files in the HDFS directory, where k is the parallelism (I used a
single machine).
These files get bigger (new records are appended) as the stream continues.
Because the output files are not closed and their size changes regularly,
would this cause problems when processing the data with the DataSet API,
Hadoop, or another library?



On Thu, Feb 4, 2016 at 2:14 PM Robert Metzger wrote:

> I'm wondering which kinds of transformations you want to apply to the
> window that you cannot apply with the DataStream API?
>
> Would it be sufficient for you to have the windows as files in HDFS and
> then run batch jobs against the windows on disk? If so, you could use our
> filesystem sink, which creates files bucketed by certain time windows.


Re: Stream conversion

2016-02-04 Thread Robert Metzger
I'm wondering which kinds of transformations you want to apply to the
window that you cannot apply with the DataStream API?

Would it be sufficient for you to have the windows as files in HDFS and
then run batch jobs against the windows on disk? If so, you could use our
filesystem sink, which creates files bucketed by certain time windows.
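
For example, a hedged sketch of such a batch job (the bucket path is
hypothetical, and the in-progress/pending file naming is an assumption
based on the rolling sink's defaults, so verify it against your version):

    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.Path;

    public class WindowBatchJob {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env =
                    ExecutionEnvironment.getExecutionEnvironment();

            // Hypothetical bucket directory produced by the streaming sink.
            Path bucket = new Path("hdfs:///base/path/2016-02-04--1114");

            DataSet<String> window = null;
            for (FileStatus status :
                    bucket.getFileSystem(new Configuration()).listStatus(bucket)) {
                String name = status.getPath().getName();
                // Skip files the streaming job is still writing.
                if (name.endsWith(".in-progress") || name.endsWith(".pending")) {
                    continue;
                }
                DataSet<String> part = env.readTextFile(status.getPath().toString());
                window = (window == null) ? part : window.union(part);
            }

            if (window != null) {
                // Any DataSet transformation works here; counting is a demo.
                System.out.println("records in window: " + window.count());
            }
        }
    }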

On Thu, Feb 4, 2016 at 11:33 AM, Stephan Ewen wrote:

> Hi!
>
> If I understand you correctly, what you are looking for is a kind of
> periodic batch job, where the input data for each batch is a large window.
>
> We have actually thought about this kind of application before. It is not
> on the short-term roadmap that we shared a few weeks ago, but I think it
> will come to Flink in the mid term (that would be in some months or so);
> it is asked for quite frequently.
>
> Implementing this as a core feature is a bit of effort. A mock that writes
> out the windows and triggers a batch job actually does not sound too
> difficult.
>
> Greetings,
> Stephan


Re: Stream conversion

2016-02-04 Thread Matthias J. Sax
Hi Sane,

Currently, the DataSet and DataStream APIs are strictly separated. Thus,
this is not possible at the moment.

What kind of operation do you want to perform on the data of a window?
Why do you want to convert the data into a data set?

-Matthias

On 02/04/2016 10:11 AM, Sane Lee wrote:
> Dear all,
> 
> I want to convert the data from each window of a stream to a DataSet.
> What is the best way to do that? So, while streaming, at the end of each
> window I want to convert that data to a DataSet and possibly apply
> DataSet transformations to it.
> Any suggestions?
> 
> -best
> -sane





Re: Stream conversion

2016-02-04 Thread Jeyhun Karimov
Hi Matthias,

This does not necessarily need to be in API functions. I just want a
roadmap for adding this functionality. Should I save each window's data to
disk and create a new DataSet environment in parallel? Or maybe change the
trigger functionality?

I have large windows. As I asked in a previous question, the problem with
large windows in Flink (that the data inside a window may not fit in
memory) will be solved. So, after getting the data of a window, I want to
do more than what the streaming API functions offer; therefore I need to
convert it to a DataSet. Any roadmap would be appreciated.

On Thu, Feb 4, 2016 at 10:23 AM Matthias J. Sax wrote:

> Hi Sane,
>
> Currently, the DataSet and DataStream APIs are strictly separated. Thus,
> this is not possible at the moment.
>
> What kind of operation do you want to perform on the data of a window?
> Why do you want to convert the data into a data set?
>
> -Matthias


Re: Stream conversion

2016-02-04 Thread Sane Lee
I also have a similar scenario. Any suggestion would be appreciated.

On Thu, Feb 4, 2016 at 10:29 AM Jeyhun Karimov wrote:

> Hi Matthias,
>
> This does not necessarily need to be in API functions. I just want a
> roadmap for adding this functionality. Should I save each window's data
> to disk and create a new DataSet environment in parallel? Or maybe change
> the trigger functionality?
>
> I have large windows. As I asked in a previous question, the problem with
> large windows in Flink (that the data inside a window may not fit in
> memory) will be solved. So, after getting the data of a window, I want to
> do more than what the streaming API functions offer; therefore I need to
> convert it to a DataSet. Any roadmap would be appreciated.
>>