Re: Creating a representative streaming workload

2015-11-24 Thread Andra Lungu
Hi,

Sorry for the ultra-late reply.

Another real-life streaming scenario is the one I am working on:
- collecting data from telecom cells in real time,
- filtering out certain information, or enriching/correlating events (adding
additional info based on the parameters received),
- all in order to understand what is happening in the network and to ensure
better quality of service.
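
As a rough illustration of the filter-and-enrich step in this scenario, here
is a minimal Python sketch. The event fields (cell_id, signal_dbm,
dropped_calls), the CELL_REGIONS lookup table, and the min_dropped threshold
are all hypothetical and are not taken from the actual system.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, Optional


@dataclass
class CellEvent:
    """A hypothetical measurement reported by one telecom cell."""
    cell_id: str
    signal_dbm: float
    dropped_calls: int
    region: Optional[str] = None  # filled in by the enrichment step


# Hypothetical reference data used to enrich events (assumption).
CELL_REGIONS: Dict[str, str] = {"cell-1": "north", "cell-2": "south"}


def filter_and_enrich(events: Iterable[CellEvent],
                      min_dropped: int = 1) -> Iterator[CellEvent]:
    """Drop uninteresting events and attach the region to the rest."""
    for event in events:
        # Filter: keep only events that report at least min_dropped drops.
        if event.dropped_calls < min_dropped:
            continue
        # Enrich: look up the region from the cell id received in the event.
        region = CELL_REGIONS.get(event.cell_id, "unknown")
        yield CellEvent(event.cell_id, event.signal_dbm,
                        event.dropped_calls, region)


events = [
    CellEvent("cell-1", -95.0, 3),
    CellEvent("cell-2", -70.0, 0),
    CellEvent("cell-9", -110.0, 1),
]
enriched = list(filter_and_enrich(events))
```

Because the function is a generator, it handles one event at a time and would
work the same way on an unbounded source.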

As for Robert's proposal, I'd like to work on the stream generator if there
is no time constraint, but first of all I'd like to hear more details. What
kind of data are we generating? How many fields are there and of what type?
Ideally, the user calling this generator should be able to make this
decision. Can we create a JIRA for this? This way, it would be easier to
start working on the task.
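
To make the idea of a caller-defined schema concrete, here is a minimal
Python sketch of a generator whose fields and types are chosen by the user.
Nothing here is an agreed design: the generate_events function, the FieldSpec
type, and the example field names are assumptions made up for illustration.

```python
import random
from typing import Any, Callable, Dict, Iterator

# A schema maps a field name to a function that draws one value for it.
FieldSpec = Dict[str, Callable[[random.Random], Any]]


def generate_events(schema: FieldSpec, count: int,
                    seed: int = 0) -> Iterator[Dict[str, Any]]:
    """Yield `count` events whose fields are defined by the caller."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    for _ in range(count):
        yield {name: make(rng) for name, make in schema.items()}


# Hypothetical schema: the caller decides how many fields and which types.
schema: FieldSpec = {
    "user_id": lambda rng: rng.randint(1, 1000),
    "action": lambda rng: rng.choice(["view", "add_to_cart", "purchase"]),
    "price": lambda rng: round(rng.uniform(1.0, 100.0), 2),
}

events = list(generate_events(schema, count=5))
```

Passing value factories instead of a fixed record type keeps the number and
type of fields entirely in the caller's hands, and the seed makes a generated
workload repeatable across systems under test.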

Thanks!
Andra

On Wed, Nov 18, 2015 at 12:14 PM, Robert Metzger wrote:

> Hey Vasia,
>
> I think a very common workload would be an event stream from web servers
> of an online shop. Usually, these shops have multiple servers, so events
> arrive out of order.
> I think there are plenty of different use cases that you can build around
> that data:
> - Users perform different actions that a streaming system could track
> (analysis of click-paths),
> - some simple statistics using windows (items sold in the last 10 minutes,
> ..).
> - Maybe fraud detection would be another use case.
> - Often, there also needs to be a sink to HDFS or another file system for
> a long-term archive.
>
> I would love to see such an event generator in Flink's contrib module. I
> think that's something the entire streaming space could use.
>
>
>
>
> On Mon, Nov 16, 2015 at 8:22 PM, Nick Dimiduk wrote:
>
>> All those should apply for streaming too...
>>
>> On Mon, Nov 16, 2015 at 11:06 AM, Vasiliki Kalavri <
>> vasilikikala...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> thanks Nick and Ovidiu for the links!
>>>
>>> Just to clarify, we're not looking into creating a generic streaming
>>> benchmark. We have quite limited time and resources for this project. What
>>> we want is to decide on a set of 3-4 _common_ streaming applications. To
>>> give you an idea, for the batch workload, we will pick something like a
>>> grep, one relational application, a graph algorithm, and an ML algorithm.
>>>
>>> Cheers,
>>> -Vasia.
>>>
>>> On 16 November 2015 at 19:25, Ovidiu-Cristian MARCU <
>>> ovidiu-cristian.ma...@inria.fr> wrote:
>>>
>>>> Regarding Flink vs Spark / Storm you can check here:
>>>> http://www.sparkbigdata.com/102-spark-blog-slim-baltagi/14-results-of-a-benchmark-between-apache-flink-and-apache-spark
>>>>
>>>> Best regards,
>>>> Ovidiu
>>>>
>>>> On 16 Nov 2015, at 15:21, Vasiliki Kalavri wrote:
>>>>
>>>> Hello squirrels,
>>>>
>>>> with some colleagues and students here at KTH, we have started 2
>>>> projects to evaluate (1) performance and (2) behavior in the presence of
>>>> memory interference in cloud environments, for Flink and other systems. We
>>>> want to provide our students with a workload of representative applications
>>>> for testing.
>>>>
>>>> While for batch applications, it is quite clear to us what classes of
>>>> applications are widely used and how to create a workload of different
>>>> types of applications, we are not quite sure about the streaming workload.
>>>>
>>>> That's why, we'd like your opinions! If you're using Flink streaming in
>>>> your company or your project, we'd love your input even more :-)
>>>>
>>>> What kind of applications would you consider as "representative" of a
>>>> streaming workload? Have you run any experiments to evaluate Flink versus
>>>> Spark, Storm etc.? If yes, would you mind sharing your code with us?
>>>>
>>>> We will of course be happy to share our results with everyone after we
>>>> have completed our study.
>>>>
>>>> Thanks a lot!
>>>> -Vasia.
>>>>
>>>>
>>>>
>>>
>>
>


Re: Creating a representative streaming workload

2015-11-18 Thread Robert Metzger
Hey Vasia,

I think a very common workload would be an event stream from web servers of
an online shop. Usually, these shops have multiple servers, so events
arrive out of order.
I think there are plenty of different use cases that you can build around
that data:
- Users perform different actions that a streaming system could track
(analysis of click-paths),
- some simple statistics using windows (items sold in the last 10 minutes,
..).
- Maybe fraud detection would be another use case.
- Often, there also needs to be a sink to HDFS or another file system for a
long-term archive.

I would love to see such an event generator in Flink's contrib module. I
think that's something the entire streaming space could use.
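
To illustrate the out-of-order arrival and the "items sold in the last 10
minutes" statistic mentioned above, here is a minimal Python sketch. It is
plain Python rather than Flink code, and everything in it is an assumption
for illustration: the function names, the one-sale-every-10-seconds rate, the
number of servers, and the maximum arrival delay.

```python
import random
from collections import Counter
from typing import Dict, List, Tuple

WINDOW_SECONDS = 600  # 10-minute tumbling windows


def make_out_of_order_sales(count: int, servers: int = 3, max_delay: int = 30,
                            seed: int = 0) -> List[Tuple[int, str]]:
    """Return (event_time, server) pairs in arrival order, not event order."""
    rng = random.Random(seed)
    sales = []
    for i in range(count):
        event_time = i * 10  # assumed rate: one sale every 10 seconds
        server = f"web-{rng.randrange(servers)}"
        # Each server delivers with its own random delay, so arrival order
        # can differ from event-time order.
        arrival_time = event_time + rng.randint(0, max_delay)
        sales.append((arrival_time, event_time, server))
    sales.sort(key=lambda sale: sale[0])  # order by arrival time
    return [(event_time, server) for _, event_time, server in sales]


def items_sold_per_window(sales: List[Tuple[int, str]]) -> Dict[int, int]:
    """Count sales per 10-minute window, keyed by window start time."""
    counts: Counter = Counter()
    for event_time, _server in sales:
        # Assign by event time, so late arrivals land in the right window.
        window_start = event_time - event_time % WINDOW_SECONDS
        counts[window_start] += 1
    return dict(counts)


sales = make_out_of_order_sales(120)
counts = items_sold_per_window(sales)
```

This sketch sees the whole input before reporting counts. A real streaming
job would also have to decide when a window is complete and what to do with
events that arrive after that point.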




On Mon, Nov 16, 2015 at 8:22 PM, Nick Dimiduk wrote:

> All those should apply for streaming too...
>
> On Mon, Nov 16, 2015 at 11:06 AM, Vasiliki Kalavri <
> vasilikikala...@gmail.com> wrote:
>
>> Hi,
>>
>> thanks Nick and Ovidiu for the links!
>>
>> Just to clarify, we're not looking into creating a generic streaming
>> benchmark. We have quite limited time and resources for this project. What
>> we want is to decide on a set of 3-4 _common_ streaming applications. To
>> give you an idea, for the batch workload, we will pick something like a
>> grep, one relational application, a graph algorithm, and an ML algorithm.
>>
>> Cheers,
>> -Vasia.
>>
>> On 16 November 2015 at 19:25, Ovidiu-Cristian MARCU <
>> ovidiu-cristian.ma...@inria.fr> wrote:
>>
>>> Regarding Flink vs Spark / Storm you can check here:
>>> http://www.sparkbigdata.com/102-spark-blog-slim-baltagi/14-results-of-a-benchmark-between-apache-flink-and-apache-spark
>>>
>>> Best regards,
>>> Ovidiu
>>>
>>> On 16 Nov 2015, at 15:21, Vasiliki Kalavri wrote:
>>>
>>> Hello squirrels,
>>>
>>> with some colleagues and students here at KTH, we have started 2
>>> projects to evaluate (1) performance and (2) behavior in the presence of
>>> memory interference in cloud environments, for Flink and other systems. We
>>> want to provide our students with a workload of representative applications
>>> for testing.
>>>
>>> While for batch applications, it is quite clear to us what classes of
>>> applications are widely used and how to create a workload of different
>>> types of applications, we are not quite sure about the streaming workload.
>>>
>>> That's why, we'd like your opinions! If you're using Flink streaming in
>>> your company or your project, we'd love your input even more :-)
>>>
>>> What kind of applications would you consider as "representative" of a
>>> streaming workload? Have you run any experiments to evaluate Flink versus
>>> Spark, Storm etc.? If yes, would you mind sharing your code with us?
>>>
>>> We will of course be happy to share our results with everyone after we
>>> have completed our study.
>>>
>>> Thanks a lot!
>>> -Vasia.
>>>
>>>
>>>
>>
>


Re: Creating a representative streaming workload

2015-11-16 Thread Nick Dimiduk
Why not use an existing benchmarking tool -- is there one? Perhaps you'd
like to build something like YCSB [0] but for streaming workloads?

Apache Storm is the OSS framework that's been around the longest. Search
for "apache storm benchmark" and you'll get some promising hits. Looks like
IBMStreams has a tool [1] and the Ericsson research blog has a detailed
post [2] as well.

[0]: https://github.com/brianfrankcooper/YCSB
[1]:
https://github.com/IBMStreams/benchmarks/wiki/Running-Apache-Storm-benchmark
[2]:
http://www.ericsson.com/research-blog/data-knowledge/trident-benchmarking-performance/

On Mon, Nov 16, 2015 at 6:21 AM, Vasiliki Kalavri wrote:

> Hello squirrels,
>
> with some colleagues and students here at KTH, we have started 2 projects
> to evaluate (1) performance and (2) behavior in the presence of memory
> interference in cloud environments, for Flink and other systems. We want to
> provide our students with a workload of representative applications for
> testing.
>
> While for batch applications, it is quite clear to us what classes of
> applications are widely used and how to create a workload of different
> types of applications, we are not quite sure about the streaming workload.
>
> That's why, we'd like your opinions! If you're using Flink streaming in
> your company or your project, we'd love your input even more :-)
>
> What kind of applications would you consider as "representative" of a
> streaming workload? Have you run any experiments to evaluate Flink versus
> Spark, Storm etc.? If yes, would you mind sharing your code with us?
>
> We will of course be happy to share our results with everyone after we
> have completed our study.
>
> Thanks a lot!
> -Vasia.
>


Re: Creating a representative streaming workload

2015-11-16 Thread Vasiliki Kalavri
Hi,

thanks Nick and Ovidiu for the links!

Just to clarify, we're not looking into creating a generic streaming
benchmark. We have quite limited time and resources for this project. What
we want is to decide on a set of 3-4 _common_ streaming applications. To
give you an idea, for the batch workload, we will pick something like a
grep, one relational application, a graph algorithm, and an ML algorithm.

Cheers,
-Vasia.

On 16 November 2015 at 19:25, Ovidiu-Cristian MARCU <
ovidiu-cristian.ma...@inria.fr> wrote:

> Regarding Flink vs Spark / Storm you can check here:
> http://www.sparkbigdata.com/102-spark-blog-slim-baltagi/14-results-of-a-benchmark-between-apache-flink-and-apache-spark
>
> Best regards,
> Ovidiu
>
> On 16 Nov 2015, at 15:21, Vasiliki Kalavri wrote:
>
> Hello squirrels,
>
> with some colleagues and students here at KTH, we have started 2 projects
> to evaluate (1) performance and (2) behavior in the presence of memory
> interference in cloud environments, for Flink and other systems. We want to
> provide our students with a workload of representative applications for
> testing.
>
> While for batch applications, it is quite clear to us what classes of
> applications are widely used and how to create a workload of different
> types of applications, we are not quite sure about the streaming workload.
>
> That's why, we'd like your opinions! If you're using Flink streaming in
> your company or your project, we'd love your input even more :-)
>
> What kind of applications would you consider as "representative" of a
> streaming workload? Have you run any experiments to evaluate Flink versus
> Spark, Storm etc.? If yes, would you mind sharing your code with us?
>
> We will of course be happy to share our results with everyone after we
> have completed our study.
>
> Thanks a lot!
> -Vasia.
>
>
>


Re: Creating a representative streaming workload

2015-11-16 Thread Nick Dimiduk
All those should apply for streaming too...

On Mon, Nov 16, 2015 at 11:06 AM, Vasiliki Kalavri <
vasilikikala...@gmail.com> wrote:

> Hi,
>
> thanks Nick and Ovidiu for the links!
>
> Just to clarify, we're not looking into creating a generic streaming
> benchmark. We have quite limited time and resources for this project. What
> we want is to decide on a set of 3-4 _common_ streaming applications. To
> give you an idea, for the batch workload, we will pick something like a
> grep, one relational application, a graph algorithm, and an ML algorithm.
>
> Cheers,
> -Vasia.
>
> On 16 November 2015 at 19:25, Ovidiu-Cristian MARCU <
> ovidiu-cristian.ma...@inria.fr> wrote:
>
>> Regarding Flink vs Spark / Storm you can check here:
>> http://www.sparkbigdata.com/102-spark-blog-slim-baltagi/14-results-of-a-benchmark-between-apache-flink-and-apache-spark
>>
>> Best regards,
>> Ovidiu
>>
>> On 16 Nov 2015, at 15:21, Vasiliki Kalavri wrote:
>>
>> Hello squirrels,
>>
>> with some colleagues and students here at KTH, we have started 2 projects
>> to evaluate (1) performance and (2) behavior in the presence of memory
>> interference in cloud environments, for Flink and other systems. We want to
>> provide our students with a workload of representative applications for
>> testing.
>>
>> While for batch applications, it is quite clear to us what classes of
>> applications are widely used and how to create a workload of different
>> types of applications, we are not quite sure about the streaming workload.
>>
>> That's why, we'd like your opinions! If you're using Flink streaming in
>> your company or your project, we'd love your input even more :-)
>>
>> What kind of applications would you consider as "representative" of a
>> streaming workload? Have you run any experiments to evaluate Flink versus
>> Spark, Storm etc.? If yes, would you mind sharing your code with us?
>>
>> We will of course be happy to share our results with everyone after we
>> have completed our study.
>>
>> Thanks a lot!
>> -Vasia.
>>
>>
>>
>