SLIDES: Overview of Apache Flink: Next-Gen Big Data Analytics Framework

2015-07-06 Thread Slim Baltagi
Hi This is the link *http://goo.gl/gVOSp8* to the slides of my talk on June 30, 2015 at the Chicago Apache Flink meetup. Although most of the current buzz is about Apache Spark, the talk shows how Apache Flink offers the only hybrid open source (Real-Time Streaming + Batch) distributed data proc

Reply: In windows8 + VitualBox, how to build Flink development environment?

2015-07-06 Thread Chenliang (Liang, DataSight)
Dear Aljoscha Thanks for your response. Done for IntelliJ. Regards Liang From: Aljoscha Krettek [mailto:aljos...@apache.org] Sent: July 6, 2015 15:58 To: user@flink.apache.org Subject: Re: In windows8 + VitualBox, how to build Flink development environment? Hi, most Flink developers use IntelliJ so it i

Re: Benchmark results between Flink and Spark

2015-07-06 Thread Hawin Jiang
Hi Stephan Yes. You are correct. It looks like TPCx-HS is an industry standard for big data. But how can we get a Flink number on that? I think it is also difficult to get a Spark performance number based on TPCx-HS. If you know someone who can provide servers for performance testing, I would like t

Re: Benchmark results between Flink and Spark

2015-07-06 Thread Slim Baltagi
Hi Vasia, thanks for sharing. 1. I would like to add a couple of resources about *BigBench*, the Big Data benchmark suite that you are referring to: https://github.com/intel-hadoop/Big-Data-Benchmark-for-Big-Bench and also http://blog.cloudera.com/blog/2014/11/bigbench-toward-an-industry-standar

Re: Benchmark results between Flink and Spark

2015-07-06 Thread Vasiliki Kalavri
Hi, Apart from the amplab benchmark, you might also find [1] and [2] interesting. The first is a survey on existing benchmarks, while the second proposes one. However, they are also limited to SQL-like queries. Regarding graph processing benchmarks, I recently came across Graphalytics [3]. The be

Re: Benchmark results between Flink and Spark

2015-07-06 Thread Slim Baltagi
Hi Hawin What you shared is not 'the Spark benchmark'. This benchmark measures the response time of different tools, including Shark, on a handful of relational queries. Shark development ended a year ago on July 1, 2014 in favor of Spark SQL, which graduated from an alpha project on March 13, 2015.

Re: Benchmark results between Flink and Spark

2015-07-06 Thread Stephan Ewen
Hi Hawin! The benchmark you refer to is a more or less pure SQL benchmark. For systems that are designed for exactly the "beyond SQL" applications (streaming, iterative algorithms, UDFs, ...), this benchmark is probably not very meaningful, as it covers none of these areas. Even in the SQL an

Re: Benchmark results between Flink and Spark

2015-07-06 Thread Hawin Jiang
Hi Slim and Fabian Here is the Spark benchmark. https://amplab.cs.berkeley.edu/benchmark/ Do we have a similar report or comparison like that? Thanks. Best regards Hawin On Mon, Jul 6, 2015 at 6:32 AM, Slim Baltagi wrote: > Hi Fabian > > > I could not find which versions of Flink and Spar

RE: Benchmark results between Flink and Spark

2015-07-06 Thread Wang, Yanping
Hi, I am new to the Flink community. I am interested in comparing Flink’s features and performance vs. Spark. Does anyone know if there is any benchmark or test available for testing Spark performance on servers that have 32 plus cores and 256GB plus memory? Thanks -yanping From: Fabian Hueske [mai

Re: Benchmark results between Flink and Spark

2015-07-06 Thread Slim Baltagi
Hi Fabian > I could not find which versions of Flink and Spark were compared. According to Norman Spangenberg, one of the authors of the conference paper, the benchmark used *Spark* version *1.2.0* and *Flink* version *0.8.0*. I did ask him a few more questions about the benchmark betwee

Re: fault tolerance model for stateful streams

2015-07-06 Thread Stephan Ewen
Hi Nathan! The state is stored in a configurable "state backend". The state backend itself must be fault tolerant, like HDFS, HBase, Cassandra, Ignite, ... What the highly available Flink version does is to store the "StateHandle" in Zookeeper. The StateHandle is the metadata that points to the s
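The split described in this message, with state kept in a fault-tolerant backend and only a small handle kept in ZooKeeper, can be illustrated with a minimal Java sketch. Everything in it is an assumption made for illustration: the class `StateHandleSketch`, the two in-memory maps standing in for the backend and for ZooKeeper, and the `/checkpoints/<id>` path scheme are hypothetical and are not Flink APIs.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names are Flink classes.
class StateHandleSketch {
    // Stand-in for a fault-tolerant state backend (e.g. HDFS): path -> state bytes.
    static final Map<String, byte[]> backend = new HashMap<>();
    // Stand-in for ZooKeeper: checkpoint id -> small handle (here just the path).
    static final Map<Long, String> coordination = new HashMap<>();

    // Write the (possibly large) state to the backend and record only
    // the pointer to it in the coordination store.
    static void checkpoint(long id, byte[] state) {
        String path = "/checkpoints/" + id; // assumed path scheme
        backend.put(path, state.clone());
        coordination.put(id, path);
    }

    // Recover by resolving the handle first, then reading the backend.
    static byte[] restore(long id) {
        String path = coordination.get(id);
        if (path == null) {
            throw new IllegalStateException("no handle for checkpoint " + id);
        }
        return backend.get(path).clone();
    }

    public static void main(String[] args) {
        checkpoint(1L, "counter=42".getBytes());
        System.out.println(coordination.get(1L));
        System.out.println(new String(restore(1L)));
    }
}
```

The point of the sketch is only that the coordination store never holds the state itself, so it stays small regardless of how much state the job accumulates.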

Re: Tuple model project

2015-07-06 Thread Flavio Pompermaier
Do you think it could be a good idea to extract Flink tuples into a separate project to allow simpler dependency management in Flink-compatible projects? On Mon, Jul 6, 2015 at 11:06 AM, Fabian Hueske wrote: > Hi, > > at the moment, Tuples are more efficient than POJOs, because POJO fields

Re: Tuple model project

2015-07-06 Thread Fabian Hueske
Hi, at the moment, Tuples are more efficient than POJOs, because POJO fields are accessed via Java reflection whereas Tuple fields are directly accessed. This performance penalty could be overcome by code-generated serializers and comparators but I am not aware of any work in that direction. Best
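The difference between the two access paths mentioned in this message can be sketched in plain Java. This is a minimal illustration under stated assumptions, not Flink code: `Pair` is a hypothetical stand-in for a tuple with a directly readable field, `Person` is a hypothetical POJO, and `readReflective` only shows what a by-name reflective field read looks like in general.

```java
import java.lang.reflect.Field;

// Hypothetical illustration only; none of these names are Flink classes.
class FieldAccessSketch {
    // Tuple-style access: the field is read directly, resolved at compile time.
    static String readDirect(Pair p) {
        return p.f0;
    }

    // POJO-style access via reflection: the field is looked up by name at runtime.
    static Object readReflective(Object pojo, String fieldName) {
        try {
            Field field = pojo.getClass().getDeclaredField(fieldName);
            field.setAccessible(true); // needed because the field is private
            return field.get(pojo);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot read field " + fieldName, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readDirect(new Pair("flink", "0.9")));
        System.out.println(readReflective(new Person("flink"), "name"));
    }
}

// Stand-in for a two-field tuple with directly readable fields.
class Pair {
    final String f0;
    final String f1;

    Pair(String f0, String f1) {
        this.f0 = f0;
        this.f1 = f1;
    }
}

// Stand-in for a user POJO with a private field.
class Person {
    private final String name;

    Person(String name) {
        this.name = name;
    }
}
```

Both calls return the same value; the reflective path simply does more work per access, which is the kind of overhead the message attributes to POJO fields.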

Tuple model project

2015-07-06 Thread Flavio Pompermaier
Hi to all, I was thinking of writing my own flink-compatible library and I basically need a Tuple5. Is there any performance loss in using a POJO with 5 String fields vs a Tuple5? If yes, wouldn't it be a good idea to extract flink tuples into a separate simple project (e.g. flink-java-tuples) that has no

Re: POJO coGroup on null value

2015-07-06 Thread Fabian Hueske
In fact you can implement your own composite data types (like Tuple, Pojo) that can deal with nullable fields as keys, but you need custom serializers and comparators for that. These types won't be as efficient as types that cannot handle null fields. Cheers, Fabian 2015-07-02 20:17 GMT+02:00 Flavio Po
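As a rough illustration of why nullable key fields need a dedicated comparator, here is a minimal plain-Java sketch. It uses only the JDK's `Comparator.nullsFirst` and is not Flink's comparator interface; the class `NullableKeySketch` and the choice to order nulls before all other keys are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; this is not Flink's comparator API.
class NullableKeySketch {
    // Natural String ordering would throw a NullPointerException on null keys;
    // wrapping it defines an explicit position for null (here: before everything).
    static final Comparator<String> NULL_SAFE =
            Comparator.nullsFirst(Comparator.<String>naturalOrder());

    // Returns a sorted copy so the caller's list is left untouched.
    static List<String> sortKeys(List<String> keys) {
        List<String> copy = new ArrayList<>(keys);
        copy.sort(NULL_SAFE);
        return copy;
    }

    public static void main(String[] args) {
        System.out.println(sortKeys(Arrays.asList("b", null, "a")));
    }
}
```

The extra null check on every comparison is a small example of the cost the message refers to when it says such types are less efficient than types that cannot hold null fields.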

Re: fault tolerance model for stateful streams

2015-07-06 Thread Nathan Forager
Thanks for the information, Aljoscha! I'd love to better understand what the long-term solution is for fault tolerance here. Is the idea that ZooKeeper will be used to store the stream state? Or is the idea that we can efficiently use HDFS? Or are you designing your own key/value persistent storag

RE: data conversion between flink and "other" paradigms

2015-07-06 Thread Bill Sparks
Fabian. Thanks for the info and pointer to python. I'll check it out. -Bill From: Fabian Hueske [fhue...@gmail.com] Sent: Monday, July 06, 2015 3:23 AM To: user@flink.apache.org Subject: Re: data conversion between flink and "other" paradigms Hi Bill, a Dat

Re: data conversion between flink and "other" paradigms

2015-07-06 Thread Fabian Hueske
Hi Bill, a DataSet is just a logical concept in Flink. DataSets are often not persisted and just streamed along operators. At the moment, there is no way to access an intermediate DataSet of a Flink program directly (this might change in the future). You can process data in another function by im

Re: fault tolerance model for stateful streams

2015-07-06 Thread Aljoscha Krettek
Hi, good questions. About 1: you are right, when the JobManager fails the state is lost. Ufuk, Till and Stephan are currently working on making the JobManager fault tolerant by having hot-standby JobManagers and storing the important JobManager state in ZooKeeper. Maybe they can further comment on

Re: Reply: In windows8 + VitualBox, how to build Flink development environment?

2015-07-06 Thread Till Rohrmann
Hi Chenliang, most of the Flink committers are using IntelliJ, because it can better handle the mixed Java/Scala modules. I personally would also recommend you to do so. If you can't select a maven project, then you should check whether the maven plugin was installed correctly. You can see which

Re: In windows8 + VitualBox, how to build Flink development environment?

2015-07-06 Thread Aljoscha Krettek
Hi, most Flink developers use IntelliJ so it is probably easier for them to help you with problems if you use IntelliJ. Also, IntelliJ is easier to set up and works better for Flink because of mixed Java/Scala code. Cheers, Aljoscha On Mon, 6 Jul 2015 at 03:39 Chenliang (Liang, DataSight) < chenli

fault tolerance model for stateful streams

2015-07-06 Thread Nathan Forager
hi there, I noticed the 0.9 release announces exactly-once semantics for streams. I looked at the user guide and the primary mechanism for recovery appears to be checkpointing of user state. I have a few questions: 1. The default behavior is that checkpoints are kept in memory on the JobManager.

data conversion between flink and "other" paradigms

2015-07-06 Thread Bill Sparks
Just a question about whether there is some prior art here. Say someone wanted to use Flink for processing, but at some point they wanted to call another function via, say, JNI/C, which doesn't understand DataSets. How would one go about this ... I'm assuming the code would have to convert the data to