Re: Flink and Spark

Márton Balassi Fri, 26 Dec 2014 01:16:03 -0800

Hey,

You can find some ML examples, such as LinearRegression [1, 2] or KMeans [3, 4],
in the examples package in both Java and Scala as a quickstart.


[1]
https://github.com/apache/incubator-flink/blob/release-0.8/flink-examples/flink-java-examples/src/main/java/org/apache/flink/examples/java/ml/LinearRegression.java
[2]
https://github.com/apache/incubator-flink/blob/release-0.8/flink-examples/flink-scala-examples/src/main/scala/org/apache/flink/examples/scala/ml/LinearRegression.scala
[3]
https://github.com/apache/incubator-flink/blob/release-0.8/flink-examples/flink-java-examples/src/main/java/org/apache/flink/examples/java/clustering/KMeans.java
[4]
https://github.com/apache/incubator-flink/blob/release-0.8/flink-examples/flink-scala-examples/src/main/scala/org/apache/flink/examples/scala/clustering/KMeans.scala
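
For readers who want the idea before opening the links, here is a minimal sketch of one k-means iteration in plain Python. It is a generic textbook version, not code from the linked Flink examples, and it uses no Flink API; the function names (nearest_centroid, kmeans_step, kmeans) are made up for illustration.

```python
# Hypothetical plain-Python sketch of k-means; it is NOT taken from the
# linked Flink examples and does not use any Flink API.
from collections import defaultdict
from math import dist


def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))


def kmeans_step(points, centroids):
    """Assign each point to its nearest centroid, then average each group."""
    groups = defaultdict(list)
    for p in points:
        # Assignment step: tag each point with its nearest centroid.
        groups[nearest_centroid(p, centroids)].append(p)

    new_centroids = []
    for i, old in enumerate(centroids):
        members = groups.get(i)
        if not members:
            # Keep a centroid unchanged if no point was assigned to it.
            new_centroids.append(old)
            continue
        # Update step: coordinate-wise mean of the group's points.
        new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return new_centroids


def kmeans(points, centroids, iterations=10):
    """Repeat the assign/update step a fixed number of times."""
    for _ in range(iterations):
        centroids = kmeans_step(points, centroids)
    return centroids
```

Calling kmeans(points, initial_centroids) with lists of equal-length tuples returns the updated centroids. A distributed version would express the assignment as a map over the points and the update as a grouped reduce, but the exact operators used in [3, 4] are best read from the linked sources.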

On Fri, Dec 26, 2014 at 7:31 AM, Samarth Mailinglist <
[email protected]> wrote:

> Thank you for the answers, folks.
> Can anyone provide me a link to an implementation of an ML algorithm on
> Flink?
>
> On Thu, Dec 25, 2014 at 8:07 PM, Gyula Fóra <[email protected]> wrote:
>
>> Hey,
>>
>> 1-2. As for failure recovery, there is a difference in how Flink batch
>> and streaming programs handle failures. The failed parts of batch jobs
>> currently restart upon failure, but there is an ongoing effort on
>> fine-grained fault tolerance, which is somewhat similar to Spark's lineage
>> tracking. (So technically this is exactly-once semantics, but that is
>> somewhat meaningless for batch jobs.)
>>
>> For streaming programs we are currently working on fault tolerance; we
>> are hoping to support at-least-once processing guarantees in the 0.9
>> release. After that we will focus our research efforts on a
>> high-performance implementation of exactly-once processing semantics, which
>> is still a hard topic in streaming systems. Storm Trident's exactly-once
>> semantics can only provide very low throughput, and we are trying hard to
>> avoid this issue, as our streaming system is capable of much higher
>> throughput than Storm in general, as you can see in some perf measurements.
>>
>> 3. There are already many ML algorithms implemented for Flink, but they
>> are scattered all around. We are planning to collect them in a machine
>> learning library soon. We are also implementing an adapter for Samoa which
>> will provide some streaming machine learning algorithms as well. Samoa
>> integration should be ready in January.
>>
>> 4. Flink carefully manages its memory use to avoid heap errors and
>> uses memory space as effectively as it can. The optimizer for batch
>> programs also takes care of a lot of optimization steps that the user would
>> have to do manually in other systems, like optimizing the order of
>> transformations. There are of course parts of the program that still
>> need to be modified for maximal performance, for example parallelism settings
>> for some operators in some cases.
>>
>> 5. As for the status of the Python API, I personally cannot say very much;
>> maybe someone can jump in and help me with that question :)
>>
>> Regards,
>> Gyula
>>
>> On Thu, Dec 25, 2014 at 11:58 AM, Samarth Mailinglist <
>> [email protected]> wrote:
>>
>>> Thank you for your answer. I have a couple of follow-up questions:
>>> 1. Does it support the 'exactly once semantics' that Spark and Storm support?
>>> 2. (Related to 1) What happens when an error occurs during processing?
>>> 3. Is there a plan for adding Machine Learning support on top of Flink?
>>> Say Alternating Least Squares, basic Naive Bayes?
>>> 4. When you say Flink manages itself, does it mean I don't have to
>>> fiddle with the number of partitions (Spark) or the number of reducers /
>>> mappers (Hadoop?) to optimize performance? (In some cases this might be needed)
>>> 5. How far along is the Python API? I don't see the specs on the
>>> website.
>>>
>>> On Thu, Dec 25, 2014 at 4:31 AM, Márton Balassi <[email protected]>
>>> wrote:
>>>
>>>> Dear Samarth,
>>>>
>>>> Besides the discussions you have mentioned [1] I can recommend one of
>>>> our recent presentations [2], especially the distinguishing Flink section
>>>> (from slide 16).
>>>>
>>>> It is generally a difficult question, as both systems are rapidly
>>>> evolving, so the answer can become outdated quite fast. However, there are
>>>> fundamental design features that are highly unlikely to change; for example,
>>>> Spark uses "true" batch processing, meaning that intermediate results are
>>>> materialized (mostly in memory) as RDDs. Flink's engine is internally more
>>>> like streaming, forwarding the results to the next operator asap. The
>>>> latter can yield performance benefits for more complex jobs. Flink also
>>>> gives you a query optimizer, spills gracefully to disk when the system runs
>>>> out of memory and has some cool features around serialization. For
>>>> performance numbers and some more insight please check out the presentation
>>>> [2] and do not hesitate to post a follow-up mail here if you come across
>>>> something unclear or extraordinary.
>>>>
>>>> [1]
>>>> http://apache-flink-incubator-mailing-list-archive.1008284.n3.nabble.com/template/NamlServlet.jtp?macro=search_page&node=1&query=spark
>>>> [2] http://www.slideshare.net/GyulaFra/flink-apachecon
>>>>
>>>> Best,
>>>>
>>>> Marton
>>>>
>>>> On Tue, Dec 23, 2014 at 6:19 PM, Samarth Mailinglist <
>>>> [email protected]> wrote:
>>>>
>>>>> Hey folks, I have a noob question.
>>>>>
>>>>> I already looked up the archives and saw a couple of discussions
>>>>> <http://apache-flink-incubator-mailing-list-archive.1008284.n3.nabble.com/template/NamlServlet.jtp?macro=search_page&node=1&query=spark>
>>>>> about Spark and Flink.
>>>>>
>>>>> I am familiar with Spark (the Python API, esp. MLlib), and I see many
>>>>> similarities between Flink and Spark.
>>>>>
>>>>> How does Flink distinguish itself from Spark?
>>>>>
>>>>
>>>>
>>>
>>
>
