Hey,

You can find some ML examples, such as LinearRegression [1, 2] or KMeans [3, 4], in the examples package, in both Java and Scala, as a quickstart.
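
To give a rough idea of what a KMeans example computes before you open the links, here is a minimal plain-Python sketch of the standard k-means loop: assign every point to its nearest centroid, then move each centroid to the mean of its points, and repeat. This is not Flink code and not taken from the linked files; the function names (nearest, kmeans_step, kmeans) are made up for illustration, and keeping a centroid that attracts no points is simply a choice made in this sketch.

```python
# Plain-Python sketch of k-means; NOT Flink code.
# All names below are hypothetical and exist only for this illustration.
from collections import defaultdict


def nearest(point, centroids):
    """Return the index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def kmeans_step(points, centroids):
    """One iteration: group points by nearest centroid, then average each group."""
    groups = defaultdict(list)
    for point in points:
        groups[nearest(point, centroids)].append(point)

    new_centroids = []
    for i, old in enumerate(centroids):
        members = groups.get(i)
        if not members:
            # Assumption of this sketch: a centroid with no points stays where it is.
            new_centroids.append(old)
            continue
        # zip(*members) yields one tuple per coordinate; average each coordinate.
        new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
    return new_centroids


def kmeans(points, centroids, iterations=10):
    """Run a fixed number of iterations and return the final centroids."""
    for _ in range(iterations):
        centroids = kmeans_step(points, centroids)
    return centroids
```

Calling kmeans(points, initial_centroids) with lists of coordinate tuples returns the updated centroids. The fixed iteration count is only the simplest stopping rule; a convergence check could replace it.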

[1] https://github.com/apache/incubator-flink/blob/release-0.8/flink-examples/flink-java-examples/src/main/java/org/apache/flink/examples/java/ml/LinearRegression.java
[2] https://github.com/apache/incubator-flink/blob/release-0.8/flink-examples/flink-scala-examples/src/main/scala/org/apache/flink/examples/scala/ml/LinearRegression.scala
[3] https://github.com/apache/incubator-flink/blob/release-0.8/flink-examples/flink-java-examples/src/main/java/org/apache/flink/examples/java/clustering/KMeans.java
[4] https://github.com/apache/incubator-flink/blob/release-0.8/flink-examples/flink-scala-examples/src/main/scala/org/apache/flink/examples/scala/clustering/KMeans.scala

On Fri, Dec 26, 2014 at 7:31 AM, Samarth Mailinglist <
[email protected]> wrote:

> Thank you for the answers, folks.
> Can anyone provide me a link for any implementation of an ML algorithm on
> Flink?
>
> On Thu, Dec 25, 2014 at 8:07 PM, Gyula Fóra <[email protected]> wrote:
>
>> Hey,
>>
>> 1-2. As for failure recovery, there is a difference in how the Flink batch
>> and streaming programs handle failures. The failed parts of the batch jobs
>> currently restart upon failures but there is an ongoing effort on
>> fine-grained fault tolerance which is somewhat similar to Spark's lineage
>> tracking. (So technically this is exactly-once semantics, but that is
>> somewhat meaningless for batch jobs.)
>>
>> For streaming programs we are currently working on fault tolerance; we
>> are hoping to support at-least-once processing guarantees in the 0.9
>> release. After that we will focus our research efforts on a
>> high-performance implementation of exactly-once processing semantics,
>> which is still a hard topic in streaming systems. Storm Trident's
>> exactly-once semantics can only provide very low throughput while we are
>> trying hard to avoid this issue, as our streaming system is capable of
>> much higher throughput than Storm in general, as you can see on some perf
>> measurements.
>>
>> 3. There are already many ML algorithms implemented for Flink but they
>> are scattered all around. We are planning to collect them in a machine
>> learning library soon. We are also implementing an adapter for Samoa which
>> will provide some streaming machine learning algorithms as well. Samoa
>> integration should be ready in January.
>>
>> 4. Flink carefully manages its memory use to avoid heap errors, and
>> utilizes memory space as effectively as it can. The optimizer for batch
>> programs also takes care of a lot of optimization steps that the user would
>> manually have to do in other systems, like optimizing the order of
>> transformations etc. There are of course parts of the program that still
>> need to be modified for maximal performance, for example parallelism
>> settings for some operators in some cases.
>>
>> 5. As for the status of the Python API I personally cannot say very much,
>> maybe someone can jump in and help me with that question :)
>>
>> Regards,
>> Gyula
>>
>> On Thu, Dec 25, 2014 at 11:58 AM, Samarth Mailinglist <
>> [email protected]> wrote:
>>
>>> Thank you for your answer. I have a couple of follow-up questions:
>>> 1. Does it support 'exactly once semantics' that Spark and Storm support?
>>> 2. (Related to 1) What happens when an error occurs during processing?
>>> 3. Is there a plan for adding Machine Learning support on top of Flink?
>>> Say Alternating Least Squares, Basic Naive Bayes?
>>> 4. When you say Flink manages itself, does it mean I don't have to
>>> fiddle with number of partitions (Spark), number of reducers / mappers
>>> (Hadoop?) to optimize performance? (In some cases this might be needed)
>>> 5. How far along is the Python API? I don't see the specs on the
>>> website.
>>>
>>> On Thu, Dec 25, 2014 at 4:31 AM, Márton Balassi <[email protected]>
>>> wrote:
>>>
>>>> Dear Samarth,
>>>>
>>>> Besides the discussions you have mentioned [1] I can recommend one of
>>>> our recent presentations [2], especially the distinguishing Flink section
>>>> (from slide 16).
>>>>
>>>> It is generally a difficult question as both the systems are rapidly
>>>> evolving, so the answer can become outdated quite fast. However there are
>>>> fundamental design features that are highly unlikely to change, for example
>>>> Spark uses "true" batch processing, meaning that intermediate results are
>>>> materialized (mostly in memory) as RDDs. Flink's engine is internally more
>>>> like streaming, forwarding the results to the next operator asap. The
>>>> latter can yield performance benefits for more complex jobs. Flink also
>>>> gives you a query optimizer, spills gracefully to disk when the system runs
>>>> out of memory and has some cool features around serialization. For
>>>> performance numbers and some more insight please check out the presentation
>>>> [2] and do not hesitate to post a follow-up mail here if you come across
>>>> something unclear or extraordinary.
>>>>
>>>> [1]
>>>> http://apache-flink-incubator-mailing-list-archive.1008284.n3.nabble.com/template/NamlServlet.jtp?macro=search_page&node=1&query=spark
>>>> [2] http://www.slideshare.net/GyulaFra/flink-apachecon
>>>>
>>>> Best,
>>>>
>>>> Marton
>>>>
>>>> On Tue, Dec 23, 2014 at 6:19 PM, Samarth Mailinglist <
>>>> [email protected]> wrote:
>>>>
>>>>> Hey folks, I have a noob question.
>>>>>
>>>>> I already looked up the archives and saw a couple of discussions
>>>>> <http://apache-flink-incubator-mailing-list-archive.1008284.n3.nabble.com/template/NamlServlet.jtp?macro=search_page&node=1&query=spark>
>>>>> about Spark and Flink.
>>>>>
>>>>> I am familiar with Spark (the Python API, esp. MLlib), and I see many
>>>>> similarities between Flink and Spark.
>>>>>
>>>>> How does Flink distinguish itself from Spark?
>>>>>
>>>>
>>>>
>>>
>>
>
