Hi Samarth,

you can also find different implementations of ALS on Flink here:
https://github.com/project-flink/flink-perf/tree/master/flink-jobs/src/main/scala/com/github/projectflink/als

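In case it helps to see the idea before reading the Flink jobs, here is a minimal single-machine sketch of alternating least squares in plain Python. It is not taken from the flink-perf repository and uses no Flink API; the function names (als_rank1, rmse), the restriction to rank 1, the toy ratings, and the regularization value are all assumptions made to keep the sketch short.

```python
def als_rank1(ratings, n_users, n_items, reg=0.1, iterations=20):
    """Fit ratings[(u, i)] ~ user[u] * item[i] by alternating least squares.

    `ratings` maps (user index, item index) -> observed rating.
    Rank 1 and reg=0.1 are illustrative assumptions, not values from the thread.
    """
    user = [1.0] * n_users
    item = [1.0] * n_items
    for _ in range(iterations):
        # Step 1: hold the item factors fixed. Each user factor then has a
        # closed-form ridge-regression solution over that user's ratings.
        for u in range(n_users):
            num = sum(r * item[i] for (uu, i), r in ratings.items() if uu == u)
            den = reg + sum(item[i] ** 2 for (uu, i) in ratings if uu == u)
            user[u] = num / den
        # Step 2: hold the user factors fixed and solve for each item factor.
        for i in range(n_items):
            num = sum(r * user[u] for (u, ii), r in ratings.items() if ii == i)
            den = reg + sum(user[u] ** 2 for (u, ii) in ratings if ii == i)
            item[i] = num / den
    return user, item


def rmse(ratings, user, item):
    """Root-mean-square error over the observed ratings only."""
    errors = [(r - user[u] * item[i]) ** 2 for (u, i), r in ratings.items()]
    return (sum(errors) / len(errors)) ** 0.5


# Toy ratings that follow an exact rank-1 pattern (hypothetical data).
ratings = {(0, 0): 2.0, (0, 1): 4.0, (1, 0): 3.0,
           (1, 2): 9.0, (2, 1): 2.0, (2, 2): 3.0}
user, item = als_rank1(ratings, n_users=3, n_items=3)
print(rmse(ratings, user, item))
```

The distributed versions mainly differ in how the per-user and per-item solves are partitioned and how the factors are shipped between them; the alternating structure is the same.
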
On Fri, Dec 26, 2014 at 4:12 PM, Robert Metzger <[email protected]> wrote:

> For the Python API, there is a pending pull request:
> https://github.com/apache/incubator-flink/pull/202
> It is still a work in progress, but feedback is, as always, appreciated.
>
> On Fri, Dec 26, 2014 at 3:41 PM, Samarth Mailinglist <[email protected]> wrote:
>
>> Thanks a lot Márton and Gyula!
>>
>> On Fri, Dec 26, 2014 at 2:42 PM, Márton Balassi <[email protected]> wrote:
>>
>>> Hey,
>>>
>>> You can find some ML examples like LinearRegression [1, 2] or KMeans [3, 4]
>>> in the examples package in both Java and Scala as a quickstart.
>>>
>>> [1]
>>> https://github.com/apache/incubator-flink/blob/release-0.8/flink-examples/flink-java-examples/src/main/java/org/apache/flink/examples/java/ml/LinearRegression.java
>>> [2]
>>> https://github.com/apache/incubator-flink/blob/release-0.8/flink-examples/flink-scala-examples/src/main/scala/org/apache/flink/examples/scala/ml/LinearRegression.scala
>>> [3]
>>> https://github.com/apache/incubator-flink/blob/release-0.8/flink-examples/flink-java-examples/src/main/java/org/apache/flink/examples/java/clustering/KMeans.java
>>> [4]
>>> https://github.com/apache/incubator-flink/blob/release-0.8/flink-examples/flink-scala-examples/src/main/scala/org/apache/flink/examples/scala/clustering/KMeans.scala
>>>
>>> On Fri, Dec 26, 2014 at 7:31 AM, Samarth Mailinglist <[email protected]> wrote:
>>>
>>>> Thank you for the answers, folks.
>>>> Can anyone provide me a link for any implementation of an ML algorithm
>>>> on Flink?
>>>>
>>>> On Thu, Dec 25, 2014 at 8:07 PM, Gyula Fóra <[email protected]> wrote:
>>>>
>>>>> Hey,
>>>>>
>>>>> 1-2. As for failure recovery, there is a difference in how Flink
>>>>> batch and streaming programs handle failures. The failed parts of the
>>>>> batch jobs currently restart upon failure, but there is an ongoing
>>>>> effort on fine-grained fault tolerance, which is somewhat similar to
>>>>> Spark's lineage tracking. (So technically this is exactly-once
>>>>> semantics, but that is somewhat meaningless for batch jobs.)
>>>>>
>>>>> For streaming programs we are currently working on fault tolerance; we
>>>>> are hoping to support at-least-once processing guarantees in the 0.9
>>>>> release. After that we will focus our research efforts on a
>>>>> high-performance implementation of exactly-once processing semantics,
>>>>> which is still a hard topic in streaming systems. Storm Trident's
>>>>> exactly-once semantics can only provide very low throughput, while we
>>>>> are trying hard to avoid this issue, as our streaming system is capable
>>>>> of much higher throughput than Storm in general, as you can see in some
>>>>> perf measurements.
>>>>>
>>>>> 3. There are already many ML algorithms implemented for Flink, but they
>>>>> are scattered all around. We are planning to collect them in a machine
>>>>> learning library soon. We are also implementing an adapter for Samoa,
>>>>> which will provide some streaming machine learning algorithms as well.
>>>>> Samoa integration should be ready in January.
>>>>>
>>>>> 4. Flink carefully manages its memory use to avoid heap errors and to
>>>>> utilize memory space as effectively as it can. The optimizer for batch
>>>>> programs also takes care of a lot of optimization steps that the user
>>>>> would manually have to do in other systems, like optimizing the order
>>>>> of transformations etc. There are of course parts of the program that
>>>>> still need to be modified for maximal performance, for example
>>>>> parallelism settings for some operators in some cases.
>>>>>
>>>>> 5. As for the status of the Python API I personally cannot say very
>>>>> much, maybe someone can jump in and help me with that question :)
>>>>>
>>>>> Regards,
>>>>> Gyula
>>>>>
>>>>> On Thu, Dec 25, 2014 at 11:58 AM, Samarth Mailinglist <[email protected]> wrote:
>>>>>
>>>>>> Thank you for your answer. I have a couple of follow-up questions:
>>>>>> 1. Does it support the 'exactly once semantics' that Spark and Storm
>>>>>> support?
>>>>>> 2. (Related to 1) What happens when an error occurs during
>>>>>> processing?
>>>>>> 3. Is there a plan for adding Machine Learning support on top of
>>>>>> Flink? Say Alternating Least Squares, basic Naive Bayes?
>>>>>> 4. When you say Flink manages itself, does it mean I don't have to
>>>>>> fiddle with the number of partitions (Spark) or the number of
>>>>>> reducers / mappers (Hadoop?) to optimize performance? (In some cases
>>>>>> this might be needed)
>>>>>> 5. How far along is the Python API? I don't see the specs on the
>>>>>> website.
>>>>>>
>>>>>> On Thu, Dec 25, 2014 at 4:31 AM, Márton Balassi <[email protected]> wrote:
>>>>>>
>>>>>>> Dear Samarth,
>>>>>>>
>>>>>>> Besides the discussions you have mentioned [1] I can recommend one
>>>>>>> of our recent presentations [2], especially the "distinguishing
>>>>>>> Flink" section (from slide 16).
>>>>>>>
>>>>>>> It is generally a difficult question, as both systems are rapidly
>>>>>>> evolving, so the answer can become outdated quite fast. However,
>>>>>>> there are fundamental design features that are highly unlikely to
>>>>>>> change. For example, Spark uses "true" batch processing, meaning that
>>>>>>> intermediate results are materialized (mostly in memory) as RDDs.
>>>>>>> Flink's engine is internally more like streaming, forwarding the
>>>>>>> results to the next operator asap. The latter can yield performance
>>>>>>> benefits for more complex jobs. Flink also gives you a query
>>>>>>> optimizer, spills gracefully to disk when the system runs out of
>>>>>>> memory, and has some cool features around serialization. For
>>>>>>> performance numbers and some more insight please check out the
>>>>>>> presentation [2] and do not hesitate to post a follow-up mail here if
>>>>>>> you come across something unclear or extraordinary.
>>>>>>>
>>>>>>> [1]
>>>>>>> http://apache-flink-incubator-mailing-list-archive.1008284.n3.nabble.com/template/NamlServlet.jtp?macro=search_page&node=1&query=spark
>>>>>>> [2] http://www.slideshare.net/GyulaFra/flink-apachecon
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> Marton
>>>>>>>
>>>>>>> On Tue, Dec 23, 2014 at 6:19 PM, Samarth Mailinglist <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hey folks, I have a noob question.
>>>>>>>>
>>>>>>>> I already looked up the archives and saw a couple of discussions
>>>>>>>> <http://apache-flink-incubator-mailing-list-archive.1008284.n3.nabble.com/template/NamlServlet.jtp?macro=search_page&node=1&query=spark>
>>>>>>>> about Spark and Flink.
>>>>>>>>
>>>>>>>> I am familiar with Spark (the Python API, esp. MLlib), and I see
>>>>>>>> many similarities between Flink and Spark.
>>>>>>>>
>>>>>>>> How does Flink distinguish itself from Spark?
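
As a toy analogy for the point in Márton's message above (materializing intermediate results versus forwarding each result to the next operator as soon as possible), here is a small plain-Python sketch. It is only an analogy: it uses no Spark or Flink API, the operator names (parse, keep_even) and the trace lists are made up for illustration, and real engines are far more involved.

```python
def run_materialized(records, trace):
    """Each stage finishes and stores its full output before the next starts."""
    parsed = []
    for r in records:
        trace.append(("parse", r))
        parsed.append(int(r))
    kept = []
    for n in parsed:
        trace.append(("keep_even", n))
        if n % 2 == 0:
            kept.append(n)
    return kept


def run_pipelined(records, trace):
    """Each record flows through both stages before the next record is read."""
    def parse(stream):
        for r in stream:
            trace.append(("parse", r))
            yield int(r)

    def keep_even(stream):
        for n in stream:
            trace.append(("keep_even", n))
            if n % 2 == 0:
                yield n

    # Generators are lazy, so keep_even pulls one record at a time from parse.
    return list(keep_even(parse(records)))


materialized_trace, pipelined_trace = [], []
print(run_materialized(["1", "2", "3"], materialized_trace))
print(run_pipelined(["1", "2", "3"], pipelined_trace))
print(materialized_trace)  # all "parse" steps first, then all "keep_even" steps
print(pipelined_trace)     # "parse" and "keep_even" steps interleaved per record
```

Both variants compute the same result; only the order in which the stages touch the records differs, which is what the two traces make visible.
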
