Hi Matt,

My project is for my PhD. So, I am interested in those 0.1% of use cases.
--Pankaj

On Sat, May 4, 2019, 10:48 AM Matt Casters <[email protected]> wrote:

> Anything can be coded in any form or language on any platform. However, doing so takes time and effort. Maintaining the code takes time, as does protecting the investments you made from changes in the ecosystem.
> This is obviously where APIs like Beam come into play quite heavily. New technology seems to come around like fads these days, and that innovation is obviously not a bad thing. We would still be using Map/Reduce if it was. But for people trying to build solutions, changing platforms is a painful process incurring massive costs.
> So with that in mind I would bounce this question back: why on Earth would you *want* to write for a specific platform? Are you *really* interested in those 0.1% use cases, and is it really helping your business move forward? It's possible, but if not, I would strongly advise against it.
>
> Just my 2 cents.
>
> Cheers,
> Matt
> ---
> Matt Casters <[email protected]>
> Senior Solution Architect, Kettle Project Founder
>
> On Fri, May 3, 2019 at 22:42, Jan Lukavský <[email protected]> wrote:
>
>> Hi,
>>
>> On 5/3/19 12:20 PM, Maximilian Michels wrote:
>> > Hi Jan,
>> >
>> >> A typical example could be a machine learning task, where you might have a lot of data cleansing and simple transformations, followed by some ML algorithm (e.g. SVD). One might want to use Spark MLlib for the ML task, but Beam for all the transformations around it. Then, porting to a different runner would mean only providing a different implementation of the SVD, but everything else would remain the same.
>> >
>> > This is a fair point. Of course you could always split up the pipeline into two jobs, e.g. have a native Spark job and a Beam job running on Spark.
>> >
>> > Something that came to my mind is "unsafe" in Rust, which allows you to leave the safe abstractions of Rust and use raw C code. If Beam had something like that which really emphasized the non-portable aspect of a transform, that could change things:
>> >
>> >   Pipeline p = ..
>> >   p.getOptions().setAllowNonPortable(true);
>> >   p.apply(
>> >       NonPortable.of(new MyFlinkOperator(), FlinkRunner.class));
>> >
>> > Again, I'm not sure we want to go down that road, but if there are really specific use cases, we could think about it.
>>
>> Yes, this is exactly what I meant. I think that this doesn't threaten any of Beam's selling points, because this way you declare that you *want* your pipeline to be non-portable, so if you don't do it on purpose, your pipeline will still be portable. The key point here is that the underlying systems are likely to evolve quicker than Beam (in some directions or some ways - Beam might on the other hand bring features to these systems, that's for sure). Examples might be Spark's MLlib or Flink's iterative streams.
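To make the escape hatch sketched above concrete, here is a minimal illustration of what such a transform could look like. Nothing in it exists in Beam today; the names (NonPortable, setAllowNonPortable, MyFlinkOperator) are hypothetical and taken directly from the snippet above.

    import org.apache.beam.sdk.transforms.PTransform;
    import org.apache.beam.sdk.values.PCollection;

    // Hypothetical sketch only -- this class does not exist in Beam. It mirrors the
    // NonPortable.of(new MyFlinkOperator(), FlinkRunner.class) call proposed above.
    public class NonPortable<InputT, OutputT>
        extends PTransform<PCollection<InputT>, PCollection<OutputT>> {

      private final Object nativeOperator;   // e.g. a Flink operator instance
      private final Class<?> requiredRunner; // the only runner expected to translate this transform

      private NonPortable(Object nativeOperator, Class<?> requiredRunner) {
        this.nativeOperator = nativeOperator;
        this.requiredRunner = requiredRunner;
      }

      public static <InputT, OutputT> NonPortable<InputT, OutputT> of(
          Object nativeOperator, Class<?> requiredRunner) {
        return new NonPortable<>(nativeOperator, requiredRunner);
      }

      public Object getNativeOperator() {
        return nativeOperator;
      }

      public Class<?> getRequiredRunner() {
        return requiredRunner;
      }

      @Override
      public PCollection<OutputT> expand(PCollection<InputT> input) {
        // Deliberately no runner-agnostic expansion: a runner that does not recognize
        // this transform (or a pipeline that has not opted in via the hypothetical
        // setAllowNonPortable(true) option) should fail fast, so losing portability
        // is always an explicit, visible decision.
        throw new UnsupportedOperationException(
            "NonPortable transforms can only be translated by " + requiredRunner.getName());
      }
    }

In this sketch, a runner that recognizes the transform would read getNativeOperator() during translation; any other runner hits the exception, which matches the "declare you *want* your pipeline to be non-portable" behavior described above.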
>>
>> >> Generally, there are optimizations that could be really dependent on the pipeline. Only then you might have enough information that can result in some very specific optimization.
>> >
>> > If these patterns can be detected in DAGs, then we can build optimizations into the FlinkRunner. If that is not feasible, then you're out of luck. Could you describe an optimization that you miss in Beam?
>>
>> I think that sometimes you cannot infer all possible optimizations from the DAG itself. If you read from a source (e.g. Kafka), information about how you partition data when writing to Kafka might help you avoid additional shuffling in some cases. That's probably something you could in theory do via some annotations on sources, but the fundamental question here is - do you really want to do that? Or just let the user perform some hard coding when they know that it might help in their particular case (possibly even a corner case)?
>>
>> Jan
>>
>> >
>> > Cheers,
>> > Max
>> >
>> > On 02.05.19 22:44, Jan Lukavský wrote:
>> >> Hi Max,
>> >>
>> >> comments inline.
>> >>
>> >> On 5/2/19 3:29 PM, Maximilian Michels wrote:
>> >>> Couple of comments:
>> >>>
>> >>> * Flink transforms
>> >>>
>> >>> It wouldn't be hard to add a way to run arbitrary Flink operators through the Beam API. Like you said, once you go down that road, you lose the ability to run the pipeline on a different Runner. And that's precisely one of the selling points of Beam. I'm afraid once you even allow 1% non-portable pipelines, you have lost it all.
>> >>
>> >> Absolutely true, but - the question here is "how much effort do I have to invest in order to port the pipeline to a different runner?". If this effort is low, I'd say the pipeline remains "nearly portable". A typical example could be a machine learning task, where you might have a lot of data cleansing and simple transformations, followed by some ML algorithm (e.g. SVD). One might want to use Spark MLlib for the ML task, but Beam for all the transformations around it. Then, porting to a different runner would mean only providing a different implementation of the SVD, but everything else would remain the same.
>> >>>
>> >>> Now, it would be a different story if we had a runner-agnostic way of running Flink operators on top of Beam. For a subset of the Flink transformations that might actually be possible. I'm not sure if it's feasible for Beam to depend on the Flink API.
>> >>>
>> >>> * Pipeline Tuning
>> >>>
>> >>> There are fewer bells and whistles in the Beam API than there are in Flink's. I'd consider that a feature. As Robert pointed out, the Runner can make any optimizations that it wants to do. If you have an idea for an optimization, we could build it into the FlinkRunner.
>> >>
>> >> Generally, there are optimizations that could be really dependent on the pipeline. Only then you might have enough information that can result in some very specific optimization.
>> >>
>> >> Jan
>> >>
>> >>> On 02.05.19 13:44, Robert Bradshaw wrote:
>> >>>> Correct, there's no out-of-the-box way to do this. As mentioned, this would also result in non-portable pipelines. However, even the portability framework is set up such that runners can recognize particular transforms and provide their own implementations thereof (which is how translations are done for ParDo, GroupByKey, etc.), and it is encouraged that runners do this for composite operations they can do better on (e.g. I know Flink maps Reshard directly to Redistribute rather than using the generic pair-with-random-key implementation).
>> >>>>
>> >>>> If you really want to do this for MyFancyFlinkOperator, the current solution is to adapt/extend FlinkRunner (possibly forking code) to understand this operation and its substitution.
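For reference, a minimal sketch of the kind of composite transform such a substitution hinges on. The class name MyRedistribute is illustrative (it is not a Beam API); its default expansion uses the generic pair-with-random-key approach mentioned above, so the pipeline stays portable, while an adapted runner could match the composite as a whole and translate it to a native redistribution instead.

    import java.util.concurrent.ThreadLocalRandom;

    import org.apache.beam.sdk.coders.KvCoder;
    import org.apache.beam.sdk.coders.VarIntCoder;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.GroupByKey;
    import org.apache.beam.sdk.transforms.PTransform;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;

    // Illustrative composite transform. A runner that recognizes it can ignore this
    // default expansion and emit a native redistribution operator instead.
    public class MyRedistribute<T> extends PTransform<PCollection<T>, PCollection<T>> {

      @Override
      public PCollection<T> expand(PCollection<T> input) {
        // Generic, runner-agnostic expansion: pair with a random key, group, drop the key.
        PCollection<KV<Integer, T>> keyed =
            input
                .apply("PairWithRandomKey",
                    ParDo.of(new DoFn<T, KV<Integer, T>>() {
                      @ProcessElement
                      public void processElement(ProcessContext c) {
                        c.output(KV.of(ThreadLocalRandom.current().nextInt(), c.element()));
                      }
                    }))
                .setCoder(KvCoder.of(VarIntCoder.of(), input.getCoder()));

        PCollection<T> output =
            keyed
                .apply(GroupByKey.<Integer, T>create())
                .apply("DropKey",
                    ParDo.of(new DoFn<KV<Integer, Iterable<T>>, T>() {
                      @ProcessElement
                      public void processElement(ProcessContext c) {
                        for (T value : c.element().getValue()) {
                          c.output(value);
                        }
                      }
                    }));
        return output.setCoder(input.getCoder());
      }
    }

Applying it is just in.apply(new MyRedistribute<>()); an adapted or forked FlinkRunner, as suggested above, would match MyRedistribute during translation and substitute its own implementation, in the same way runners already special-case ParDo and GroupByKey.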
>> >>>>
>> >>>> On Thu, May 2, 2019 at 11:09 AM Jan Lukavský <[email protected]> wrote:
>> >>>>>
>> >>>>> Just to clarify - the code I posted is just a proposal, it is not actually possible currently.
>> >>>>>
>> >>>>> On 5/2/19 11:05 AM, Jan Lukavský wrote:
>> >>>>>> Hi,
>> >>>>>>
>> >>>>>> I'd say that what Pankaj meant could be rephrased as "What if I want to manually tune or tweak my Pipeline for a specific runner? Do I have any options for that?". As I understand it, currently the answer is no; PTransforms are somewhat hardwired into runners and the way they expand cannot be controlled or tuned. But that could be changed; maybe something like this would make it possible:
>> >>>>>>
>> >>>>>>   PCollection<...> in = ...;
>> >>>>>>   in.apply(new MyFancyFlinkOperator());
>> >>>>>>
>> >>>>>>   // ...
>> >>>>>>
>> >>>>>>   in.getPipeline().getOptions().as(FlinkPipelineOptions.class)
>> >>>>>>       .setTransformOverride(MyFancyFlinkOperator.class,
>> >>>>>>           new MyFancyFlinkOperatorExpander());
>> >>>>>>
>> >>>>>> The `expander` could have access to the full Flink API (through the runner), and that way any transform could be overridden or customized for specific runtime conditions. Of course, this has a downside: you end up with a non-portable pipeline (also, this is probably in conflict with the general portability framework). But, as I see it, usually you would need to override only a very small part of your pipeline. So, say, 90% of the pipeline would be portable, and in order to port it to a different runner, you would need to implement only a small specific part.
>> >>>>>>
>> >>>>>> Jan
>> >>>>>>
>> >>>>>> On 5/2/19 9:45 AM, Robert Bradshaw wrote:
>> >>>>>>> On Thu, May 2, 2019 at 12:13 AM Pankaj Chand <[email protected]> wrote:
>> >>>>>>>> I thought by choosing Beam, I would get the benefits of both (Spark and Flink), one at a time. Now, I'm understanding that I might not get the full potential from either of the two.
>> >>>>>>>
>> >>>>>>> You get the benefit of being able to choose, without rewriting your pipeline, whether to run on Spark or Flink. Or the next new runner that comes around. As well as the Beam model, API, etc.
>> >>>>>>>
>> >>>>>>>> Example: If I use Beam with Flink, and then a new feature is added to Flink but I cannot access it via Beam, and that feature is not important to the Beam community, then what is the suggested workaround? If I really need that feature, I would not want to re-write my pipeline in Flink from scratch.
>> >>>>>>>
>> >>>>>>> How is this different from "If I used Spark, and a new feature is added to Flink, I cannot access it from Spark. If that feature is not important to the Spark community, then what is the suggested workaround? If I really need that feature, I would not want to re-write my pipeline in Flink from scratch." Or vice versa. Or when a new feature is added to Beam itself (that may not be present in any of the underlying systems). Beam's feature set is neither the intersection nor the union of the feature sets of the runners it has available as execution engines.
>> >>>>>>> (Even the notion of what is meant by "feature" is nebulous enough that it's hard to make this discussion concrete.)
>> >>>>>>>
>> >>>>>>>> Is it possible that in the near future, most of Beam's capabilities would favor Google's Dataflow API? That way, Beam could be used to lure developers and organizations who would typically use Spark/Flink, with the promise of portability. After they get dependent on Beam and cannot afford to re-write their pipelines in Spark/Flink from scratch, they would realize that Beam does not give access to some of the capabilities of the free engines that they may require. Then, they would be told that if they want all possible capabilities and would want to use their code in Beam, they could pay for the Dataflow engine instead.
>> >>>>>>>
>> >>>>>>> Google is very upfront about the fact that they are selling a service to run Beam pipelines in a completely managed way. But Google has *also* invested very heavily in making sure that the portability story is not just talk, for those who need or want to run their jobs on premise or elsewhere (now or in the future). It is our goal that all runners be able to run all pipelines, and this is a community effort.
>> >>>>>>>
>> >>>>>>>> Pankaj
>> >>>>>>>>
>> >>>>>>>> On Tue, Apr 30, 2019 at 6:15 PM Kenneth Knowles <[email protected]> wrote:
>> >>>>>>>>> It is worth noting that Beam isn't solely a portability layer that exposes underlying API features, but a feature-rich layer in its own right, with carefully coherent abstractions. For example, quite early on the SparkRunner supported streaming aspects of the Beam model - watermarks, windowing, triggers - that were not really available any other way. Beam's various features sometimes require just a pass-through API and sometimes require a clever new implementation. And everything is moving constantly. I don't see Beam as following the features of any engine, but rather coming up with new needed data processing abstractions and figuring out how to efficiently implement them on top of various architectures.
>> >>>>>>>>>
>> >>>>>>>>> Kenn
>> >>>>>>>>>
>> >>>>>>>>> On Tue, Apr 30, 2019 at 8:37 AM kant kodali <[email protected]> wrote:
>> >>>>>>>>>> Staying behind doesn't imply one is better than the other, and I didn't mean that in any way, but I fail to see how an abstraction framework like Beam can stay ahead of the underlying execution engines.
>> >>>>>>>>>>
>> >>>>>>>>>> For example, if a new feature is added to the underlying execution engine that doesn't fit the interface of Beam or breaks it, then I would think the interface would need to be changed.
>> >>>>>>>>>> Another example: say the underlying execution engines take different kinds of parameters for the same feature; then it isn't so straightforward to come up with an interface, since there might be very little in common in the first place. So, in that sense, I fail to see how Beam can stay ahead.
>> >>>>>>>>>>
>> >>>>>>>>>> "Of course the API itself is Spark-specific, but it borrows heavily (among other things) on ideas that Beam itself pioneered long before Spark 2.0" Good to know.
>> >>>>>>>>>>
>> >>>>>>>>>> "one of the things Beam has focused on was a language portability framework" Sure, but how important is this for a typical user? Do people stop using a particular tool because it is in an X language? I personally would put features first over language portability, and it's completely fine if that is not in line with Beam's priorities. All said, I can agree that Beam's focus on language portability is great.
>> >>>>>>>>>>
>> >>>>>>>>>> On Tue, Apr 30, 2019 at 2:48 AM Maximilian Michels <[email protected]> wrote:
>> >>>>>>>>>>>> I wouldn't say one is, or will always be, in front of or behind another.
>> >>>>>>>>>>>
>> >>>>>>>>>>> That's a great way to phrase it. I think it is very common to jump to the conclusion that one system is better than the other. In reality it's often much more complicated.
>> >>>>>>>>>>>
>> >>>>>>>>>>> For example, one of the things Beam has focused on was a language portability framework. Do I get this with Flink? No. Does that mean Beam is better than Flink? No. Maybe a better question would be, do I want to be able to run Python pipelines?
>> >>>>>>>>>>>
>> >>>>>>>>>>> This is just an example; there are many more factors to consider.
>> >>>>>>>>>>>
>> >>>>>>>>>>> Cheers,
>> >>>>>>>>>>> Max
>> >>>>>>>>>>>
>> >>>>>>>>>>> On 30.04.19 10:59, Robert Bradshaw wrote:
>> >>>>>>>>>>>> Though we all certainly have our biases, I think it's fair to say that all of these systems are constantly innovating, borrowing ideas from one another, and have their strengths and weaknesses. I wouldn't say one is, or will always be, in front of or behind another.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Take, as the given example, Spark Structured Streaming. Of course the API itself is Spark-specific, but it borrows heavily (among other things) on ideas that Beam itself pioneered long before Spark 2.0, specifically the unification of batch and streaming processing into a single API, and the event-time based windowing (triggering) model for consistently and correctly handling distributed, out-of-order data streams.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Of course there are also operational differences.
>> >>>>>>>>>>>> Spark, for example, is very tied to the micro-batch style of execution, whereas Flink is fundamentally very continuous, and Beam delegates to the underlying runner.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> It is certainly Beam's goal to keep overhead minimal, and one of the primary selling points is the flexibility of portability (of both the execution runtime and the SDK) as your needs change.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> - Robert
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> On Tue, Apr 30, 2019 at 5:29 AM <[email protected]> wrote:
>> >>>>>>>>>>>>> Of course! I suspect Beam will always be one or two steps behind the new functionality that is available or yet to come.
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> For example: Spark Structured Streaming is still not available, no CEP APIs yet, and much more.
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> Sent from my iPhone
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> On Apr 30, 2019, at 12:11 AM, Pankaj Chand <[email protected]> wrote:
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> Will Beam add any overhead or lack certain APIs/functions available in Spark/Flink?
