Wow, this mail should be on the website, Robert, thanks for it.

I still have a point I am trying to understand better: my view is that, once a pipeline is submitted, the only perf-related point is when you hit a flow of data. So a split can be slow, but it is not that big a deal. So a runner integration only needs to optimize the process and nextElement logic, right?
It is almost always doable to batch that, with triggers and other constraints. So the portable model is elegant but not designed to be "fast" in the current state of the implementation. So this all leads to 2 needs:

1. Have some native runner for dev
2. Have some bulk API for prod

In all cases this is decoupled from any runner, no? It can even be a Beam subproject built on top of Beam, which would be very sane and ensure a clear separation of concerns, no?

On May 6, 2018 00:59, "Robert Bradshaw" <rober...@google.com> wrote:

> Portability, at its core, is providing a spec for any runner to talk to any SDK. I personally think it's done a great job in cleaning up the model by forcing us to define a clean boundary (as specified at https://github.com/apache/beam/tree/master/model ) between these two components (even if the implementations of one or the other are (temporarily, I hope for the most part) complicated). The pipeline (on the runner submission side) and work execution (on what has traditionally been called the fn api side) have concrete platform-independent descriptions, rather than being a set of Java classes.
>
> Currently, the portion that lives on the "runner" side of this boundary is often shared among Java runners (via libraries like runners core), but it is all still part of each runner, and because of this it removes the requirement for the Runner to be Java just like it removes the requirement for the SDK to speak Java. (For example, I think a Python Dask runner makes a lot of sense, Dataflow may decide to implement larger portions of its runner in Go or C++ or even behind a service, and I've used the Python ULRunner to run the Java SDK over the Fn API for testing development purposes).
>
> There is also the question of "why docker?" I actually don't see docker as all that intrinsic to the protocol; one only needs to be able to define and bring up workers that communicate on specified ports.
> Docker happens to be a fairly well supported way to package up an arbitrary chunk of code (in any language), together with its nearly arbitrarily specified dependencies/environment, in a way that's well specified and easy to start up.
>
> I would welcome changes to https://github.com/apache/beam/blob/v2.4.0/model/pipeline/src/main/proto/beam_runner_api.proto#L730 that would provide alternatives to docker (one of which comes to mind is "I already brought up a worker(s) for you (which could be the same process that handled pipeline construction in testing scenarios), here's how to connect to it/them.") Another option, which would seem to appeal to you in particular, would be "the worker code is linked into the runner's binary, use this process as the worker" (though note even for java-on-java, it can be advantageous to shield the worker and runner code from each other's environments, dependencies, and version requirements.) This latter should still likely use the FnApi to talk to itself (either over GRPC on local ports, or possibly better via direct function calls eliminating the RPC overhead altogether--this is how the fast local runner in Python works). There may be runner environments well controlled enough that "start up the workers" could be specified as "run this command line." We should make this environment message extensible to other alternatives than "docker container url," though of course we don't want the set of options to grow too large or we lose the promise of portability unless every runner supports every protocol.
>
> Of course, the runner is always free to execute any Fn for which it completely understands the URN and the environment any way it pleases, e.g. directly in process, or even via a lighter-weight mechanism like Jython or Graal, rather than asking an external process to do it. But we need a lowest common denominator for executing arbitrary URNs runners are not expected to understand.
>
> As an aside, there are also technical limitations in implementing Portability by simply requiring all runners to be Java and the portable layer simply being wrappers of UserFnInLanguageX in an equivalent UserFnObjectInJava, executing everything as if it were pure Java. In particular the overheads of unnecessarily crossing the language boundaries many times in a single fused graph are often prohibitive.
>
> Sorry for the long email, but hopefully this helps shed some light on (at least how I see) the portability effort (at the core of the Beam vision statement) as well as concrete actions we can take to decouple it from specific technologies.
>
> - Robert
>
> On Sat, May 5, 2018 at 2:06 PM Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>
> > All are good points.
>
> > The only "?" I keep is: why doesn't Beam use its visitor API to make the portability transversal to all runners, "mutating" the user model before translation? Technically it sounds easy and avoids hacking all the implementations. Was it tested and did it fail?
>
> > On May 5, 2018 22:50, "Thomas Weise" <t...@apache.org> wrote:
>
> >> Docker isn't a silver bullet and may not be the best choice for all environments (I'm also looking at potentially launching SDK workers in a different way), but AFAIK there has not been any alternative proposal for default SDK execution that can handle all of Python, Go and Java.
>
> >> Regardless of the default implementation, we should strive to keep the implementation modular so users can plug in their own replacement as needed. Looking at the prototype implementation, Docker comes downstream of FlinkExecutableStageFunction, and it will be possible to supply a custom implementation by making the translator pluggable (which I intend to work on once backporting to master is complete), and possibly "SDKHarnessManager" itself can also be swapped out.
>
> >> I would also prefer that for Flink and other Java based runners we retain the option to inline executable stages that are in Java. I would expect a good number of use cases to benefit from direct execution in the task manager, and it may be good to offer the user that optimization.
>
> >> Thanks,
> >> Thomas
>
> >> On Sat, May 5, 2018 at 12:54 PM, Eugene Kirpichov <kirpic...@google.com> wrote:
>
> >>> To add on that: Romain, if you are really excited about Graal as a project, here are some constructive suggestions as to what you can do on a reasonably short timeframe:
> >>> - Propose/prototype a design for writing UDFs in Beam SQL using Graal
> >>> - Go through the portability-related design documents, come up with a more precise assessment of what parts are actually dependent on Docker's container format and/or on Docker itself, and propose a plan for untangling this dependency and opening the door to other mechanisms of cross-language execution
>
> >>> On Sat, May 5, 2018 at 12:50 PM Eugene Kirpichov <kirpic...@google.com> wrote:
>
> >>>> Graal is a very young project, currently nowhere near the level of maturity or completeness as to be sufficient for Beam to fully bet its portability vision on it:
> >>>> - Graal currently only claims to support Java and Javascript, with Ruby and R in the status of "some applications may run", Python support "just beginning", and Go lacking altogether.
> >>>> - Regarding existing production usage, the Graal FAQ says it is "a project with new innovative technology in its early stages."
>
> >>>> That said, as Graal matures, I think it would be reasonable to keep an eye on it as a potential future lightweight alternative to containers for pipelines where Graal's level of support is sufficient for this particular pipeline.
>
> >>>> Please also keep in mind that execution of user code is only a small part of the overall portability picture, and dependency on Docker is an even smaller part of that (there is only 1 mention of the word "Docker" in all of Beam's portability protos, and the mention is in an out-of-date TODO comment). I hope this addresses your concerns.
>
> >>>> On Sat, May 5, 2018 at 11:49 AM Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>
> >>>>> Agree
>
> >>>>> The JVM is still mainstream for big data, and it is trivial to have a remote facade to support natives, but there is no point in having it in runners; it concerns only some particular transforms, or even only DoFns and sources...
>
> >>>>> On May 5, 2018 19:03, "Andrew Pilloud" <apill...@google.com> wrote:
>
> >>>>>> Thanks for the examples earlier, I think Hazelcast is a great example of something portability might make more difficult. I'm not working on portability, but my understanding is that the data sent to the runner is a blob of code and the name of the container to run it in. A runner with a native language (java on Hazelcast for example) could run the code directly without the container if it is in a language it supports. So when Hazelcast sees a known java container specified, it just loads the java blob and runs it. When it sees another container it rejects the pipeline. You could use Graal in the Hazelcast runner to do this for a number of languages. I would expect that this could also be done in the direct runner, which similarly provides a native java environment, so portable Java pipelines can be tested without docker?
>
> >>>>>> For another way to frame this: if Beam was originally written in Go, we would be having a different discussion. A pipeline written entirely in java wouldn't be possible, so instead to enable Hazelcast, we would have to be able to run the java from portability without running the container.
>
> >>>>>> Andrew
>
> >>>>>> On Sat, May 5, 2018 at 1:48 AM Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>
> >>>>>>> 2018-05-05 9:27 GMT+02:00 Ismaël Mejía <ieme...@gmail.com>:
>
> >>>>>>>> Graal would not be a viable solution for the reasons Henning and Andrew mentioned, or put in other words, when users choose a programming language they don’t choose only a ‘friendly’ syntax or programming model, they choose also the ecosystem that comes with it, and the libraries that make their life easier. However isolating these user libraries/dependencies is a hard problem and so far the standard solution to this problem is to use operating systems containers via docker.
>
> >>>>>>> Graal solves that, Ismaël. It is the same kind of experience as running npm libs on Nashorn, but with a more unified API to run software in any language.
>
> >>>>>>>> The Beam vision from day zero is to run pipelines written in multiple languages in runners in multiple systems, and so far we are not doing this in particular in the Apache runners. The portability work is the cleanest way to achieve this vision given the constraints.
>
> >>>>>>> Hmm, did I read it wrong and we don't have specific integration of the portable API in runners? This is what is messing up the runners and limiting Beam adoption on existing runners.
> >>>>>>> The portable API is a feature buildable on top of runners, not in runners.
> >>>>>>> Just as a runner implementing the 5-6 primitives can run anything, the portable API should just rely on that and not require more integration.
> >>>>>>> It doesn't prevent deeper integrations, as for some higher level primitives existing in runners, but that is not the case today for runners, so it shouldn't exist IMHO.
>
> >>>>>>>> I agree however that for the Java SDK to Java runner case this can represent additional pain, docker ideally should not be a requirement for Java users with the Direct runner and debugging a pipeline should be as easy as it is today. I think the Universal Local Runner exists to cover the Portable case, but after looking at this JIRA I am not sure if unification is coming (and by consequence if docker would be mandatory).
> >>>>>>>> https://issues.apache.org/jira/browse/BEAM-4239
>
> >>>>>>>> I suppose for the distributed runners that they must implement the full Portability APIs to be considered Beam multi language compliant but they can prefer for performance reasons to translate without the portability APIs the Java to Java case.
>
> >>>>>>> This is my issue: language portability must NOT impact runners at all, it is just a way to forward primitives to a runner.
> >>>>>>> See it as a layer rewriting the pipeline and submitting it. No need to modify any runner.
>
> >>>>>>>> On Sat, May 5, 2018 at 9:11 AM Reuven Lax <re...@google.com> wrote:
>
> >>>>>>>> > A beam cluster with the spark runner would include a spark cluster, plus what's needed for portability, plus the beam sdk.
>
> >>>>>>>> > On Fri, May 4, 2018, 11:55 PM Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>
> >>>>>>>> >> On May 5, 2018 08:43, "Reuven Lax" <re...@google.com> wrote:
>
> >>>>>>>> >> I don't believe we enforce docker anywhere. In fact if someone wanted to run an all-windows beam cluster, they would probably not use docker for their runner (docker runs on Windows, but not efficiently).
>
> >>>>>>>> >> Or it sometimes doesn't run - a colleague hit that yesterday :(.
>
> >>>>>>>> >> What is a "beam cluster", as opposed to a Spark or Flink cluster? How would it work on Windows servers?
>
> >>>>>>>> >> On Fri, May 4, 2018, 11:19 PM Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>
> >>>>>>>> >>> 2018-05-05 2:33 GMT+02:00 Andrew Pilloud <apill...@google.com>:
>
> >>>>>>>> >>>> What docker really buys is a package format and runtime environment that is language and operating system agnostic. The docker packaging and runtime format is the de facto standard for portable applications such as this, and there is a group trying to turn it into an actual standard.
>
> >>>>>>>> >>>> I would agree with you that dockerd has become bloated but there are projects that solve that. There is no longer lock-in to dockerd, there are package format compatible docker replacements that eliminate the performance issues and overhead associated with docker. CRI-O (https://github.com/kubernetes-incubator/cri-o) is a really cool RedHat project which is a minimalist replacement for docker. I was recently working at a startup where I migrated our "data mover" appliance from Docker to CRI-O. Our application was able to get direct access to the ethernet driver and block devices which enabled a huge performance boost but we were also able to run containers produced by docker without modification.
>
> >>>>>>>> >>>> You mention that docker is "detail of one runner+vendor corrupting all the project and adding complexity and work to everyone". It sounds like you have a specific example you'd like to share? Is there a runner that is unable to move to portability because of docker?
>
> >>>>>>>> >>> The IBM one for instance, some custom ones like a Hazelcast-based one, etc... More generally any runner developed outside Beam itself - even if we take a snapshot today, most of Beam's own runners have the same pitfall.
>
> >>>>>>>> >>> Note: I never said docker was a bad technology or anything like that. Let me try to clarify.
>
> >>>>>>>> >>> The main issue is that you enforce docker usage, which is still trendy. It is like Scala, which was promising to kill Java; check what it does today...
> >>>>>>>> >>> It starts to be tooled, but it also has a big impact on the deployment side, and for a good number of Beam users who deploy it outside the cloud it is an issue.
> >>>>>>>> >>> Keep in mind Beam is embeddable by design, it is not a runner environment, and with the docker choice it imposes some environment which is inconsistent with the Beam design itself, and this is where this choice blocks.
>
> >>>>>>>> >>>> Andrew
>
> >>>>>>>> >>>> On Fri, May 4, 2018 at 4:32 PM Henning Rohde <hero...@google.com> wrote:
>
> >>>>>>>> >>>>> Romain,
>
> >>>>>>>> >>>>> Docker, unlike selinux, solves a great number of tangible problems for us with IMO a relatively small tax. It does not have to be the only way. Some of the concerns you bring up along with possibilities were also discussed here: https://s.apache.org/beam-fn-api-container-contract. I encourage you to take a look.
>
> >>>>>>>> >>>>> Thanks,
> >>>>>>>> >>>>> Henning
>
> >>>>>>>> >>>>> On Fri, May 4, 2018 at 3:18 PM Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>
> >>>>>>>> >>>>>> On May 4, 2018 21:31, "Henning Rohde" <hero...@google.com> wrote:
>
> >>>>>>>> >>>>>> I disagree with the characterization of docker and the implications made towards portability. Graal looks like a neat project (and I never thought I would live to see the phrase "Practical Partial Evaluation" ..), but it doesn't address the needs of portability. In addition to Luke's examples, Go and most other languages don't work on it either.
> >>>>>>>> >>>>>> Docker containers also address packaging, OS dependencies, conflicting versions and distribution aspects in addition to truly universal language support.
>
> >>>>>>>> >>>>>> This is wrong: docker also has its conflicts, is not universal (it fails easily on Windows and Mac, as host or not, cloud vendors put layers on top that limit or corrupt it, and it is an imposed infra constraint and a vendor lock-in not welcome in Beam IMHO).
>
> >>>>>>>> >>>>>> This is my main concern. All the work done looks like an implementation detail of one runner+vendor corrupting all the project and adding complexity and work to everyone instead of keeping it localised (technically it is possible).
>
> >>>>>>>> >>>>>> Would you accept it if I forced you to use selinux? Using docker is the same kind of constraint.
>
> >>>>>>>> >>>>>> That said, it's entirely fine for some runners to use Jython, Graal, etc to provide a specialized offering similar to the direct runners, but it would be disjoint from portability IMO.
>
> >>>>>>>> >>>>>> On Fri, May 4, 2018 at 10:14 AM Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>
> >>>>>>>> >>>>>>> On May 4, 2018 17:55, "Lukasz Cwik" <lc...@google.com> wrote:
>
> >>>>>>>> >>>>>>> I did take a look at Graal a while back when thinking about how execution environments could be defined, my concerns were related to it not supporting all of the features of a language.
> >>>>>>>> >>>>>>> For example, it's typical for Python to load and call native libraries and Graal can only execute C/C++ code that has been compiled to LLVM.
> >>>>>>>> >>>>>>> Also, a good amount of people interested in using ML libraries will want access to GPUs to improve performance, which I believe Graal can't support.
>
> >>>>>>>> >>>>>>> It can be a very useful way to run simple lambda functions written in some language directly without needing to use a docker environment but you could probably use something even lighter weight than Graal that is language specific like Jython.
>
> >>>>>>>> >>>>>>> Right, the jsr223 impl works very well but you can also get a perf boost using native code (like the v8 java binding for js for instance). It is way more efficient than docker most of the time and not code intrusive at all in runners, so likely easier to adopt and maintain. That said, all is doable behind the jsr223 so maybe not a big deal in terms of API. We just need to ensure the portability work stays clean and actually portable and doesn't impact runners as the POCs done until today did.
>
> >>>>>>>> >>>>>>> Works for me.
>
> >>>>>>>> >>>>>>> On Thu, May 3, 2018 at 10:05 PM Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>
> >>>>>>>> >>>>>>>> Hi guys
>
> >>>>>>>> >>>>>>>> For some time there have been efforts to have language-portable support in Beam, but I can't really find a case where it "works" being based on docker, except for some vendor specific infra.
>
> >>>>>>>> >>>>>>>> Current solution:
>
> >>>>>>>> >>>>>>>> 1. Is runner intrusive (which is bad for Beam and prevents adoption by big data vendors)
> >>>>>>>> >>>>>>>> 2. Based on docker (which assumes a runtime environment and is very ops/infra intrusive and likely too $$ quite often for what it brings)
>
> >>>>>>>> >>>>>>>> Did anyone have a look at Graal, which seems a way to make the feature doable in a lighter manner and optimized compared to default jsr223 impls?
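
To make the environment discussion in this thread concrete, here is a minimal sketch of the dispatch it describes: a runner executes a Fn in-process when it fully understands both the URN and the environment, falls back to the Fn API for environment kinds it knows how to bring up, and otherwise rejects the stage. This is only an illustration under stated assumptions, not Beam's actual API: `EnvKind`, `Environment`, `Stage`, `Runner`, the URNs and the image names are all hypothetical, and the four environment kinds simply mirror the alternatives listed above (docker container url, already-running workers, worker linked into the runner binary, command line).

```python
from dataclasses import dataclass
from enum import Enum


class EnvKind(Enum):
    # Hypothetical environment kinds, mirroring the alternatives in the thread.
    DOCKER = "docker"      # payload: container image URL
    EXTERNAL = "external"  # payload: address of workers that already run
    EMBEDDED = "embedded"  # worker code linked into the runner process
    PROCESS = "process"    # payload: command line that starts a worker


@dataclass(frozen=True)
class Environment:
    kind: EnvKind
    payload: str = ""


@dataclass(frozen=True)
class Stage:
    fn_urn: str
    environment: Environment


class Runner:
    """Hypothetical runner that decides how to execute each stage."""

    def __init__(self, native_urns, native_environments, supported_kinds):
        self.native_urns = frozenset(native_urns)
        self.native_environments = frozenset(native_environments)
        self.supported_kinds = frozenset(supported_kinds)

    def plan(self, stage):
        # In-process only when both the URN and the environment are understood.
        if (stage.fn_urn in self.native_urns
                and stage.environment in self.native_environments):
            return "in-process"
        # Otherwise use the lowest common denominator: an external worker.
        if stage.environment.kind in self.supported_kinds:
            return "fn-api:" + stage.environment.kind.value
        raise ValueError(
            "unsupported environment: " + stage.environment.kind.value)


java_env = Environment(EnvKind.DOCKER, "example/java-sdk")
python_env = Environment(EnvKind.DOCKER, "example/python-sdk")

# A Java-only runner (the Hazelcast case above): runs known Java, rejects the rest.
java_only = Runner({"urn:example:java_dofn"}, {java_env}, supported_kinds=())

# A portable runner: inlines known Java, delegates everything else to workers.
portable = Runner({"urn:example:java_dofn"}, {java_env},
                  supported_kinds={EnvKind.DOCKER, EnvKind.EXTERNAL})
```

With these definitions, `java_only.plan(...)` returns `"in-process"` for the Java stage and raises `ValueError` for the Python one, while `portable.plan(...)` routes the Python stage to a docker worker. Keeping the in-process check ahead of the Fn API fallback is what would let a Java runner keep inlining Java stages while still accepting portable pipelines.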