Wow, this mail should be on the website, Robert. Thanks for it.

I still have one point I would like to understand better: my view is that,
once a pipeline is submitted, performance only matters when you hit the flow
of data. So a split can be slow, but that is not a big deal. A runner
integration therefore only needs to optimize the process and nextElement
logic, right?

It is almost always doable to batch that, subject to triggers and other
constraints. So the portable model is elegant, but it was not designed to be
"fast" in its current state of implementation.
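
To illustrate what I mean by batching, here is a rough sketch. It is not Beam
code: `Batcher`, `max_size`, `max_delay_s` and the `send_batch` callback are
names invented for this mail. Elements are buffered and flushed either when
the buffer is full or when a time bound has expired, so the per-element cost
of crossing the runner/SDK boundary is amortized over a whole batch.

```python
import time
from typing import Any, Callable, List


class Batcher:
    """Buffers elements and hands them to `send_batch` in bulk.

    Hypothetical helper, not part of Beam. It flushes when `max_size`
    elements are buffered, or when `max_delay_s` seconds have passed since
    the first buffered element (a crude stand-in for a trigger constraint).
    The time bound is only checked when a new element arrives.
    """

    def __init__(
        self,
        send_batch: Callable[[List[Any]], None],
        max_size: int = 100,
        max_delay_s: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._send_batch = send_batch
        self._max_size = max_size
        self._max_delay_s = max_delay_s
        self._clock = clock
        self._buffer: List[Any] = []
        self._first_at = 0.0

    def process(self, element: Any) -> None:
        """Adds one element and flushes if a size or time bound is reached."""
        if not self._buffer:
            # Remember when the current batch started.
            self._first_at = self._clock()
        self._buffer.append(element)
        too_big = len(self._buffer) >= self._max_size
        too_old = self._clock() - self._first_at >= self._max_delay_s
        if too_big or too_old:
            self.flush()

    def flush(self) -> None:
        """Sends whatever is buffered; does nothing if the buffer is empty."""
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._send_batch(batch)


if __name__ == "__main__":
    batcher = Batcher(print, max_size=3)
    for i in range(7):
        batcher.process(i)
    batcher.flush()  # send the trailing partial batch
```

The caller still has to call `flush()` at the end of a bundle, which is where
the trigger and other constraints would come into play.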


So this all leads to 2 needs:

1. Have some native runner for dev
2. Have some bulk api for prod

In all cases this is decoupled from any runner, no? It could even be a Beam
subproject built on top of Beam, which would be very sane and would ensure a
clear separation of concerns, no?

On May 6, 2018 at 00:59, "Robert Bradshaw" <rober...@google.com> wrote:

> Portability, at its core, is providing a spec for any runner to talk to any
> SDK. I personally think it's done a great job in cleaning up the model by
> forcing us to define a clean boundary (as specified at
> https://github.com/apache/beam/tree/master/model ) between these two
> components (even if the implementations of one or the other are
> (temporarily, I hope for the most part) complicated). The pipeline (on the
> runner submission side) and work execution (on what has traditionally been
> called the fn api side) have concrete platform-independent descriptions,
> rather than being a set of Java classes.
>
> Currently, the portion that lives on the "runner" side of this boundary is
> often shared among Java runners (via libraries like runners core), but it
> is all still part of each runner, and because of this it removes the
> requirement for the Runner to be Java just like it removes the requirement for
> the SDK to speak Java. (For example, I think a Python Dask runner makes a
> lot of sense, Dataflow may decide to implement larger portions of its
> runner in Go or C++ or even behind a service, and I've used the Python
> ULRunner to run the Java SDK over the Fn API for testing development
> purposes).
>
> There is also the question of "why docker?" I actually don't see docker as all
> that intrinsic to the protocol; one only needs to be able to define and
> bring up workers that communicate on specified ports. Docker happens to be
> a fairly well supported way to package up an arbitrary chunk of code (in
> any language), together with its nearly arbitrarily specified
> dependencies/environment, in a way that's well specified and easy to start
> up.
>
> I would welcome changes to
> https://github.com/apache/beam/blob/v2.4.0/model/pipeline/src/main/proto/beam_runner_api.proto#L730
> that would provide alternatives to docker (one of which comes to mind is "I
> already brought up a worker(s) for you (which could be the same process
> that handled pipeline construction in testing scenarios), here's how to
> connect to it/them.") Another option, which would seem to appeal to you in
> particular, would be "the worker code is linked into the runner's binary,
> use this process as the worker" (though note even for java-on-java, it can
> be advantageous to shield the worker and runner code from each other's
> environments, dependencies, and version requirements.) This latter should
> still likely use the FnApi to talk to itself (either over GRPC on local
> ports, or possibly better via direct function calls eliminating the RPC
> overhead altogether--this is how the fast local runner in Python works).
> There may be runner environments well controlled enough that "start up the
> workers" could be specified as "run this command line." We should make this
> environment message extensible to other alternatives than "docker container
> url," though of course we don't want the set of options to grow too large
> or we lose the promise of portability unless every runner supports every
> protocol.
>
> Of course, the runner is always free to execute any Fn for which it
> completely understands the URN and the environment any way it pleases, e.g.
> directly in process, or even via lighter-weight mechanism like Jython or
> Graal, rather than asking an external process to do it. But we need a
> lowest common denominator for executing arbitrary URNs runners are not
> expected to understand.
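
To make the paragraph above concrete, here is a minimal sketch of that
dispatch, under simplifying assumptions: the runner keys only on the URN (the
environment check is left out), and `StageExecutor`, `upper_fn` and the
`example:fn:...` URNs are hypothetical names, not Beam classes or real Beam
URNs. A known URN is executed directly in process; anything else is handed to
an external worker, which is the lowest common denominator.

```python
from typing import Any, Callable, Dict, Iterable, List

# An in-process implementation takes the Fn payload and the input elements.
InProcessFn = Callable[[bytes, Iterable[Any]], List[Any]]
# The fallback also receives the URN, since the runner does not understand it.
ExternalFn = Callable[[str, bytes, Iterable[Any]], List[Any]]


class StageExecutor:
    """Hypothetical dispatcher, not a Beam class."""

    def __init__(self, known: Dict[str, InProcessFn], external: ExternalFn) -> None:
        self._known = dict(known)
        self._external = external

    def execute(self, urn: str, payload: bytes, elements: Iterable[Any]) -> List[Any]:
        impl = self._known.get(urn)
        if impl is not None:
            # The runner fully understands this URN: run it directly.
            return impl(payload, elements)
        # Unknown URN: delegate to an external worker (container, process, ...).
        return self._external(urn, payload, elements)


def upper_fn(payload: bytes, elements: Iterable[Any]) -> List[Any]:
    """Toy in-process Fn: upper-cases the elements and ignores the payload."""
    return [str(e).upper() for e in elements]
```

A runner would build the `known` mapping from the URNs it natively supports
and pass whatever worker mechanism it has (docker or otherwise) as `external`.
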
>
> As an aside, there are also technical limitations in implementing
> Portability by simply requiring all runners to be Java and the portable
> layer simply being wrappers of UserFnInLanguageX in an equivalent
> UserFnObjectInJava,
> executing everything as if it were pure Java. In particular the overheads
> of unnecessarily crossing the language boundaries many times in a single
> fused graph are often prohibitive.
>
> Sorry for the long email, but hopefully this helps shed some light on (at
> least how I see) the portability effort (at the core of the Beam vision
> statement) as well as concrete actions we can take to decouple it from
> specific technologies.
>
> - Robert
>
>
> On Sat, May 5, 2018 at 2:06 PM Romain Manni-Bucau <rmannibu...@gmail.com>
> wrote:
>
> > All are good points.
>
> > The only "?" I keep is: why doesn't Beam use its visitor API to make
> portability transversal to all runners, "mutating" the user model before
> translation? Technically it sounds easy and avoids hacking every
> implementation. Was it tested, and did it fail?
>
> > On May 5, 2018 at 22:50, "Thomas Weise" <t...@apache.org> wrote:
>
> >> Docker isn't a silver bullet and may not be the best choice for all
> environments (I'm also looking at potentially launching SDK workers in a
> different way), but AFAIK there has not been any alternative proposal for
> default SDK execution that can handle all of Python, Go and Java.
>
> >> Regardless of the default implementation, we should strive to keep the
> implementation modular so users can plug in their own replacement as
> needed. Looking at the prototype implementation, Docker comes downstream of
> FlinkExecutableStageFunction, and it will be possible to supply a custom
> implementation by making the translator pluggable (which I intend to work
> on once backporting to master is complete), and possibly
> "SDKHarnessManager" itself can also be swapped out.
>
> >> I would also prefer that for Flink and other Java based runners we
> retain the option to inline executable stages that are in Java. I would
> expect a good number of use cases to benefit from direct execution in the
> task manager, and it may be good to offer the user that optimization.
>
> >> Thanks,
> >> Thomas
>
>
>
> >> On Sat, May 5, 2018 at 12:54 PM, Eugene Kirpichov <kirpic...@google.com> wrote:
>
> >>> To add on that: Romain, if you are really excited about Graal as a
> project, here are some constructive suggestions as to what you can do on a
> reasonably short timeframe:
> >>> - Propose/prototype a design for writing UDFs in Beam SQL using Graal
> >>> - Go through the portability-related design documents, come up with a
> more precise assessment of what parts are actually dependent on Docker's
> container format and/or on Docker itself, and propose a plan for untangling
> this dependency and opening the door to other mechanisms of cross-language
> execution
>
> >>> On Sat, May 5, 2018 at 12:50 PM Eugene Kirpichov <kirpic...@google.com> wrote:
>
> >>>> Graal is a very young project, currently nowhere near the level of
> maturity or completeness as to be sufficient for Beam to fully bet its
> portability vision on it:
> >>>> - Graal currently only claims to support Java and JavaScript, with
> Ruby and R in the status of "some applications may run", Python support
> "just beginning", and Go lacking altogether.
> >>>> - Regarding existing production usage, the Graal FAQ says it is "a
> project with new innovative technology in its early stages."
>
> >>>> That said, as Graal matures, I think it would be reasonable to keep an
> eye on it as a potential future lightweight alternative to containers for
> pipelines where Graal's level of support is sufficient for this particular
> pipeline.
>
> >>>> Please also keep in mind that execution of user code is only a small
> part of the overall portability picture, and dependency on Docker is an
> even smaller part of that (there is only 1 mention of the word "Docker" in
> all of Beam's portability protos, and the mention is in an out-of-date TODO
> comment). I hope this addresses your concerns.
>
> >>>> On Sat, May 5, 2018 at 11:49 AM Romain Manni-Bucau <
> rmannibu...@gmail.com> wrote:
>
> >>>>> Agree
>
> >>>>> The JVM is still mainstream for big data, and it is trivial to have a
> remote facade to support native code, but there is no point in having it in
> runners: it only concerns some particular transforms, or even just DoFns and
> sources...
>
>
> >>>>> On May 5, 2018 at 19:03, "Andrew Pilloud" <apill...@google.com> wrote:
>
> >>>>>> Thanks for the examples earlier, I think Hazelcast is a great
> example of something portability might make more difficult. I'm not working
> on portability, but my understanding is that the data sent to the runner is
> a blob of code and the name of the container to run it in. A runner with a
> native language (java on Hazelcast for example) could run the code directly
> without the container if it is in a language it supports. So when Hazelcast
> sees a known java container specified, it just loads the java blob and runs
> it. When it sees another container it rejects the pipeline. You could use
> Graal in the Hazelcast runner to do this for a number of languages. I would
> expect that this could also be done in the direct runner, which similarly
> provides a native java environment, so portable Java pipelines can be
> tested without docker?
>
> >>>>>> For another way to frame this: if Beam was originally written in Go,
> we would be having a different discussion. A pipeline written entirely in
> java wouldn't be possible, so instead to enable Hazelcast, we would have to
> be able to run the java from portability without running the container.
>
> >>>>>> Andrew
>
> >>>>>> On Sat, May 5, 2018 at 1:48 AM Romain Manni-Bucau <
> rmannibu...@gmail.com> wrote:
>
>
>
> >>>>>>> 2018-05-05 9:27 GMT+02:00 Ismaël Mejía <ieme...@gmail.com>:
>
> >>>>>>>> Graal would not be a viable solution for the reasons Henning and
> Andrew
> >>>>>>>> mentioned, or put in other words, when users choose a programming
> language
> >>>>>>>> they don’t choose only a ‘friendly’ syntax or programming model,
> they
> >>>>>>>> choose also the ecosystem that comes with it, and the libraries
> that make
> >>>>>>>> their life easier. However isolating these user
> libraries/dependencies is a
> >>>>>>>> hard problem and so far the standard solution to this problem is
> to use
> >>>>>>>> operating systems containers via docker.
>
>
> >>>>>>> Graal solves that, Ismaël. It is the same kind of experience as running
> npm libs on Nashorn, but with a more unified API to run software in any
> language.
>
>
>
> >>>>>>>> The Beam vision from day zero is to run pipelines written in
> multiple
> >>>>>>>> languages in runners in multiple systems, and so far we are not
> doing this
> >>>>>>>> in particular in the Apache runners. The portability work is the
> cleanest
> >>>>>>>> way to achieve this vision given the constraints.
>
>
> >>>>>>> Hmm, did I read it wrong, and we don't have specific integration of
> the portable API in runners? This is what is messing up the runners and
> limiting Beam adoption on existing runners.
> >>>>>>> The portable API is a feature that can be built on top of runners, not
> in runners.
> >>>>>>> Just as a runner implementing the 5-6 primitives can run anything,
> the portable API should simply rely on that and not require more integration.
> >>>>>>> This doesn't prevent deeper integrations, such as for higher-level
> primitives that exist in some runners, but that is not the case for runners
> today, so it shouldn't exist IMHO.
>
>
>
> >>>>>>>> I agree however that for the Java SDK to Java runner case this can
> >>>>>>>> represent additional pain, docker ideally should not be a
> requirement for
> >>>>>>>> Java users with the Direct runner and debugging a pipeline should
> be as
> >>>>>>>> easy as it is today. I think the Universal Local Runner exists to
> cover
> >>>>>>>> the Portable case, but after looking at this JIRA I am not sure if
> >>>>>>>> unification is coming (and by consequence if docker would be
> mandatory).
> >>>>>>>> https://issues.apache.org/jira/browse/BEAM-4239
>
> >>>>>>>> I suppose for the distributed runners that they must implement the
> full
> >>>>>>>> Portability APIs to be considered Beam multi language compliant
> but they
> >>>>>>>> can prefer for performance reasons to translate without the
> portability
> >>>>>>>> APIs the Java to Java case.
>
>
>
> >>>>>>> This is my issue: language portability must NOT impact runners at
> all; it is just a way to forward primitives to a runner.
> >>>>>>> See it as a layer rewriting the pipeline and submitting it. No need
> to modify any runner.
>
>
> >>>>>>>> On Sat, May 5, 2018 at 9:11 AM Reuven Lax <re...@google.com>
> wrote:
>
> >>>>>>>> > A beam cluster with the spark runner would include a spark
> cluster, plus
> >>>>>>>> what's needed for portability, plus the beam sdk.
>
> >>>>>>>> > On Fri, May 4, 2018, 11:55 PM Romain Manni-Bucau <
> rmannibu...@gmail.com>
> >>>>>>>> wrote:
>
>
>
> >>>>>>>> >> On May 5, 2018 at 08:43, "Reuven Lax" <re...@google.com> wrote:
>
> >>>>>>>> >> I don't believe we enforce docker anywhere. In fact if someone
> wanted to
> >>>>>>>> run an all-windows beam cluster, they would probably not use
> docker for
> >>>>>>>> their runner (docker runs on Windows, but not efficiently).
>
>
>
> >>>>>>>> >> Or sometimes it doesn't run at all - a colleague hit that yesterday :(.
>
> >>>>>>>> >> What is a "Beam cluster", as opposed to a Spark or Flink cluster?
> How
> >>>>>>>> would it work on windows servers?
>
>
> >>>>>>>> >> On Fri, May 4, 2018, 11:19 PM Romain Manni-Bucau <
> rmannibu...@gmail.com>
> >>>>>>>> wrote:
>
>
>
> >>>>>>>> >>> 2018-05-05 2:33 GMT+02:00 Andrew Pilloud <apill...@google.com>:
>
> >>>>>>>> >>>> What docker really buys is a package format and runtime
> environment
> >>>>>>>> that is language and operating system agnostic. The docker
> packaging and
> >>>>>>>> runtime format is the de facto standard for portable applications
> such as
> >>>>>>>> this, and there is a group trying to turn it into an actual
> standard.
>
> >>>>>>>> >>>> I would agree with you that dockerd has become bloated but
> there are
> >>>>>>>> projects that solve that. There is no longer lock-in to dockerd,
> there are
> >>>>>>>> package format compatible docker replacements that eliminate the
> >>>>>>>> performance issues and overhead associated with docker. CRI-O (
> >>>>>>>> https://github.com/kubernetes-incubator/cri-o) is a really cool
> RedHat
> >>>>>>>> project which is a minimalist replacement for docker. I was
> recently
> >>>>>>>> working at a startup where I migrated our "data mover" appliance
> from
> >>>>>>>> Docker to CRI-O. Our application was able to get direct access to
> the
> >>>>>>>> ethernet driver and block devices which enabled a huge performance
> boost
> >>>>>>>> but we were also able to run containers produced by docker without
> >>>>>>>> modification.
>
> >>>>>>>> >>>> You mention that docker is "detail of one runner+vendor
> corrupting all
> >>>>>>>> the project and adding complexity and work to everyone". It sounds
> like you
> >>>>>>>> have a specific example you'd like to share? Is there a runner
> that is
> >>>>>>>> unable to move to portability because of docker?
>
>
> >>>>>>>> >>> The IBM one for instance, some custom ones like a Hazelcast-based
> >>>>>>>> one, etc... More generally, any runner developed outside Beam
> >>>>>>>> itself - even if we take a snapshot today, most of Beam's own
> >>>>>>>> runners have the same pitfall.
>
> >>>>>>>> >>> Note: I never said Docker was a bad technology or anything like
> >>>>>>>> that. Let me try to clarify.
>
> >>>>>>>> >>> The main issue is that you enforce the use of Docker, which is
> >>>>>>>> still trendy. It is like Scala, which was promising to kill Java;
> >>>>>>>> check what it does today...
> >>>>>>>> >>> It is starting to get tooling, but it also has a big impact on
> >>>>>>>> the deployment side, and for a good number of Beam users who deploy
> >>>>>>>> outside the cloud it is an issue.
> >>>>>>>> >>> Keep in mind Beam is embeddable by design; it is not a runner
> >>>>>>>> environment, and with the Docker choice it imposes an environment
> >>>>>>>> which is inconsistent with Beam's design itself, and this is where
> >>>>>>>> this choice blocks.
>
>
>
> >>>>>>>> >>>> Andrew
>
> >>>>>>>> >>>> On Fri, May 4, 2018 at 4:32 PM Henning Rohde <
> hero...@google.com>
> >>>>>>>> wrote:
>
> >>>>>>>> >>>>> Romain,
>
> >>>>>>>> >>>>> Docker, unlike selinux, solves a great number of tangible
> problems
> >>>>>>>> for us with IMO a relatively small tax. It does not have to be the
> only
> >>>>>>>> way. Some of the concerns you bring up along with possibilities
> were also
> >>>>>>>> discussed here:
> >>>>>>>> https://s.apache.org/beam-fn-api-container-contract. I
> >>>>>>>> encourage you to take a look.
>
> >>>>>>>> >>>>> Thanks,
> >>>>>>>> >>>>>   Henning
>
>
> >>>>>>>> >>>>> On Fri, May 4, 2018 at 3:18 PM Romain Manni-Bucau <
> >>>>>>>> rmannibu...@gmail.com> wrote:
>
>
>
> >>>>>>>> >>>>>> On May 4, 2018 at 21:31, "Henning Rohde" <hero...@google.com> wrote:
>
> >>>>>>>> >>>>>> I disagree with the characterization of docker and the
> implications
> >>>>>>>> made towards portability. Graal looks like a neat project (and I
> never
> >>>>>>>> thought I would live to see the phrase "Practical Partial
> Evaluation" ..),
> >>>>>>>> but it doesn't address the needs of portability. In addition to
> Luke's
> >>>>>>>> examples, Go and most other languages don't work on it either.
> Docker
> >>>>>>>> containers also address packaging, OS dependencies, conflicting
> versions
> >>>>>>>> and distribution aspects in addition to truly universal language
> support.
>
>
> >>>>>>>> >>>>>> This is wrong: Docker also has its conflicts and is not
> >>>>>>>> universal (it fails easily on Windows and Mac, as host or not;
> >>>>>>>> cloud vendors add layers limiting or corrupting it; and it is an
> >>>>>>>> imposed infra constraint and a vendor lock-in not welcome in Beam
> >>>>>>>> IMHO).
>
> >>>>>>>> >>>>>> This is my main concern. All the work done looks like an
> >>>>>>>> implementation detail of one runner+vendor corrupting the whole
> project and
> >>>>>>>> adding complexity and work to everyone instead of keeping it
> localised
> >>>>>>>> (technically it is possible).
>
> >>>>>>>> >>>>>> Would you accept it if I forced you to use SELinux? Using
> >>>>>>>> Docker is the same kind of constraint.
>
>
> >>>>>>>> >>>>>> That said, it's entirely fine for some runners to use
> Jython, Graal,
> >>>>>>>> etc to provide a specialized offering similar to the direct
> runners, but it
> >>>>>>>> would be disjoint from portability IMO.
>
> >>>>>>>> >>>>>> On Fri, May 4, 2018 at 10:14 AM Romain Manni-Bucau <
> >>>>>>>> rmannibu...@gmail.com> wrote:
>
>
>
> >>>>>>>> >>>>>>> On May 4, 2018 at 17:55, "Lukasz Cwik" <lc...@google.com> wrote:
>
> >>>>>>>> >>>>>>> I did take a look at Graal a while back when thinking
> about how
> >>>>>>>> execution environments could be defined, my concerns were related
> to it not
> >>>>>>>> supporting all of the features of a language.
> >>>>>>>> >>>>>>> For example, it's typical for Python to load and call
> native
> >>>>>>>> libraries and Graal can only execute C/C++ code that has been
> compiled to
> >>>>>>>> LLVM.
> >>>>>>>> >>>>>>> Also, a good amount of people interested in using ML
> libraries will
> >>>>>>>> want access to GPUs to improve performance which I believe that
> Graal can't
> >>>>>>>> support.
>
> >>>>>>>> >>>>>>> It can be a very useful way to run simple lambda functions
> written
> >>>>>>>> in some language directly without needing to use a docker
> environment but
> >>>>>>>> you could probably use something even lighter weight than Graal
> that is
> >>>>>>>> language specific like Jython.
>
>
>
> >>>>>>>> >>>>>>> Right, the JSR 223 implementation works very well, but you can
> >>>>>>>> also get a perf boost using native bindings (like the V8 Java
> >>>>>>>> binding for JS, for instance). It is way more efficient than Docker
> >>>>>>>> most of the time and not code-intrusive at all in runners, so it is
> >>>>>>>> likely easier to adopt and maintain. That said, all of it is doable
> >>>>>>>> behind JSR 223, so maybe it is not a big deal in terms of API. We
> >>>>>>>> just need to ensure the portability work stays clean and actually
> >>>>>>>> portable, and doesn't impact runners the way the POCs done so far
> >>>>>>>> did.
>
> >>>>>>>> >>>>>>> Works for me.
>
>
> >>>>>>>> >>>>>>> On Thu, May 3, 2018 at 10:05 PM Romain Manni-Bucau <
> >>>>>>>> rmannibu...@gmail.com> wrote:
>
> >>>>>>>> >>>>>>>> Hi guys
>
> >>>>>>>> >>>>>>>> For some time there have been efforts to add language-portable
> >>>>>>>> support to Beam, but I can't really find a case where it "works"
> >>>>>>>> while being based on Docker, except for some vendor-specific infra.
>
> >>>>>>>> >>>>>>>> Current solution:
>
> >>>>>>>> >>>>>>>> 1. Is runner intrusive (which is bad for Beam and prevents
> >>>>>>>> adoption by big data vendors)
> >>>>>>>> >>>>>>>> 2. Is based on Docker (which assumes a runtime environment, is
> >>>>>>>> very ops/infra intrusive, and is quite often too expensive for what
> >>>>>>>> it brings)
>
> >>>>>>>> >>>>>>>> Has anyone had a look at Graal, which seems to be a way to make
> >>>>>>>> the feature doable in a lighter and more optimized manner than the
> >>>>>>>> default JSR 223 implementations?
>
