Does the same runner work for Java pipelines? (I assume so, given that it
uses the portability framework.) If so, does it provide a similar speedup?

On Fri, Feb 16, 2018 at 7:37 PM Robert Bradshaw <rober...@google.com> wrote:

> If there are no concerns, I say let's merge this.
>
> On Fri, Feb 16, 2018 at 9:39 AM, Charles Chen <c...@google.com> wrote:
> > I hope those interested have had time to test this out.  I have sent out
> > https://github.com/apache/beam/pull/4696 to switch to using this fast
> > runner as the default DirectRunner for local execution.  Let me know if
> > there are any concerns.
> >
> > On Tue, Feb 13, 2018 at 12:17 PM Charles Chen <c...@google.com> wrote:
> >>
> >> This is now checked into master.  You can use it by setting
> >> --runner=SwitchingDirectRunner.  Please let us know if you run into any
> >> issues.
> >>
> >>
> >> On Thu, Feb 8, 2018 at 10:30 AM Romain Manni-Bucau
> >> <rmannibu...@gmail.com> wrote:
> >>>
> >>> Very interesting! Sounds like a sane way forward for Beam's future, and
> >>> I'm very happy it is consistent with the current Java experience: no
> >>> need to interlace runners at the end. It makes the design, code, and
> >>> user experience way better than trying to put everything in the direct
> >>> runner :).
> >>>
> >>> On Feb 8, 2018 at 19:20, "María García Herrero" <mari...@google.com>
> >>> wrote:
> >>>>
> >>>> Amazing improvement, Charles.
> >>>> Thanks for the effort!
> >>>>
> >>>>
> >>>> On Thu, Feb 8, 2018 at 10:14 AM Eugene Kirpichov
> >>>> <kirpic...@google.com> wrote:
> >>>>>
> >>>>> Sounds awesome, congratulations and thanks for making this happen!
> >>>>>
> >>>>> On Thu, Feb 8, 2018 at 10:07 AM Raghu Angadi <rang...@google.com>
> >>>>> wrote:
> >>>>>>
> >>>>>> This is terrific news! Thanks Charles.
> >>>>>>
> >>>>>> On Wed, Feb 7, 2018 at 5:55 PM, Charles Chen <c...@google.com>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>> Local execution of Beam pipelines on the Python DirectRunner
> >>>>>>> currently suffers from performance issues, which makes it hard for
> >>>>>>> pipeline authors to iterate, especially on medium to large
> >>>>>>> datasets.  We would like to optimize this and make it a better
> >>>>>>> experience for Beam users.
> >>>>>>>
> >>>>>>>
> >>>>>>> The FnApiRunner was written as a way of leveraging the portability
> >>>>>>> framework execution code path for local portability development.
> >>>>>>> We've found it also provides great speedups in batch execution with
> >>>>>>> no user changes required, so we propose to switch to using this
> >>>>>>> runner by default in batch pipelines.  For example, WordCount on the
> >>>>>>> Shakespeare dataset with a single CPU core now takes 50 seconds to
> >>>>>>> run, compared to 12 minutes before; this is roughly a 15x
> >>>>>>> performance improvement that users can get for free, with no user
> >>>>>>> pipeline changes.
> >>>>>>>
> >>>>>>>
> >>>>>>> The JIRA for this change is here
> >>>>>>> (https://issues.apache.org/jira/browse/BEAM-3644), and a candidate
> >>>>>>> patch is available here (https://github.com/apache/beam/pull/4634).
> >>>>>>> I have been working over the last month on making this an automatic
> >>>>>>> drop-in replacement for the current DirectRunner when applicable.
> >>>>>>> Before it becomes the default, you can try this runner now by
> >>>>>>> manually specifying
> >>>>>>> apache_beam.runners.portability.fn_api_runner.FnApiRunner as the
> >>>>>>> runner.
> >>>>>>>
> >>>>>>>
> >>>>>>> Even with this change, local Python pipeline execution can only
> >>>>>>> effectively use one core because of the Python GIL.  A natural next
> >>>>>>> step to further improve performance will be to refactor the
> >>>>>>> FnApiRunner to allow for multi-process execution.  This is being
> >>>>>>> tracked here (https://issues.apache.org/jira/browse/BEAM-3645).
> >>>>>>>
> >>>>>>>
> >>>>>>> Best,
> >>>>>>>
> >>>>>>> Charles
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>>
> >>>> Impact is the effect that wouldn’t have happened if you hadn’t done
> >>>> what you did.
> >>>>
> >>>>
> >
>
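
For anyone who wants to try the runner named in the quoted thread, here is a minimal sketch of selecting it by its fully qualified name through the --runner flag. The helper names (runner_args, extract_words, run_wordcount) and the word-splitting regex are my own assumptions for illustration and are not taken from the Beam WordCount example. run_wordcount needs apache_beam installed; the other two helpers are plain Python.

```python
import re

# Fully qualified runner name quoted in the thread.
FN_API_RUNNER = "apache_beam.runners.portability.fn_api_runner.FnApiRunner"


def runner_args(runner=FN_API_RUNNER):
    """Build the command-line flag that selects a runner by name."""
    return ["--runner=" + runner]


def extract_words(line):
    """Split a line into words (hypothetical tokenizer, letters and apostrophes)."""
    return re.findall(r"[A-Za-z']+", line)


def run_wordcount(lines, argv):
    """Count words with Beam on the runner selected in argv.

    Assumes apache_beam is installed; the imports are local so the
    helpers above stay usable without it.
    """
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    with beam.Pipeline(options=PipelineOptions(argv)) as p:
        (p
         | beam.Create(lines)
         | beam.FlatMap(extract_words)
         | beam.combiners.Count.PerElement()
         | beam.Map(print))
```

Calling run_wordcount(["to be or not to be"], runner_args()) should run the pipeline on the FnApiRunner, and runner_args("SwitchingDirectRunner") should select the runner mentioned in the Feb 13 message, assuming a Beam version where that name is still recognized.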

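As a quick sanity check on the speedup quoted above, the arithmetic can be worked out directly. This uses only the two timings from the thread (12 minutes before, 50 seconds after); the variable names are mine.

```python
# Timings quoted in the thread for WordCount on the Shakespeare dataset.
before_seconds = 12 * 60   # 12 minutes on the old DirectRunner
after_seconds = 50         # 50 seconds on the FnApiRunner

speedup = before_seconds / after_seconds
print(f"{speedup:.1f}x")   # prints "14.4x"
```

So the measured ratio is about 14.4x, which the original message rounds to 15x.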