Re: Brief of interactive Beam

2019-08-14 Thread Pablo Estrada
Hi Ning!
Thanks for the design doc and the explanations.

I think I can grasp some of the concepts, but it is not yet 100% clear to
me why it's necessary to define a new abstraction to have interactivity.
Could you elaborate? Perhaps as a section in the  doc? : )

A lot of the motivation for this doc seems related to how we decide which
PCollections to cache - so as to avoid rerunning parts of a pipeline
whenever a user decides to visualize specific parts. I think that makes
sense (and probably helps to have interactivity on streaming).

I agree that it's a little odd that InteractiveRunner receives an
underlying runner. That certainly suggests that the functionality is
orthogonal.

So, in short: I think my feedback is similar to others: Can you justify
further (or reconsider) why pipeline creation and execution need to be
different?

I can see what's the need for the watch. Can you also tell us more about
how a user would use visualize? Do they pass the kind of plot to have?

Thanks!
-P.

On Wed, Aug 14, 2019 at 12:03 PM Ning Kang  wrote:

> Q1:
> The document is shared (
> https://docs.google.com/document/d/1DYWrT6GL_qDCXhRMoxpjinlVAfHeVilK5Mtf8gO6zxQ/edit?usp=sharing).
> If inside Google, short link (go/ibeam-external
> ). I cannot share internal
> documents, but you can reach out if you need internal engineering plan.
>
> Q2:
> Yes, watch() is optimization used for using visualization() and building
> further on the pipeline. And the user doesn't need to call it if they
> simply define the pipeline in the notebook.
>
> Q3 and Q4:
> I'm only focusing on direct runner as underlying runner. We'll get rid of
> many of existing interactive Beam implementation. We can't provide
> portability for interactivity. Users can run the pipeline with other
> runners though due to the pipeline portability.
> Our work is to reduce the new concepts a user needs to know when they want
> to run interactive Beam. The implementation could be arbitrarily
> complicated and open sourced though. Currently, the interactive runner
> looks like as if it's supporting all kinds of underlying runners. We want
> to rid of it too.
>
> On 2019/08/08 00:01:06, Ahmet Altay  wrote:
> > Ning, thank you for the heads up.
> >
> > All, this is a proposed work for improving interactive Beam experience.
> As
> > mentioned in Ning's email, new concepts are being introduced. And in
> > addition iBeam as a name is used as a new reference. I hope that bringing
> > the discussion to the mailing list will give it the additional
> > visibility and more people could share their feedback.
> >
> > (cc'ing a few folks that might be interested +Robert Bradshaw
> >  +Valentyn Tymofieiev  +Sindy
> Li
> >  +Brian Hulette  )
> >
> > Ahmet
> >
> >
> > On Wed, Aug 7, 2019 at 12:36 PM Ning Kang  wrote:
> >
> > > To whom may concern,
> > >
> > > This is Ning from Google. We are currently making efforts to leverage
> an
> > > interactive runner under python beam sdk.
> > >
> > > There is already an interactive Beam (iBeam for short) runner with
> jupyter
> > > notebook in the repo
> > > <
> https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive
> >
> > > .
> > > Following the instructions on that page, one can set up an interactive
> > > environment to develop and execute Beam pipeline interactively.
> > >
> > > However, there are many issues with existing iBeam. One issue is that
> it
> > > uses a concept of leaf PCollection to cache and materialize
> intermediate
> > > PCollection. If the user wants to reuse/introspect a non-leaf
> PCollection,
> > > the interactive runner will run into errors.
> > >
> > > Our initial effort will be fixing the existing issues. And we also
> want to
> > > make iBeam easy to use. Since iBeam uses the same model Beam uses,
> there
> > > isn't really any difference for users between creating a pipeline with
> > > interactive runner and other runners.
> > > So we want to minimize the interfaces a user needs to learn while
> giving
> > > the user some capability to interact with the interactive environment.
> > >
> > > See this initial PR , the
> > > interactive_beam module will provide mainly 4 interfaces:
> > >
> > >- For advanced users who define pipeline outside __main__, let them
> > >tell current interactive environment where they define their
> pipeline:
> > >watch()
> > >   - This is very useful for tests where pipeline can be defined in
> > >   test methods.
> > >   - If the user simply creates pipeline in a Jupyter notebook or a
> > >   plain Python script, they don't have to know/use this feature at
> all.
> > >- Let users create an interactive pipeline: create_pipeline()
> > >   - invoking create_pipeline(), the user gets a Pipeline object
> that
> > >   works as any other Pipeline object created from
> apache_beam.Pipeline()
> > >   - However, the pipeline object p, 

Re: [VOTE] Release 2.15.0, release candidate #1

2019-08-14 Thread Thomas Weise
The release announcement blog should be part of the review as well.


On Wed, Aug 14, 2019 at 2:44 PM Chamikara Jayalath 
wrote:

> Thanks Yifan.
>
> FYI PR was merged: https://github.com/apache/beam/pull/9342
>
>
>
> On Wed, Aug 14, 2019 at 2:32 PM Yifan Zou  wrote:
>
>> Sure. Would you please merge it to release-2.15.0, and I'll rebuild
>> artifacts and share the RC2?
>>
>> On Wed, Aug 14, 2019 at 1:53 PM Chamikara Jayalath 
>> wrote:
>>
>>> Can we rebuild to include the fix for
>>> https://issues.apache.org/jira/browse/BEAM-7866 which was merged
>>> yesterday ? (looks like fix version was changed to 2.16.0)
>>>
>>> Thanks,
>>> Cham
>>>
>>> On Wed, Aug 14, 2019 at 1:42 PM Yifan Zou  wrote:
>>>
 Hi everyone,
 Please review and vote on the release candidate #1 for the version
 2.15.0, as follows:
 [ ] +1, Approve the release
 [ ] -1, Do not approve the release (please provide specific comments)


 The complete staging area is available for your review, which includes:
 * JIRA release notes [1].
 * The official Apache source release to be deployed to dist.apache.org
 [2], which is signed with the key with fingerprint
 AC9DB4F14CC90F37080F2C5B6D18F9A7F8DA25E1 [3].
 * All artifacts to be deployed to the Maven Central Repository [4].
 * Source code tag "v2.15.0-RC1" [5].
 * Website pull request listing the release [6], publishing the API
 reference manual [7].
 * Python artifacts are deployed along with the source release to the
 dist.apache.org [2].
 * Validation sheet with a tab for 2.15.0 release to help with
 validation [8].

 The vote will be open for at least 72 hours. It is adopted by majority
 approval, with at least 3 PMC affirmative votes.

 Thanks,
 Yifan Zou

 [1]
 https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527=12345489
 [2] https://dist.apache.org/repos/dist/dev/beam/2.15.0/
 [3] https://dist.apache.org/repos/dist/release/beam/KEYS
 [4]
 https://repository.apache.org/content/repositories/orgapachebeam-1081/
 [5] https://github.com/apache/beam/tree/v2.15.0-RC1
 [6] https://github.com/apache/beam/pull/9341
 [7] https://github.com/apache/beam-site/pull/592
 [8]
 https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit?pli=1#gid=1036192804

>>>


Re: [VOTE] Release 2.15.0, release candidate #1

2019-08-14 Thread Chamikara Jayalath
Thanks Yifan.

FYI PR was merged: https://github.com/apache/beam/pull/9342



On Wed, Aug 14, 2019 at 2:32 PM Yifan Zou  wrote:

> Sure. Would you please merge it to release-2.15.0, and I'll rebuild
> artifacts and share the RC2?
>
> On Wed, Aug 14, 2019 at 1:53 PM Chamikara Jayalath 
> wrote:
>
>> Can we rebuild to include the fix for
>> https://issues.apache.org/jira/browse/BEAM-7866 which was merged
>> yesterday ? (looks like fix version was changed to 2.16.0)
>>
>> Thanks,
>> Cham
>>
>> On Wed, Aug 14, 2019 at 1:42 PM Yifan Zou  wrote:
>>
>>> Hi everyone,
>>> Please review and vote on the release candidate #1 for the version
>>> 2.15.0, as follows:
>>> [ ] +1, Approve the release
>>> [ ] -1, Do not approve the release (please provide specific comments)
>>>
>>>
>>> The complete staging area is available for your review, which includes:
>>> * JIRA release notes [1].
>>> * The official Apache source release to be deployed to dist.apache.org
>>> [2], which is signed with the key with fingerprint
>>> AC9DB4F14CC90F37080F2C5B6D18F9A7F8DA25E1 [3].
>>> * All artifacts to be deployed to the Maven Central Repository [4].
>>> * Source code tag "v2.15.0-RC1" [5].
>>> * Website pull request listing the release [6], publishing the API
>>> reference manual [7].
>>> * Python artifacts are deployed along with the source release to the
>>> dist.apache.org [2].
>>> * Validation sheet with a tab for 2.15.0 release to help with validation
>>> [8].
>>>
>>> The vote will be open for at least 72 hours. It is adopted by majority
>>> approval, with at least 3 PMC affirmative votes.
>>>
>>> Thanks,
>>> Yifan Zou
>>>
>>> [1]
>>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527=12345489
>>> [2] https://dist.apache.org/repos/dist/dev/beam/2.15.0/
>>> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>>> [4]
>>> https://repository.apache.org/content/repositories/orgapachebeam-1081/
>>> [5] https://github.com/apache/beam/tree/v2.15.0-RC1
>>> [6] https://github.com/apache/beam/pull/9341
>>> [7] https://github.com/apache/beam-site/pull/592
>>> [8]
>>> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit?pli=1#gid=1036192804
>>>
>>


Re: [VOTE] Release 2.15.0, release candidate #1

2019-08-14 Thread Yifan Zou
Sure. Would you please merge it to release-2.15.0, and I'll rebuild
artifacts and share the RC2?

On Wed, Aug 14, 2019 at 1:53 PM Chamikara Jayalath 
wrote:

> Can we rebuild to include the fix for
> https://issues.apache.org/jira/browse/BEAM-7866 which was merged
> yesterday ? (looks like fix version was changed to 2.16.0)
>
> Thanks,
> Cham
>
> On Wed, Aug 14, 2019 at 1:42 PM Yifan Zou  wrote:
>
>> Hi everyone,
>> Please review and vote on the release candidate #1 for the version
>> 2.15.0, as follows:
>> [ ] +1, Approve the release
>> [ ] -1, Do not approve the release (please provide specific comments)
>>
>>
>> The complete staging area is available for your review, which includes:
>> * JIRA release notes [1].
>> * The official Apache source release to be deployed to dist.apache.org
>> [2], which is signed with the key with fingerprint
>> AC9DB4F14CC90F37080F2C5B6D18F9A7F8DA25E1 [3].
>> * All artifacts to be deployed to the Maven Central Repository [4].
>> * Source code tag "v2.15.0-RC1" [5].
>> * Website pull request listing the release [6], publishing the API
>> reference manual [7].
>> * Python artifacts are deployed along with the source release to the
>> dist.apache.org [2].
>> * Validation sheet with a tab for 2.15.0 release to help with validation
>> [8].
>>
>> The vote will be open for at least 72 hours. It is adopted by majority
>> approval, with at least 3 PMC affirmative votes.
>>
>> Thanks,
>> Yifan Zou
>>
>> [1]
>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527=12345489
>> [2] https://dist.apache.org/repos/dist/dev/beam/2.15.0/
>> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>> [4]
>> https://repository.apache.org/content/repositories/orgapachebeam-1081/
>> [5] https://github.com/apache/beam/tree/v2.15.0-RC1
>> [6] https://github.com/apache/beam/pull/9341
>> [7] https://github.com/apache/beam-site/pull/592
>> [8]
>> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit?pli=1#gid=1036192804
>>
>


Re: [VOTE] Release 2.15.0, release candidate #1

2019-08-14 Thread Chamikara Jayalath
Can we rebuild to include the fix for
https://issues.apache.org/jira/browse/BEAM-7866 which was merged yesterday
? (looks like fix version was changed to 2.16.0)

Thanks,
Cham

On Wed, Aug 14, 2019 at 1:42 PM Yifan Zou  wrote:

> Hi everyone,
> Please review and vote on the release candidate #1 for the version 2.15.0,
> as follows:
> [ ] +1, Approve the release
> [ ] -1, Do not approve the release (please provide specific comments)
>
>
> The complete staging area is available for your review, which includes:
> * JIRA release notes [1].
> * The official Apache source release to be deployed to dist.apache.org
> [2], which is signed with the key with fingerprint
> AC9DB4F14CC90F37080F2C5B6D18F9A7F8DA25E1 [3].
> * All artifacts to be deployed to the Maven Central Repository [4].
> * Source code tag "v2.15.0-RC1" [5].
> * Website pull request listing the release [6], publishing the API
> reference manual [7].
> * Python artifacts are deployed along with the source release to the
> dist.apache.org [2].
> * Validation sheet with a tab for 2.15.0 release to help with validation
> [8].
>
> The vote will be open for at least 72 hours. It is adopted by majority
> approval, with at least 3 PMC affirmative votes.
>
> Thanks,
> Yifan Zou
>
> [1]
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527=12345489
> [2] https://dist.apache.org/repos/dist/dev/beam/2.15.0/
> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> [4] https://repository.apache.org/content/repositories/orgapachebeam-1081/
>
> [5] https://github.com/apache/beam/tree/v2.15.0-RC1
> [6] https://github.com/apache/beam/pull/9341
> [7] https://github.com/apache/beam-site/pull/592
> [8]
> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit?pli=1#gid=1036192804
>


[VOTE] Release 2.15.0, release candidate #1

2019-08-14 Thread Yifan Zou
Hi everyone,
Please review and vote on the release candidate #1 for the version 2.15.0,
as follows:
[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)


The complete staging area is available for your review, which includes:
* JIRA release notes [1].
* The official Apache source release to be deployed to dist.apache.org [2],
which is signed with the key with fingerprint
AC9DB4F14CC90F37080F2C5B6D18F9A7F8DA25E1 [3].
* All artifacts to be deployed to the Maven Central Repository [4].
* Source code tag "v2.15.0-RC1" [5].
* Website pull request listing the release [6], publishing the API
reference manual [7].
* Python artifacts are deployed along with the source release to the
dist.apache.org [2].
* Validation sheet with a tab for 2.15.0 release to help with validation
[8].

The vote will be open for at least 72 hours. It is adopted by majority
approval, with at least 3 PMC affirmative votes.

Thanks,
Yifan Zou

[1]
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527=12345489
[2] https://dist.apache.org/repos/dist/dev/beam/2.15.0/
[3] https://dist.apache.org/repos/dist/release/beam/KEYS
[4] https://repository.apache.org/content/repositories/orgapachebeam-1081/
[5] https://github.com/apache/beam/tree/v2.15.0-RC1
[6] https://github.com/apache/beam/pull/9341
[7] https://github.com/apache/beam-site/pull/592
[8]
https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit?pli=1#gid=1036192804


Re: Brief of interactive Beam

2019-08-14 Thread Ning Kang
Q1:
The document is shared 
(https://docs.google.com/document/d/1DYWrT6GL_qDCXhRMoxpjinlVAfHeVilK5Mtf8gO6zxQ/edit?usp=sharing).
 If inside Google, short link (go/ibeam-external). I cannot share internal 
documents, but you can reach out if you need internal engineering plan.

Q2:
Yes, watch() is optimization used for using visualization() and building 
further on the pipeline. And the user doesn't need to call it if they simply 
define the pipeline in the notebook.

Q3 and Q4:
I'm only focusing on direct runner as underlying runner. We'll get rid of many 
of existing interactive Beam implementation. We can't provide portability for 
interactivity. Users can run the pipeline with other runners though due to the 
pipeline portability.
Our work is to reduce the new concepts a user needs to know when they want to 
run interactive Beam. The implementation could be arbitrarily complicated and 
open sourced though. Currently, the interactive runner looks like as if it's 
supporting all kinds of underlying runners. We want to rid of it too.

On 2019/08/08 00:01:06, Ahmet Altay  wrote: 
> Ning, thank you for the heads up.
> 
> All, this is a proposed work for improving interactive Beam experience. As
> mentioned in Ning's email, new concepts are being introduced. And in
> addition iBeam as a name is used as a new reference. I hope that bringing
> the discussion to the mailing list will give it the additional
> visibility and more people could share their feedback.
> 
> (cc'ing a few folks that might be interested +Robert Bradshaw
>  +Valentyn Tymofieiev  +Sindy Li
>  +Brian Hulette  )
> 
> Ahmet
> 
> 
> On Wed, Aug 7, 2019 at 12:36 PM Ning Kang  wrote:
> 
> > To whom may concern,
> >
> > This is Ning from Google. We are currently making efforts to leverage an
> > interactive runner under python beam sdk.
> >
> > There is already an interactive Beam (iBeam for short) runner with jupyter
> > notebook in the repo
> > 
> > .
> > Following the instructions on that page, one can set up an interactive
> > environment to develop and execute Beam pipeline interactively.
> >
> > However, there are many issues with existing iBeam. One issue is that it
> > uses a concept of leaf PCollection to cache and materialize intermediate
> > PCollection. If the user wants to reuse/introspect a non-leaf PCollection,
> > the interactive runner will run into errors.
> >
> > Our initial effort will be fixing the existing issues. And we also want to
> > make iBeam easy to use. Since iBeam uses the same model Beam uses, there
> > isn't really any difference for users between creating a pipeline with
> > interactive runner and other runners.
> > So we want to minimize the interfaces a user needs to learn while giving
> > the user some capability to interact with the interactive environment.
> >
> > See this initial PR , the
> > interactive_beam module will provide mainly 4 interfaces:
> >
> >- For advanced users who define pipeline outside __main__, let them
> >tell current interactive environment where they define their pipeline:
> >watch()
> >   - This is very useful for tests where pipeline can be defined in
> >   test methods.
> >   - If the user simply creates pipeline in a Jupyter notebook or a
> >   plain Python script, they don't have to know/use this feature at all.
> >- Let users create an interactive pipeline: create_pipeline()
> >   - invoking create_pipeline(), the user gets a Pipeline object that
> >   works as any other Pipeline object created from apache_beam.Pipeline()
> >   - However, the pipeline object p, when invoking p.run(), does some
> >   extra interactive magic.
> >   - We'll support interactive execution for DirectRunner at this
> >   moment.
> >- Let users run the interactive pipeline as a normal pipeline:
> >run_pipeline()
> >   - In an interactive environment, a user only needs to add and
> >   execute 1 line of code run_pipeline(pipeline) to execute any existing
> >   interactive pipeline object as normal pipeline in any selected 
> > platform.
> >   - We'll probably support Dataflow only. Other implementations can
> >   be added though.
> >- Let users introspect any intermediate PCollection they have handler
> >to: visualize()
> >   - If a user ever writes pcoll = p | "Some Transform" >>
> >   some_transform() ..., they can visualize(pcoll) once the pipeline p is
> >   executed.
> >   - p can be batch or streaming
> >   - The visualization will be some plot graph of data for the given
> >   PCollection as if it's materialized. If the PCollection is unbounded, 
> > the
> >   graph is dynamic.
> >
> > The PR will implement 1 and 2.
> >
> > We'll use https://issues.apache.org/jira/browse/BEAM-7923 as the top
> > level JIRA and add blocking JIRAs as development goes.
> 

Re: Brief of interactive Beam

2019-08-14 Thread Ning Kang
Ahmet, thanks for forwarding!


> My main concern at this point is the introduction of new concepts, even
> though these are not changing other parts of the Beam SDKs. It would be
> good to see at least an alternative option covered in the design document.
> The reason is each additional concept adds to the mental load of users. And
> also concepts from interactive Beam will shift user's expectations of Beam
> even though there are not direct SDK modifications.


Hi Robert. About the concern, I think I have a few points:

   1. *Interactive Beam (or Interactive Runner) is already an existing "new
   concept" that normal Beam user could opt-in if they want an interactive
   Beam experience.* They need to do lots of setup steps and learn new
   things such as Jupyter notebook and at least interactive_runner module to
   make it work and make use of it.
   2. *The behavior of existing interactive Beam is different from normal
   Beam because of the interactive nature and the users would expect that.* And
   the users wouldn't shift their expectation of normal Beam. Just like
   running Python scripts might result in different behavior than running all
   of them in an interactive Python session. Or if a user runs a Beam pipeline
   with direct runner, they should expect the behavior be different from
   running it on Dataflow while a user needs GCP account. I think the users
   are aware of the difference when they choose to use Interactive Beam.
   3. *Our design actually reduces the mental load of interactive Beam
   users with intuitive interactive features*: create pipeline, visualize
   intermediate PCollection, run pipeline at some point with other runners and
   etc. For example, right now, the user needs to use a more complicated set
   of libraries, like creating a Beam pipeline with interactive runner that
   needs an underlying runner fed in.  We are getting rid of it. An
   interactive Beam user shouldn't be concerned about the underlying
   interactive magic. The interactive experience should be tailored for
   different underlying runners. There is no portability of interactivity and
   users opt-in interactive Beam using notebook would naturally expect
   something similar to the direct runner.
   4. *When users run pipeline built from interactive runner in a
   non-interactive environment, it's direct runner like any other Beam
   tutorial demonstrates*. It's even easier because the user doesn't need
   to specify the runner nor pass in options.
   5. *Interactive Beam is solving an orthogonal set of problems than Beam*.
   You can think of it as a wrapper of Beam that enables interactivity and
   it's not even a real runner. It doesn't change the Beam model such as how
   you build a pipeline. And with the Beam portability, you get the capability
   to run the pipeline built from interactive runner with other runners for
   free. It adds the interactive behavior that a user expects.
   6. *We want to open source it though we can iterate faster without doing
   it*. The whole project can be encapsulated in a completely irrelevant
   repository and from a developer's perspective, I want to hide all the
   implementation details from the interactive Beam user. However, as there is
   more and more desire for interactive Beam (+Mehran Nazir
for more details), we want to share the
   implementation with others who want to contribute and explore the
   interactive world.

Also +David Yan   for more opinions.

Thanks!

Ning.

On Tue, Aug 13, 2019 at 6:00 PM Ahmet Altay  wrote:

> Ning, I believe Robert's questions from his email has not been answered
> yet.
>
> On Tue, Aug 13, 2019 at 5:00 PM Ning Kang  wrote:
>
>> Hi all, I'll leave another 3 days for design
>> 
>>  review.
>> Then we can have a vote session if there is no objection.
>>
>> Thanks!
>>
>> On Fri, Aug 9, 2019 at 12:14 PM Ning Kang  wrote:
>>
>>> Thanks Ahmet for the introduction!
>>>
>>> I've composed a design overview
>>> 
>>> describing changes we are making to components around interactive runner.
>>> I'll share the document in our email thread too.
>>>
>>> The truth is since interactive runner is not yet a recognized runner as
>>> part of the Beam SDK (and it's fundamentally a wrapper around direct
>>> runner), we are not touching any Beam SDK components.
>>> We'll not change any behavior of existing Beam SDK and we'll try our
>>> best to keep it that way in the future.
>>>
>>
> My main concern at this point is the introduction of new concepts, even
> though these are not changing other parts of the Beam SDKs. It would be
> good to see at least an alternative option covered in the design document.
> The reason is each additional concept adds to the mental load of users. And
> also concepts from interactive Beam will shift user's 

Re: Write-through-cache in State logic

2019-08-14 Thread Maximilian Michels
For the purpose of my own understanding of the matter, I've created a
document:
https://docs.google.com/document/d/1ltVqIW0XxUXI6grp17TgeyIybk3-nDF8a0-Nqw-s9mY/


It could make sense to clarify and specify things in there for now. I'm
more than willing to consolidate this document with the caching section
in the Fn API document.

-Max

On 14.08.19 17:13, Lukasz Cwik wrote:
> Instead of starting a new doc, could we add/update the caching segment
> of https://s.apache.org/beam-fn-state-api-and-bundle-processing?
> 
> Everyone has comment access and all Apache Beam PMC can add themselves
> to be editors since the doc is owned by the Apache Beam PMC gmail acocunt.
> 
> On Wed, Aug 14, 2019 at 7:01 AM Maximilian Michels  > wrote:
> 
> Yes, that makes sense. What do you think about creating a document to
> summarize the ideas presented here? Also, it would be good to capture
> the status quo regarding caching in the Python SDK.
> 
> -Max
> 
> On 13.08.19 22:44, Thomas Weise wrote:
> > The token would be needed in general to invalidate the cache when
> > bundles are processed by different workers.
> >
> > In the case of the Flink runner we don't have a scenario of SDK worker
> > surviving the runner in the case of a failure, so there is no
> > possibility of inconsistent state as result of a checkpoint failure.
> >
> > --
> > sent from mobile
> >
> > On Tue, Aug 13, 2019, 1:18 PM Maximilian Michels  
> > >> wrote:
> >
> >     Thanks for clarifying. Cache-invalidation for side inputs
> makes sense.
> >
> >     In case the Runner fails to checkpoint, could it not
> re-attempt the
> >     checkpoint? At least in the case of Flink, the cache would
> still be
> >     valid until another checkpoint is attempted. For other Runners
> that may
> >     not be the case. Also, rolling back state while keeping the
> SDK Harness
> >     running requires to invalidate the cache.
> >
> >     -Max
> >
> >     On 13.08.19 18:09, Lukasz Cwik wrote:
> >     >
> >     >
> >     > On Tue, Aug 13, 2019 at 4:36 AM Maximilian Michels
> mailto:m...@apache.org>
> >     >
> >     > 
>  >     >
> >     >     Agree that we have to be able to flush before a
> checkpoint to
> >     avoid
> >     >     caching too many elements. Also good point about
> checkpoint costs
> >     >     increasing with flushing the cache on checkpoints. A LRU
> cache
> >     policy in
> >     >     the SDK seems desirable.
> >     >
> >     >     What is the role of the cache token in the design
> document[1]?
> >     It looks
> >     >     to me that the token is used to give the Runner control over
> >     which and
> >     >     how many elements can be cached by the SDK. Why is that
> necessary?
> >     >     Shouldn't this be up to the SDK?
> >     >
> >     >  
> >     > We want to be able to handle the case where the SDK
> completes the
> >     bundle
> >     > successfully but the runner fails to checkpoint the information.
> >     > We also want the runner to be able to pass in cache tokens
> for things
> >     > like side inputs which may change over time (and the SDK
> would not
> >     know
> >     > that this happened).
> >     >  
> >     >
> >     >     -Max
> >     >
> >     >     [1]
> >     >   
> >   
>   
> https://docs.google.com/document/d/1BOozW0bzBuz4oHJEuZNDOHdzaV5Y56ix58Ozrqm2jFg/edit#heading=h.7ghoih5aig5m
> >     >
> >     >     Is it simply to
> >     >     On 12.08.19 19:55, Lukasz Cwik wrote:
> >     >     >
> >     >     >
> >     >     > On Mon, Aug 12, 2019 at 10:09 AM Thomas Weise
> >     mailto:t...@apache.org>  >
> >     >     
> >>
> >     >     > 
> >
> >     
>  wrote:
> >     >     >
> >     >     >
> >     >     >     On Mon, Aug 12, 2019 at 8:53 AM Maximilian Michels
> >     >     mailto:m...@apache.org>
> >
> 
> >     >>
> >     >     >     
> 

Re: Write-through-cache in State logic

2019-08-14 Thread Maximilian Michels
Yes, that makes sense. What do you think about creating a document to
summarize the ideas presented here? Also, it would be good to capture
the status quo regarding caching in the Python SDK.

-Max

On 13.08.19 22:44, Thomas Weise wrote:
> The token would be needed in general to invalidate the cache when
> bundles are processed by different workers.
> 
> In the case of the Flink runner we don't have a scenario of SDK worker
> surviving the runner in the case of a failure, so there is no
> possibility of inconsistent state as result of a checkpoint failure.
> 
> --
> sent from mobile
> 
> On Tue, Aug 13, 2019, 1:18 PM Maximilian Michels  > wrote:
> 
> Thanks for clarifying. Cache-invalidation for side inputs makes sense.
> 
> In case the Runner fails to checkpoint, could it not re-attempt the
> checkpoint? At least in the case of Flink, the cache would still be
> valid until another checkpoint is attempted. For other Runners that may
> not be the case. Also, rolling back state while keeping the SDK Harness
> running requires to invalidate the cache.
> 
> -Max
> 
> On 13.08.19 18:09, Lukasz Cwik wrote:
> >
> >
> > On Tue, Aug 13, 2019 at 4:36 AM Maximilian Michels  
> > >> wrote:
> >
> >     Agree that we have to be able to flush before a checkpoint to
> avoid
> >     caching too many elements. Also good point about checkpoint costs
> >     increasing with flushing the cache on checkpoints. A LRU cache
> policy in
> >     the SDK seems desirable.
> >
> >     What is the role of the cache token in the design document[1]?
> It looks
> >     to me that the token is used to give the Runner control over
> which and
> >     how many elements can be cached by the SDK. Why is that necessary?
> >     Shouldn't this be up to the SDK?
> >
> >  
> > We want to be able to handle the case where the SDK completes the
> bundle
> > successfully but the runner fails to checkpoint the information.
> > We also want the runner to be able to pass in cache tokens for things
> > like side inputs which may change over time (and the SDK would not
> know
> > that this happened).
> >  
> >
> >     -Max
> >
> >     [1]
> >   
>  
> https://docs.google.com/document/d/1BOozW0bzBuz4oHJEuZNDOHdzaV5Y56ix58Ozrqm2jFg/edit#heading=h.7ghoih5aig5m
> >
> >     Is it simply to
> >     On 12.08.19 19:55, Lukasz Cwik wrote:
> >     >
> >     >
> >     > On Mon, Aug 12, 2019 at 10:09 AM Thomas Weise
> mailto:t...@apache.org>
> >     >
> >     > 
>  >     >
> >     >
> >     >     On Mon, Aug 12, 2019 at 8:53 AM Maximilian Michels
> >     mailto:m...@apache.org>  >
> >     >     
>  >     >
> >     >         Thanks for starting this discussion Rakesh. An
> efficient cache
> >     >         layer is
> >     >         one of the missing pieces for good performance in
> stateful
> >     >         pipelines.
> >     >         The good news are that there is a level of caching
> already
> >     >         present in
> >     >         Python which batches append requests until the bundle is
> >     finished.
> >     >
> >     >         Thomas, in your example indeed we would have to
> profile to see
> >     >         why CPU
> >     >         utilization is high on the Flink side but not in the
> >     Python SDK
> >     >         harness.
> >     >         For example, older versions of Flink (<=1.5) have a high
> >     cost of
> >     >         deleting existing instances of a timer when setting
> a timer.
> >     >         Nevertheless, cross-bundle caching would likely
> result in
> >     increased
> >     >         performance.
> >     >
> >     >
> >     >     CPU on the Flink side was unchanged, and that's
> important. The
> >     >     throughout improvement comes from the extended bundle
> caching
> >     on the
> >     >     SDK side. That's what tells me that cross-bundle caching is
> >     needed.
> >     >     Of course, it will require a good solution for the write
> also
> >     and I
> >     >     like your idea of using the checkpoint boundary for that,
> >     especially
> >     >     since that already aligns with the bundle boundary and
> is under
> >     >     runner control. Of course we also want to be careful to
> not 

Re: jira access

2019-08-14 Thread Sebastian Jambor
Thanks Max!

On Wed, Aug 14, 2019 at 12:47 PM Maximilian Michels  wrote:

> Hi Sebastian,
>
> Welcome! I've added you as a contributor in JIRA.
>
> Cheers,
> Max
>
> On 14.08.19 11:54, Sebastian Jambor wrote:
> > Hi,
> >
> > I'm Sebastian, working for Trifacta. We use Beam for dataflow jobs in
> > Cloud Dataprep. Could someone add me as a contributor to jira? My jira
> > id is sgrj.
> >
> > Thanks,
> > Sebastian
>


Re: jira access

2019-08-14 Thread Maximilian Michels
Hi Sebastian,

Welcome! I've added you as a contributor in JIRA.

Cheers,
Max

On 14.08.19 11:54, Sebastian Jambor wrote:
> Hi,
> 
> I'm Sebastian, working for Trifacta. We use Beam for dataflow jobs in
> Cloud Dataprep. Could someone add me as a contributor to jira? My jira
> id is sgrj.
> 
> Thanks,
> Sebastian


jira access

2019-08-14 Thread Sebastian Jambor
Hi,

I'm Sebastian, working for Trifacta. We use Beam for dataflow jobs in Cloud
Dataprep. Could someone add me as a contributor to jira? My jira id is sgrj.

Thanks,
Sebastian