[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-10-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=323782&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-323782
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 04/Oct/19 22:38
Start Date: 04/Oct/19 22:38
Worklog Time Spent: 10m 
  Work Description: aaltay commented on pull request #9619: [BEAM-7760] 
Added pipeline_instrument module
URL: https://github.com/apache/beam/pull/9619
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 323782)
Time Spent: 18h  (was: 17h 50m)

> Interactive Beam Caching PCollections bound to user defined vars in notebook
> 
>
> Key: BEAM-7760
> URL: https://issues.apache.org/jira/browse/BEAM-7760
> Project: Beam
>  Issue Type: New Feature
>  Components: runner-py-interactive
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time Spent: 18h
>  Remaining Estimate: 0h
>
> Cache only PCollections bound to user-defined variables in a pipeline when
> running the pipeline with the interactive runner in Jupyter notebooks.
> [Interactive Beam|https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive]
> has been caching, and reusing the caches of, "leaf" PCollections for
> interactive execution in Jupyter notebooks.
> Interactive execution is supported so that when new transforms are appended
> to an existing pipeline for a new run, the already-executed part of the
> pipeline doesn't need to be re-executed.
> A PCollection is a "leaf" when it is never used as input to any PTransform in
> the pipeline.
> The problem with building the caches and the pipeline to execute around
> "leaf" PCollections is that when a PCollection is consumed by a sink with no
> output, the pipeline built for execution misses the subgraph that generates
> and consumes that PCollection.
> For example, "ReadFromPubSub --> WriteToPubSub" results in an empty
> pipeline.
> Caching around PCollections bound to user-defined variables, and replacing
> transforms with cache sources and sinks, could resolve the pipeline to
> execute properly in the interactive execution scenario. In addition, a cached
> PCollection can now be traced back to user code and used for data
> visualization if the user wants it.
> E.g.,
> {code:python}
> # ...
> p = beam.Pipeline(interactive_runner.InteractiveRunner(),
>                   options=pipeline_options)
> messages = p | "Read" >> beam.io.ReadFromPubSub(subscription='...')
> messages | "Write" >> beam.io.WriteToPubSub(topic_path)
> result = p.run()
> # ...
> visualize(messages){code}
> The interactive runner automatically figures out that the PCollection
> {code:python}
> messages{code}
> created by
> {code:python}
> p | "Read" >> beam.io.ReadFromPubSub(subscription='...'){code}
> should be cached and reused if the notebook user appends more transforms.
> Once the pipeline has been executed, the user can use any
> visualize(PCollection) module to visualize the data statically (batch) or
> dynamically (stream).
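
The difference between the two caching strategies in the issue description can be sketched with toy stand-ins instead of Beam classes. Everything below is an assumption made for illustration: `PColl`, `leaf_pcolls`, `user_bound_pcolls`, and the `notebook_vars` dict are hypothetical names, and the transform graph is a plain dict rather than a real pipeline. The sketch only shows why a "Read --> Write" pipeline has no "leaf" PCollection while it still has one PCollection bound to a user variable.

```python
class PColl(object):
    """Toy stand-in for a PCollection; it only carries a label."""

    def __init__(self, label):
        self.label = label


def leaf_pcolls(transforms):
    """Returns labels of PColls produced by a transform but consumed by none."""
    produced = set()
    consumed = set()
    for inputs, outputs in transforms.values():
        consumed.update(p.label for p in inputs)
        produced.update(p.label for p in outputs)
    return produced - consumed


def user_bound_pcolls(namespace):
    """Returns {variable name: PColl} for every PColl bound to a variable."""
    return {name: value for name, value in namespace.items()
            if isinstance(value, PColl)}


messages = PColl('messages')
# "Read" produces `messages`; "Write" is a sink that consumes it and
# produces no output, mirroring ReadFromPubSub --> WriteToPubSub.
transforms = {
    'Read': ([], [messages]),
    'Write': ([messages], []),
}
# Hypothetical view of the variables defined in the notebook.
notebook_vars = {'messages': messages, 'topic_path': 'hypothetical-topic'}

leaves = leaf_pcolls(transforms)
bound = user_bound_pcolls(notebook_vars)
print(sorted(leaves))  # prints []
print(sorted(bound))   # prints ['messages']
```

With the leaf-based strategy there is nothing to cache, so nothing to build a pipeline around; with the variable-based strategy `messages` is found even though a sink consumes it.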



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-10-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=323781&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-323781
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 04/Oct/19 22:37
Start Date: 04/Oct/19 22:37
Worklog Time Spent: 10m 
  Work Description: aaltay commented on issue #9619: [BEAM-7760] Added 
pipeline_instrument module
URL: https://github.com/apache/beam/pull/9619#issuecomment-538581949
 
 
   @KevinGG thank you for answering my questions. I think this is ready to merge.
   
   However, the documentation and concepts are complex. Please make an overall
pass to improve them and make it easier for new folks to contribute.
   
   (cc: @davidyan74 @rohdesamuel, other folks working in this area.)
 



Issue Time Tracking
---

Worklog Id: (was: 323781)
Time Spent: 17h 50m  (was: 17h 40m)

> Interactive Beam Caching PCollections bound to user defined vars in notebook
> 
>
> Key: BEAM-7760
> URL: https://issues.apache.org/jira/browse/BEAM-7760
> Project: Beam
>  Issue Type: New Feature
>  Components: runner-py-interactive
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time Spent: 17h 50m
>  Remaining Estimate: 0h
>





[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-10-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=323777&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-323777
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 04/Oct/19 22:35
Start Date: 04/Oct/19 22:35
Worklog Time Spent: 10m 
  Work Description: aaltay commented on pull request #9619: [BEAM-7760] 
Added pipeline_instrument module
URL: https://github.com/apache/beam/pull/9619#discussion_r331707490
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/pipeline_instrument.py
 ##
 @@ -0,0 +1,470 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module to instrument interactivity to the given pipeline.
+
+For internal use only; no backwards-compatibility guarantees.
+This module accesses current interactive environment and analyzes given pipeline
+to transform original pipeline into a one-shot pipeline with interactivity.
+"""
+from __future__ import absolute_import
+
+import logging
+
+import apache_beam as beam
+from apache_beam.pipeline import PipelineVisitor
+from apache_beam.runners.interactive import cache_manager as cache
+from apache_beam.runners.interactive import interactive_environment as ie
+
+READ_CACHE = "_ReadCache_"
+WRITE_CACHE = "_WriteCache_"
+
+
+class PipelineInstrument(object):
+  """A pipeline instrument for pipeline to be executed by interactive runner.
+
+  This module should never depend on underlying runner that interactive runner
+  delegates. It instruments the original instance of pipeline directly by
+  appending or replacing transforms with help of cache. It provides
+  interfaces to recover states of original pipeline. It's the interactive
+  runner's responsibility to coordinate supported underlying runners to run
+  the pipeline instrumented and recover the original pipeline states if needed.
+  """
+
+  def __init__(self, pipeline, options=None):
+    self._pipeline = pipeline
+    # The cache manager should be initiated outside of this module and outside
+    # of run_pipeline() from interactive runner so that its lifespan could cover
+    # multiple runs in the interactive environment. Owned by
+    # interactive_environment module. Not owned by this module.
+    # TODO(BEAM-7760): change the scope of cache to be owned by runner or
+    # pipeline result instances because a pipeline is not 1:1 correlated to a
+    # running job. Only complete and read-only cache is valid across multiple
+    # jobs. Other cache instances should have their own scopes. Some design
+    # change should support only runner.run(pipeline) pattern rather than
+    # pipeline.run([runner]) and a runner can only run at most one pipeline at a
+    # time. Otherwise, result returned by run() is the only 1:1 anchor.
+    self._cache_manager = ie.current_env().cache_manager()
+
+    # Invoke a round trip through the runner API. This makes sure the Pipeline
+    # proto is stable. The snapshot of pipeline will not be mutated within this
+    # module and can be used to recover original pipeline if needed.
+    self._pipeline_snap = beam.pipeline.Pipeline.from_runner_api(
+        pipeline.to_runner_api(use_fake_coders=True),
+        pipeline.runner,
+        options)
+    # Snapshot of original pipeline information.
+    (self._original_pipeline_proto,
+     self._original_context) = self._pipeline_snap.to_runner_api(
+         return_context=True, use_fake_coders=True)
+
+    # All compute-once-against-original-pipeline fields.
+    self._has_unbounded_source = has_unbounded_source(self._pipeline_snap)
+    # TODO(BEAM-7760): once cache scope changed, this is not needed to manage
+    # relationships across pipelines, runners, and jobs.
+    self._pcolls_to_pcoll_id = pcolls_to_pcoll_id(self._pipeline_snap,
+                                                  self._original_context)
+
+    # A mapping from PCollection id to python id() value in user defined
+    # pipeline instance.
+    (self._pcoll_version_map,
+     self._cacheables) = cacheables(self.pcolls_to_pcoll_id())
+
+    # A dict from cache key to PCollection that is read from cache.
+    # If
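
The snapshot idea in the constructor comments (round-trip the pipeline through a serialized form, keep the result untouched, and instrument only the working instance) can be sketched without Beam. This is a toy under stated assumptions: a JSON round trip stands in for the runner API proto round trip, the pipeline is a plain dict, and `snapshot`, `instrument`, and the `_WriteCache_<pcoll>` key layout are hypothetical. Only the `_WriteCache_` prefix comes from the quoted module.

```python
import json

WRITE_CACHE = "_WriteCache_"


def snapshot(pipeline_dict):
    """Returns an independent copy by serializing and deserializing."""
    return json.loads(json.dumps(pipeline_dict))


def instrument(pipeline_dict, cacheable_pcolls):
    """Appends a hypothetical write-cache sink per cacheable PColl, in place."""
    for pcoll in cacheable_pcolls:
        pipeline_dict['transforms'][WRITE_CACHE + pcoll] = {
            'inputs': [pcoll], 'outputs': []}
    return pipeline_dict


pipeline = {
    'transforms': {
        'Read': {'inputs': [], 'outputs': ['messages']},
        'Write': {'inputs': ['messages'], 'outputs': []},
    }
}

# Take the snapshot first, then mutate only the working instance.
original = snapshot(pipeline)
instrument(pipeline, ['messages'])

print(sorted(pipeline['transforms']))
# prints ['Read', 'Write', '_WriteCache_messages']
print(sorted(original['transforms']))
# prints ['Read', 'Write']
```

Because the snapshot shares no objects with the working instance, the original transform graph can still be recovered after instrumentation.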

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-10-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=323779&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-323779
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 04/Oct/19 22:35
Start Date: 04/Oct/19 22:35
Worklog Time Spent: 10m 
  Work Description: aaltay commented on pull request #9619: [BEAM-7760] 
Added pipeline_instrument module
URL: https://github.com/apache/beam/pull/9619#discussion_r331707902
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/pipeline_instrument.py
 ##
 @@ -0,0 +1,470 @@

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-10-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=323780&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-323780
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 04/Oct/19 22:35
Start Date: 04/Oct/19 22:35
Worklog Time Spent: 10m 
  Work Description: aaltay commented on pull request #9619: [BEAM-7760] 
Added pipeline_instrument module
URL: https://github.com/apache/beam/pull/9619#discussion_r331708045
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/pipeline_instrument.py
 ##
 @@ -0,0 +1,470 @@

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-10-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=323778&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-323778
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 04/Oct/19 22:35
Start Date: 04/Oct/19 22:35
Worklog Time Spent: 10m 
  Work Description: aaltay commented on pull request #9619: [BEAM-7760] 
Added pipeline_instrument module
URL: https://github.com/apache/beam/pull/9619#discussion_r331707226
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/pipeline_instrument.py
 ##
 @@ -0,0 +1,470 @@

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-10-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=321595&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-321595
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 01/Oct/19 22:24
Start Date: 01/Oct/19 22:24
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9619: [BEAM-7760] 
Added pipeline_instrument module
URL: https://github.com/apache/beam/pull/9619#discussion_r330247591
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/pipeline_instrument.py
 ##
 @@ -0,0 +1,470 @@

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-10-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=321593&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-321593
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 01/Oct/19 22:24
Start Date: 01/Oct/19 22:24
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9619: [BEAM-7760] 
Added pipeline_instrument module
URL: https://github.com/apache/beam/pull/9619#discussion_r330247591
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/pipeline_instrument.py
 ##
 @@ -0,0 +1,470 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module to instrument interactivity to the given pipeline.
+
+For internal use only; no backwards-compatibility guarantees.
+This module accesses current interactive environment and analyzes given pipeline
+to transform original pipeline into a one-shot pipeline with interactivity.
+"""
+from __future__ import absolute_import
+
+import logging
+
+import apache_beam as beam
+from apache_beam.pipeline import PipelineVisitor
+from apache_beam.runners.interactive import cache_manager as cache
+from apache_beam.runners.interactive import interactive_environment as ie
+
+READ_CACHE = "_ReadCache_"
+WRITE_CACHE = "_WriteCache_"
+
+
+class PipelineInstrument(object):
+  """A pipeline instrument for pipeline to be executed by interactive runner.
+
+  This module should never depend on underlying runner that interactive runner
+  delegates. It instruments the original instance of pipeline directly by
+  appending or replacing transforms with help of cache. It provides
+  interfaces to recover states of original pipeline. It's the interactive
+  runner's responsibility to coordinate supported underlying runners to run
+  the pipeline instrumented and recover the original pipeline states if needed.
+  """
+
+  def __init__(self, pipeline, options=None):
+    self._pipeline = pipeline
+    # The cache manager should be initiated outside of this module and outside
+    # of run_pipeline() from interactive runner so that its lifespan could cover
+    # multiple runs in the interactive environment. Owned by
+    # interactive_environment module. Not owned by this module.
+    # TODO(BEAM-7760): change the scope of cache to be owned by runner or
+    # pipeline result instances because a pipeline is not 1:1 correlated to a
+    # running job. Only complete and read-only cache is valid across multiple
+    # jobs. Other cache instances should have their own scopes. Some design
+    # change should support only runner.run(pipeline) pattern rather than
+    # pipeline.run([runner]) and a runner can only run at most one pipeline at a
+    # time. Otherwise, result returned by run() is the only 1:1 anchor.
+    self._cache_manager = ie.current_env().cache_manager()
+
+    # Invoke a round trip through the runner API. This makes sure the Pipeline
+    # proto is stable. The snapshot of pipeline will not be mutated within this
+    # module and can be used to recover original pipeline if needed.
+    self._pipeline_snap = beam.pipeline.Pipeline.from_runner_api(
+        pipeline.to_runner_api(use_fake_coders=True),
+        pipeline.runner,
+        options)
+    # Snapshot of original pipeline information.
+    (self._original_pipeline_proto,
+     self._original_context) = self._pipeline_snap.to_runner_api(
+         return_context=True, use_fake_coders=True)
+
+    # All compute-once-against-original-pipeline fields.
+    self._has_unbounded_source = has_unbounded_source(self._pipeline_snap)
+    # TODO(BEAM-7760): once cache scope changed, this is not needed to manage
+    # relationships across pipelines, runners, and jobs.
+    self._pcolls_to_pcoll_id = pcolls_to_pcoll_id(self._pipeline_snap,
+                                                  self._original_context)
+
+    # A mapping from PCollection id to python id() value in user defined
+    # pipeline instance.
+    (self._pcoll_version_map,
+     self._cacheables) = cacheables(self.pcolls_to_pcoll_id())
+
+    # A dict from cache key to PCollection that is read from cache.
+    # If 

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-10-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=321592=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-321592
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 01/Oct/19 22:23
Start Date: 01/Oct/19 22:23
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9619: [BEAM-7760] Added pipeline_instrument module
URL: https://github.com/apache/beam/pull/9619#discussion_r330247591
 
 


[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-10-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=321591=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-321591
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 01/Oct/19 22:23
Start Date: 01/Oct/19 22:23
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9619: [BEAM-7760] Added pipeline_instrument module
URL: https://github.com/apache/beam/pull/9619#discussion_r330242462
 
 


[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-10-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=321590=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-321590
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 01/Oct/19 22:22
Start Date: 01/Oct/19 22:22
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9619: [BEAM-7760] Added pipeline_instrument module
URL: https://github.com/apache/beam/pull/9619#discussion_r330242462
 
 


[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-10-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=321588=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-321588
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 01/Oct/19 22:20
Start Date: 01/Oct/19 22:20
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9619: [BEAM-7760] Added pipeline_instrument module
URL: https://github.com/apache/beam/pull/9619#discussion_r330266912
 
 


[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-10-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=321587=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-321587
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 01/Oct/19 22:19
Start Date: 01/Oct/19 22:19
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on issue #9619: [BEAM-7760] Added pipeline_instrument module
URL: https://github.com/apache/beam/pull/9619#issuecomment-537256427
 
 
   @aaltay I've added comments and made some changes to the original PR in the 2nd commit.
   I'll draft a "vocabulary" doc so we can discuss and vote on names for the different components in an Interactive Beam notebook.
   PTAL. Thank you very much!
 



Issue Time Tracking
---

Worklog Id: (was: 321587)
Time Spent: 16h  (was: 15h 50m)

> Interactive Beam Caching PCollections bound to user defined vars in notebook
> 
>
> Key: BEAM-7760
> URL: https://issues.apache.org/jira/browse/BEAM-7760
> Project: Beam
>  Issue Type: New Feature
>  Components: examples-python
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time Spent: 16h
>  Remaining Estimate: 0h
>
> Cache only PCollections bound to user-defined variables in a pipeline when 
> running the pipeline with the interactive runner in Jupyter notebooks.
> [Interactive 
> Beam|[https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive]]
>  has been caching and using caches of "leaf" PCollections for interactive 
> execution in Jupyter notebooks.
> Interactive execution is currently supported so that when new transforms are 
> appended to an existing pipeline for a new run, the already executed part of 
> the pipeline doesn't need to be re-executed. 
> A PCollection is a "leaf" when it is never used as input to any PTransform in 
> the pipeline.
> The problem with building caches and the pipeline to execute around "leaf" 
> PCollections is that when a PCollection is consumed by a sink with no output, 
> the pipeline built for execution will miss the subgraph generating and 
> consuming that PCollection.
> For example, "ReadFromPubSub --> WriteToPubSub" will result in an empty 
> pipeline.
> Caching around PCollections bound to user-defined variables and replacing 
> transforms with sources and sinks of caches could resolve the pipeline to 
> execute properly under the interactive execution scenario. Also, a cached 
> PCollection can now be traced back to user code and used for data 
> visualization if the user wants it.
> E.g.,
> {code:java}
> # ...
> p = beam.Pipeline(interactive_runner.InteractiveRunner(),
>                   options=pipeline_options)
> messages = p | "Read" >> beam.io.ReadFromPubSub(subscription='...')
> messages | "Write" >> beam.io.WriteToPubSub(topic_path)
> result = p.run()
> # ...
> visualize(messages){code}
>  The interactive runner automatically figures out that the PCollection
> {code:java}
> messages{code}
> created by
> {code:java}
> p | "Read" >> beam.io.ReadFromPubSub(subscription='...'){code}
> should be cached and reused if the notebook user appends more transforms.
>  And once the pipeline gets executed, the user could use any 
> visualize(PCollection) module to visualize the data statically (batch) or 
> dynamically (stream).
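
The difference between the two caching strategies described in the issue can be sketched with a toy model. This is only an illustration under stated assumptions: `ToyPCollection`, `leaf_pcollections`, `variable_bound_pcollections`, and the tuple layout `(label, inputs, outputs)` are hypothetical names made up for this sketch and are not part of apache_beam or of the pipeline_instrument module in the PR.

```python
class ToyPCollection(object):
  """Hypothetical stand-in for a PCollection; not an apache_beam class."""

  def __init__(self, name):
    self.name = name


def leaf_pcollections(transforms):
  """Returns PCollections that are produced but never consumed as input."""
  produced, consumed = set(), set()
  for _label, inputs, outputs in transforms:
    consumed.update(inputs)
    produced.update(outputs)
  return produced - consumed


def variable_bound_pcollections(namespace):
  """Returns {variable name: PCollection} found in a user namespace."""
  return {name: value for name, value in namespace.items()
          if isinstance(value, ToyPCollection)}


messages = ToyPCollection('messages')
# "Read" produces messages; "Write" is a sink that consumes it and has no
# output, mirroring the ReadFromPubSub --> WriteToPubSub example.
pubsub_transforms = [('Read', [], [messages]), ('Write', [messages], [])]
# Assumed shape of a notebook namespace: variable name -> bound object.
user_namespace = {'messages': messages, 'topic_path': 'hypothetical-topic'}

leaves = leaf_pcollections(pubsub_transforms)
bound = variable_bound_pcollections(user_namespace)
print(len(leaves))
print(sorted(bound))
```

In this toy model the leaf-based selection finds nothing to cache for the read-then-write pipeline, while the variable-based selection still picks up `messages` because the user bound it to a name.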



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-10-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=321557=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-321557
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 01/Oct/19 21:38
Start Date: 01/Oct/19 21:38
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9619: [BEAM-7760] Added pipeline_instrument module
URL: https://github.com/apache/beam/pull/9619#discussion_r330288300
 
 


[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-10-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=321556=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-321556
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 01/Oct/19 21:38
Start Date: 01/Oct/19 21:38
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9619: [BEAM-7760] Added pipeline_instrument module
URL: https://github.com/apache/beam/pull/9619#discussion_r330288300
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/pipeline_instrument.py
 ##
 @@ -0,0 +1,470 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module to instrument interactivity to the given pipeline.
+
+For internal use only; no backwards-compatibility guarantees.
+This module accesses current interactive environment and analyzes given pipeline
+to transform original pipeline into a one-shot pipeline with interactivity.
+"""
+from __future__ import absolute_import
+
+import logging
+
+import apache_beam as beam
+from apache_beam.pipeline import PipelineVisitor
+from apache_beam.runners.interactive import cache_manager as cache
+from apache_beam.runners.interactive import interactive_environment as ie
+
+READ_CACHE = "_ReadCache_"
+WRITE_CACHE = "_WriteCache_"
+
+
+class PipelineInstrument(object):
+  """A pipeline instrument for pipeline to be executed by interactive runner.
+
+  This module should never depend on underlying runner that interactive runner
+  delegates. It instruments the original instance of pipeline directly by
+  appending or replacing transforms with help of cache. It provides
+  interfaces to recover states of original pipeline. It's the interactive
+  runner's responsibility to coordinate supported underlying runners to run
+  the pipeline instrumented and recover the original pipeline states if needed.
+  """
+
+  def __init__(self, pipeline, options=None):
+    self._pipeline = pipeline
+    # The cache manager should be initiated outside of this module and outside
+    # of run_pipeline() from interactive runner so that its lifespan could cover
+    # multiple runs in the interactive environment. Owned by
+    # interactive_environment module. Not owned by this module.
+    # TODO(BEAM-7760): change the scope of cache to be owned by runner or
+    # pipeline result instances because a pipeline is not 1:1 correlated to a
+    # running job. Only complete and read-only cache is valid across multiple
+    # jobs. Other cache instances should have their own scopes. Some design
+    # change should support only runner.run(pipeline) pattern rather than
+    # pipeline.run([runner]) and a runner can only run at most one pipeline at a
+    # time. Otherwise, result returned by run() is the only 1:1 anchor.
+    self._cache_manager = ie.current_env().cache_manager()
+
+    # Invoke a round trip through the runner API. This makes sure the Pipeline
+    # proto is stable. The snapshot of pipeline will not be mutated within this
+    # module and can be used to recover original pipeline if needed.
+    self._pipeline_snap = beam.pipeline.Pipeline.from_runner_api(
+        pipeline.to_runner_api(use_fake_coders=True),
+        pipeline.runner,
+        options)
+    # Snapshot of original pipeline information.
+    (self._original_pipeline_proto,
+     self._original_context) = self._pipeline_snap.to_runner_api(
+         return_context=True, use_fake_coders=True)
+
+    # All compute-once-against-original-pipeline fields.
+    self._has_unbounded_source = has_unbounded_source(self._pipeline_snap)
+    # TODO(BEAM-7760): once cache scope changed, this is not needed to manage
+    # relationships across pipelines, runners, and jobs.
+    self._pcolls_to_pcoll_id = pcolls_to_pcoll_id(self._pipeline_snap,
+                                                  self._original_context)
+
+    # A mapping from PCollection id to python id() value in user defined
+    # pipeline instance.
+    (self._pcoll_version_map,
+     self._cacheables) = cacheables(self.pcolls_to_pcoll_id())
+
+    # A dict from cache key to PCollection that is read from cache.
+    # If 

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-10-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=321550=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-321550
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 01/Oct/19 21:31
Start Date: 01/Oct/19 21:31
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9619: [BEAM-7760] Added pipeline_instrument module
URL: https://github.com/apache/beam/pull/9619#discussion_r330285605
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/pipeline_instrument.py
 ##

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-10-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=321532=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-321532
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 01/Oct/19 21:09
Start Date: 01/Oct/19 21:09
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9619: [BEAM-7760] Added pipeline_instrument module
URL: https://github.com/apache/beam/pull/9619#discussion_r330201369
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/pipeline_instrument.py
 ##

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-10-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=321531=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-321531
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 01/Oct/19 21:09
Start Date: 01/Oct/19 21:09
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9619: [BEAM-7760] Added pipeline_instrument module
URL: https://github.com/apache/beam/pull/9619#discussion_r330201369
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/pipeline_instrument.py
 ##

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-10-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=321529=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-321529
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 01/Oct/19 21:04
Start Date: 01/Oct/19 21:04
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9619: [BEAM-7760] Added pipeline_instrument module
URL: https://github.com/apache/beam/pull/9619#discussion_r330266912
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/pipeline_instrument.py
 ##

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-10-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=321523=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-321523
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 01/Oct/19 20:57
Start Date: 01/Oct/19 20:57
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9619: [BEAM-7760] Added pipeline_instrument module
URL: https://github.com/apache/beam/pull/9619#discussion_r330202786
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/pipeline_instrument.py
 ##

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-10-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=321521=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-321521
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 01/Oct/19 20:53
Start Date: 01/Oct/19 20:53
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9619: [BEAM-7760] Added pipeline_instrument module
URL: https://github.com/apache/beam/pull/9619#discussion_r330197374
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/pipeline_instrument.py
 ##
 @@ -0,0 +1,470 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module to instrument interactivity to the given pipeline.
+
+For internal use only; no backwards-compatibility guarantees.
+This module accesses current interactive environment and analyzes given pipeline
+to transform original pipeline into a one-shot pipeline with interactivity.
+"""
+from __future__ import absolute_import
+
+import logging
+
+import apache_beam as beam
+from apache_beam.pipeline import PipelineVisitor
+from apache_beam.runners.interactive import cache_manager as cache
+from apache_beam.runners.interactive import interactive_environment as ie
+
+READ_CACHE = "_ReadCache_"
+WRITE_CACHE = "_WriteCache_"
+
+
+class PipelineInstrument(object):
+  """A pipeline instrument for pipeline to be executed by interactive runner.
+
+  This module should never depend on underlying runner that interactive runner
+  delegates. It instruments the original instance of pipeline directly by
+  appending or replacing transforms with help of cache. It provides
+  interfaces to recover states of original pipeline. It's the interactive
+  runner's responsibility to coordinate supported underlying runners to run
+  the pipeline instrumented and recover the original pipeline states if needed.
+  """
+
+  def __init__(self, pipeline, options=None):
+    self._pipeline = pipeline
+    # The cache manager should be initiated outside of this module and outside
+    # of run_pipeline() from interactive runner so that its lifespan could cover
+    # multiple runs in the interactive environment. Owned by
+    # interactive_environment module. Not owned by this module.
+    # TODO(BEAM-7760): change the scope of cache to be owned by runner or
+    # pipeline result instances because a pipeline is not 1:1 correlated to a
+    # running job. Only complete and read-only cache is valid across multiple
+    # jobs. Other cache instances should have their own scopes. Some design
+    # change should support only runner.run(pipeline) pattern rather than
+    # pipeline.run([runner]) and a runner can only run at most one pipeline at a
+    # time. Otherwise, result returned by run() is the only 1:1 anchor.
+    self._cache_manager = ie.current_env().cache_manager()
+
+    # Invoke a round trip through the runner API. This makes sure the Pipeline
+    # proto is stable. The snapshot of pipeline will not be mutated within this
+    # module and can be used to recover original pipeline if needed.
+    self._pipeline_snap = beam.pipeline.Pipeline.from_runner_api(
+        pipeline.to_runner_api(use_fake_coders=True),
+        pipeline.runner,
+        options)
+    # Snapshot of original pipeline information.
+    (self._original_pipeline_proto,
+     self._original_context) = self._pipeline_snap.to_runner_api(
 
 Review comment:
   Inside this module, the pipeline instance is actually modified (read/write
cache transforms are added to it). This approach takes a snapshot of the
original pipeline before mutating it, so that the invoking [Interactive]Runner
can always recover the original pipeline.
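
The snapshot-before-mutation idea can be sketched without Beam. This is a minimal illustration only: `ToyPipeline`, `ToyInstrument`, `to_proto()` and `from_proto()` are hypothetical names that stand in for the pipeline, the instrument and the `to_runner_api()`/`from_runner_api()` round trip; they are not part of the PR or of Beam.

```python
import json


class ToyPipeline(object):
  """Hypothetical stand-in for a pipeline; not a Beam class."""

  def __init__(self, transforms=None):
    self.transforms = list(transforms or [])

  def to_proto(self):
    # Stand-in for to_runner_api(): serialize to an immutable string.
    return json.dumps({'transforms': self.transforms})

  @classmethod
  def from_proto(cls, proto):
    # Stand-in for from_runner_api(): build a new, independent instance.
    return cls(json.loads(proto)['transforms'])


class ToyInstrument(object):
  """Hypothetical instrument that mutates a pipeline but keeps a snapshot."""

  def __init__(self, pipeline):
    self._pipeline = pipeline
    # Round trip first, so the snapshot shares no state with the pipeline.
    self._snapshot = ToyPipeline.from_proto(pipeline.to_proto())

  def instrument(self):
    # Mutate the received instance in place by appending a cache transform.
    self._pipeline.transforms.append('_WriteCache_')

  def original(self):
    # Rebuild from the snapshot so callers cannot mutate the snapshot itself.
    return ToyPipeline.from_proto(self._snapshot.to_proto())


user_pipeline = ToyPipeline(['Read', 'Square'])
instrument = ToyInstrument(user_pipeline)
instrument.instrument()

instrumented_transforms = user_pipeline.transforms
recovered_transforms = instrument.original().transforms
print(instrumented_transforms)
print(recovered_transforms)
```

Because the snapshot is taken before any cache transform is appended, the recovered pipeline contains only the transforms that were present when the instrument was constructed, while the instance passed in carries the instrumentation.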
   
   Note: the pipeline instance received by this module comes from the runner.
It is highly likely that the runner itself has already done a round trip
between the pipeline and its runner_api proto.
   For example, the pipeline instance that the notebook user has defined (and
continues to develop) in their notebook is not the same instance as the one
used during run_pipeline(pipeline) by InteractiveRunner; the latter is a copy.
   This module will receive that copied pipeline instance, 

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-10-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=321519&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-321519
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 01/Oct/19 20:47
Start Date: 01/Oct/19 20:47
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9619: [BEAM-7760] 
Added pipeline_instrument module
URL: https://github.com/apache/beam/pull/9619#discussion_r330201369
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/pipeline_instrument.py
 ##
 @@ -0,0 +1,470 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module to instrument interactivity to the given pipeline.
+
+For internal use only; no backwards-compatibility guarantees.
+This module accesses current interactive environment and analyzes given pipeline
+to transform original pipeline into a one-shot pipeline with interactivity.
+"""
+from __future__ import absolute_import
+
+import logging
+
+import apache_beam as beam
+from apache_beam.pipeline import PipelineVisitor
+from apache_beam.runners.interactive import cache_manager as cache
+from apache_beam.runners.interactive import interactive_environment as ie
+
+READ_CACHE = "_ReadCache_"
+WRITE_CACHE = "_WriteCache_"
+
+
+class PipelineInstrument(object):
+  """A pipeline instrument for pipeline to be executed by interactive runner.
+
+  This module should never depend on underlying runner that interactive runner
+  delegates. It instruments the original instance of pipeline directly by
+  appending or replacing transforms with help of cache. It provides
+  interfaces to recover states of original pipeline. It's the interactive
+  runner's responsibility to coordinate supported underlying runners to run
+  the pipeline instrumented and recover the original pipeline states if needed.
+  """
+
+  def __init__(self, pipeline, options=None):
+    self._pipeline = pipeline
+    # The cache manager should be initiated outside of this module and outside
+    # of run_pipeline() from interactive runner so that its lifespan could cover
+    # multiple runs in the interactive environment. Owned by
+    # interactive_environment module. Not owned by this module.
+    # TODO(BEAM-7760): change the scope of cache to be owned by runner or
+    # pipeline result instances because a pipeline is not 1:1 correlated to a
+    # running job. Only complete and read-only cache is valid across multiple
+    # jobs. Other cache instances should have their own scopes. Some design
+    # change should support only runner.run(pipeline) pattern rather than
+    # pipeline.run([runner]) and a runner can only run at most one pipeline at a
+    # time. Otherwise, result returned by run() is the only 1:1 anchor.
+    self._cache_manager = ie.current_env().cache_manager()
+
+    # Invoke a round trip through the runner API. This makes sure the Pipeline
+    # proto is stable. The snapshot of pipeline will not be mutated within this
+    # module and can be used to recover original pipeline if needed.
+    self._pipeline_snap = beam.pipeline.Pipeline.from_runner_api(
+        pipeline.to_runner_api(use_fake_coders=True),
+        pipeline.runner,
+        options)
+    # Snapshot of original pipeline information.
+    (self._original_pipeline_proto,
+     self._original_context) = self._pipeline_snap.to_runner_api(
+         return_context=True, use_fake_coders=True)
+
+    # All compute-once-against-original-pipeline fields.
+    self._has_unbounded_source = has_unbounded_source(self._pipeline_snap)
+    # TODO(BEAM-7760): once cache scope changed, this is not needed to manage
+    # relationships across pipelines, runners, and jobs.
+    self._pcolls_to_pcoll_id = pcolls_to_pcoll_id(self._pipeline_snap,
+                                                  self._original_context)
+
+    # A mapping from PCollection id to python id() value in user defined
+    # pipeline instance.
+    (self._pcoll_version_map,
+     self._cacheables) = cacheables(self.pcolls_to_pcoll_id())
+
+    # A dict from cache key to PCollection that is read from cache.
+    # If 

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-10-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=321518&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-321518
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 01/Oct/19 20:46
Start Date: 01/Oct/19 20:46
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9619: [BEAM-7760] 
Added pipeline_instrument module
URL: https://github.com/apache/beam/pull/9619#discussion_r330266912
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/pipeline_instrument.py
 ##
 @@ -0,0 +1,470 @@

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-10-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=321506&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-321506
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 01/Oct/19 20:32
Start Date: 01/Oct/19 20:32
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9619: [BEAM-7760] 
Added pipeline_instrument module
URL: https://github.com/apache/beam/pull/9619#discussion_r330201369
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/pipeline_instrument.py
 ##
 @@ -0,0 +1,470 @@

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-10-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=321487&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-321487
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 01/Oct/19 20:00
Start Date: 01/Oct/19 20:00
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9619: [BEAM-7760] 
Added pipeline_instrument module
URL: https://github.com/apache/beam/pull/9619#discussion_r330247591
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/pipeline_instrument.py
 ##
 @@ -0,0 +1,470 @@

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-10-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=321481&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-321481
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 01/Oct/19 19:52
Start Date: 01/Oct/19 19:52
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9619: [BEAM-7760] 
Added pipeline_instrument module
URL: https://github.com/apache/beam/pull/9619#discussion_r330244341
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/pipeline_instrument.py
 ##
 @@ -0,0 +1,470 @@

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-10-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=321477&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-321477
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 01/Oct/19 19:50
Start Date: 01/Oct/19 19:50
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9619: [BEAM-7760] 
Added pipeline_instrument module
URL: https://github.com/apache/beam/pull/9619#discussion_r330242462
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/pipeline_instrument.py
 ##
 @@ -0,0 +1,470 @@

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-10-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=321476&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-321476
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 01/Oct/19 19:50
Start Date: 01/Oct/19 19:50
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9619: [BEAM-7760] 
Added pipeline_instrument module
URL: https://github.com/apache/beam/pull/9619#discussion_r330242462
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/pipeline_instrument.py
 ##
 @@ -0,0 +1,470 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module to instrument interactivity to the given pipeline.
+
+For internal use only; no backwards-compatibility guarantees.
+This module accesses current interactive environment and analyzes given pipeline
+to transform original pipeline into a one-shot pipeline with interactivity.
+"""
+from __future__ import absolute_import
+
+import logging
+
+import apache_beam as beam
+from apache_beam.pipeline import PipelineVisitor
+from apache_beam.runners.interactive import cache_manager as cache
+from apache_beam.runners.interactive import interactive_environment as ie
+
+READ_CACHE = "_ReadCache_"
+WRITE_CACHE = "_WriteCache_"
+
+
+class PipelineInstrument(object):
+  """A pipeline instrument for pipeline to be executed by interactive runner.
+
+  This module should never depend on underlying runner that interactive runner
+  delegates. It instruments the original instance of pipeline directly by
+  appending or replacing transforms with help of cache. It provides
+  interfaces to recover states of original pipeline. It's the interactive
+  runner's responsibility to coordinate supported underlying runners to run
+  the pipeline instrumented and recover the original pipeline states if needed.
+  """
+
+  def __init__(self, pipeline, options=None):
+    self._pipeline = pipeline
+    # The cache manager should be initiated outside of this module and outside
+    # of run_pipeline() from interactive runner so that its lifespan could cover
+    # multiple runs in the interactive environment. Owned by
+    # interactive_environment module. Not owned by this module.
+    # TODO(BEAM-7760): change the scope of cache to be owned by runner or
+    # pipeline result instances because a pipeline is not 1:1 correlated to a
+    # running job. Only complete and read-only cache is valid across multiple
+    # jobs. Other cache instances should have their own scopes. Some design
+    # change should support only runner.run(pipeline) pattern rather than
+    # pipeline.run([runner]) and a runner can only run at most one pipeline at a
+    # time. Otherwise, result returned by run() is the only 1:1 anchor.
+    self._cache_manager = ie.current_env().cache_manager()
+
+    # Invoke a round trip through the runner API. This makes sure the Pipeline
+    # proto is stable. The snapshot of pipeline will not be mutated within this
+    # module and can be used to recover original pipeline if needed.
+    self._pipeline_snap = beam.pipeline.Pipeline.from_runner_api(
+        pipeline.to_runner_api(use_fake_coders=True),
+        pipeline.runner,
+        options)
+    # Snapshot of original pipeline information.
+    (self._original_pipeline_proto,
+     self._original_context) = self._pipeline_snap.to_runner_api(
+         return_context=True, use_fake_coders=True)
+
+    # All compute-once-against-original-pipeline fields.
+    self._has_unbounded_source = has_unbounded_source(self._pipeline_snap)
+    # TODO(BEAM-7760): once cache scope changed, this is not needed to manage
+    # relationships across pipelines, runners, and jobs.
+    self._pcolls_to_pcoll_id = pcolls_to_pcoll_id(self._pipeline_snap,
+                                                  self._original_context)
+
+    # A mapping from PCollection id to python id() value in user defined
+    # pipeline instance.
+    (self._pcoll_version_map,
+     self._cacheables) = cacheables(self.pcolls_to_pcoll_id())
+
+    # A dict from cache key to PCollection that is read from cache.
+    # If 
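
The quoted constructor ends by computing the set of "cacheables", which per the issue description should be only the PCollections bound to user-defined variables. A minimal sketch of that idea follows, assuming a stand-in `FakePCollection` class and a hypothetical `find_cacheables` helper. Neither name comes from Beam, and the `cacheables()` function in the PR may well work differently; the sketch only shows how variable names and python `id()` values could be combined into cache keys.

```python
class FakePCollection(object):
  """Stand-in for a Beam PCollection (assumption, not the real class)."""

  def __init__(self, label):
    self.label = label


def find_cacheables(namespace, pcoll_type=FakePCollection):
  """Returns a dict of cache key -> info for PCollections bound to variables.

  Hypothetical helper: the cache key joins the variable name with the
  python id() of the object, so rebinding the name to a new object yields
  a new key.
  """
  cacheables = {}
  for name, value in namespace.items():
    # Skip private names, e.g. notebook bookkeeping variables.
    if name.startswith('_'):
      continue
    if isinstance(value, pcoll_type):
      version = str(id(value))
      cacheables['%s_%s' % (name, version)] = {
          'var': name,
          'version': version,
          'pcoll': value,
      }
  return cacheables


# Two PCollections are bound to variables; a third is created but never
# bound, so it cannot appear in the namespace and is not cached.
init = FakePCollection('Init')
squares = FakePCollection('Square')
FakePCollection('Unbound')
notebook_vars = {'init': init, 'squares': squares, 'n': 3}

found = find_cacheables(notebook_vars)
print(sorted(entry['var'] for entry in found.values()))
# prints ['init', 'squares']
```

In a notebook, the namespace argument would presumably be the user's global variables; that choice is also an assumption of this sketch.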

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-10-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=321473=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-321473
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 01/Oct/19 19:49
Start Date: 01/Oct/19 19:49
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9619: [BEAM-7760] Added pipeline_instrument module
URL: https://github.com/apache/beam/pull/9619#discussion_r330242462
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/pipeline_instrument.py
 ##

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-10-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=321475=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-321475
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 01/Oct/19 19:49
Start Date: 01/Oct/19 19:49
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9619: [BEAM-7760] Added pipeline_instrument module
URL: https://github.com/apache/beam/pull/9619#discussion_r330242462
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/pipeline_instrument.py
 ##

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-10-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=321472=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-321472
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 01/Oct/19 19:48
Start Date: 01/Oct/19 19:48
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9619: [BEAM-7760] Added pipeline_instrument module
URL: https://github.com/apache/beam/pull/9619#discussion_r330242462
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/pipeline_instrument.py
 ##

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-10-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=321423=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-321423
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 01/Oct/19 18:20
Start Date: 01/Oct/19 18:20
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9619: [BEAM-7760] Added pipeline_instrument module
URL: https://github.com/apache/beam/pull/9619#discussion_r330203922
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/pipeline_instrument.py
 ##

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-10-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=321420=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-321420
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 01/Oct/19 18:17
Start Date: 01/Oct/19 18:17
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9619: [BEAM-7760] Added pipeline_instrument module
URL: https://github.com/apache/beam/pull/9619#discussion_r330202786
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/pipeline_instrument.py
 ##

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-10-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=321417=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-321417
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 01/Oct/19 18:14
Start Date: 01/Oct/19 18:14
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9619: [BEAM-7760] Added pipeline_instrument module
URL: https://github.com/apache/beam/pull/9619#discussion_r330201369
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/pipeline_instrument.py
 ##

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-10-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=321416&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-321416
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 01/Oct/19 18:12
Start Date: 01/Oct/19 18:12
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9619: [BEAM-7760] 
Added pipeline_instrument module
URL: https://github.com/apache/beam/pull/9619#discussion_r330200294
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/pipeline_instrument.py
 ##
 @@ -0,0 +1,470 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module to instrument interactivity to the given pipeline.
+
+For internal use only; no backwards-compatibility guarantees.
+This module accesses current interactive environment and analyzes given pipeline
+to transform original pipeline into a one-shot pipeline with interactivity.
+"""
+from __future__ import absolute_import
+
+import logging
+
+import apache_beam as beam
+from apache_beam.pipeline import PipelineVisitor
+from apache_beam.runners.interactive import cache_manager as cache
+from apache_beam.runners.interactive import interactive_environment as ie
+
+READ_CACHE = "_ReadCache_"
+WRITE_CACHE = "_WriteCache_"
+
+
+class PipelineInstrument(object):
+  """A pipeline instrument for pipeline to be executed by interactive runner.
+
+  This module should never depend on underlying runner that interactive runner
+  delegates. It instruments the original instance of pipeline directly by
+  appending or replacing transforms with help of cache. It provides
+  interfaces to recover states of original pipeline. It's the interactive
+  runner's responsibility to coordinate supported underlying runners to run
+  the pipeline instrumented and recover the original pipeline states if needed.
+  """
+
+  def __init__(self, pipeline, options=None):
+    self._pipeline = pipeline
+    # The cache manager should be initiated outside of this module and outside
+    # of run_pipeline() from interactive runner so that its lifespan could cover
+    # multiple runs in the interactive environment. Owned by
+    # interactive_environment module. Not owned by this module.
+    # TODO(BEAM-7760): change the scope of cache to be owned by runner or
+    # pipeline result instances because a pipeline is not 1:1 correlated to a
+    # running job. Only complete and read-only cache is valid across multiple
+    # jobs. Other cache instances should have their own scopes. Some design
+    # change should support only runner.run(pipeline) pattern rather than
+    # pipeline.run([runner]) and a runner can only run at most one pipeline at a
+    # time. Otherwise, result returned by run() is the only 1:1 anchor.
+    self._cache_manager = ie.current_env().cache_manager()
+
+    # Invoke a round trip through the runner API. This makes sure the Pipeline
+    # proto is stable. The snapshot of pipeline will not be mutated within this
+    # module and can be used to recover original pipeline if needed.
+    self._pipeline_snap = beam.pipeline.Pipeline.from_runner_api(
+        pipeline.to_runner_api(use_fake_coders=True),
+        pipeline.runner,
+        options)
+    # Snapshot of original pipeline information.
+    (self._original_pipeline_proto,
+     self._original_context) = self._pipeline_snap.to_runner_api(
+         return_context=True, use_fake_coders=True)
+
+    # All compute-once-against-original-pipeline fields.
+    self._has_unbounded_source = has_unbounded_source(self._pipeline_snap)
+    # TODO(BEAM-7760): once cache scope changed, this is not needed to manage
+    # relationships across pipelines, runners, and jobs.
+    self._pcolls_to_pcoll_id = pcolls_to_pcoll_id(self._pipeline_snap,
+                                                  self._original_context)
+
+    # A mapping from PCollection id to python id() value in user defined
+    # pipeline instance.
+    (self._pcoll_version_map,
+     self._cacheables) = cacheables(self.pcolls_to_pcoll_id())
+
+    # A dict from cache key to PCollection that is read from cache.
+    # If 

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-10-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=321411&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-321411
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 01/Oct/19 18:05
Start Date: 01/Oct/19 18:05
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9619: [BEAM-7760] 
Added pipeline_instrument module
URL: https://github.com/apache/beam/pull/9619#discussion_r330197374
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/pipeline_instrument.py
 ##
 @@ -0,0 +1,470 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module to instrument interactivity to the given pipeline.
+
+For internal use only; no backwards-compatibility guarantees.
+This module accesses current interactive environment and analyzes given pipeline
+to transform original pipeline into a one-shot pipeline with interactivity.
+"""
+from __future__ import absolute_import
+
+import logging
+
+import apache_beam as beam
+from apache_beam.pipeline import PipelineVisitor
+from apache_beam.runners.interactive import cache_manager as cache
+from apache_beam.runners.interactive import interactive_environment as ie
+
+READ_CACHE = "_ReadCache_"
+WRITE_CACHE = "_WriteCache_"
+
+
+class PipelineInstrument(object):
+  """A pipeline instrument for pipeline to be executed by interactive runner.
+
+  This module should never depend on underlying runner that interactive runner
+  delegates. It instruments the original instance of pipeline directly by
+  appending or replacing transforms with help of cache. It provides
+  interfaces to recover states of original pipeline. It's the interactive
+  runner's responsibility to coordinate supported underlying runners to run
+  the pipeline instrumented and recover the original pipeline states if needed.
+  """
+
+  def __init__(self, pipeline, options=None):
+    self._pipeline = pipeline
+    # The cache manager should be initiated outside of this module and outside
+    # of run_pipeline() from interactive runner so that its lifespan could cover
+    # multiple runs in the interactive environment. Owned by
+    # interactive_environment module. Not owned by this module.
+    # TODO(BEAM-7760): change the scope of cache to be owned by runner or
+    # pipeline result instances because a pipeline is not 1:1 correlated to a
+    # running job. Only complete and read-only cache is valid across multiple
+    # jobs. Other cache instances should have their own scopes. Some design
+    # change should support only runner.run(pipeline) pattern rather than
+    # pipeline.run([runner]) and a runner can only run at most one pipeline at a
+    # time. Otherwise, result returned by run() is the only 1:1 anchor.
+    self._cache_manager = ie.current_env().cache_manager()
+
+    # Invoke a round trip through the runner API. This makes sure the Pipeline
+    # proto is stable. The snapshot of pipeline will not be mutated within this
+    # module and can be used to recover original pipeline if needed.
+    self._pipeline_snap = beam.pipeline.Pipeline.from_runner_api(
+        pipeline.to_runner_api(use_fake_coders=True),
+        pipeline.runner,
+        options)
+    # Snapshot of original pipeline information.
+    (self._original_pipeline_proto,
+     self._original_context) = self._pipeline_snap.to_runner_api(
 
 Review comment:
   Inside this module, the pipeline instance is actually modified (with
   read/write cache transforms). This approach takes a snapshot of the original
   pipeline before mutating it so that the invoking [Interactive]Runner can
   always recover the original pipeline.
   
   Note: the pipeline instance received by this module comes from the runner.
   It's highly likely that the runner itself has already done a round trip
   between the pipeline and the runner_api proto.
   For example, the pipeline instance the notebook user has defined (and
   continues developing) in their notebook will not be the same instance (but a
   copy) used during run_pipeline(pipeline) by InteractiveRunner.
   This module will receive that copied pipeline instance, 
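
The snapshot-before-mutation idea described in the comment above can be sketched in plain Python. This is only an illustration under stated assumptions: `ToyPipeline`, `SnapshotInstrument`, `to_proto` and `from_proto` are hypothetical stand-ins, not Beam APIs, and the tuple merely plays the role of the serialized runner API proto.

```python
class ToyPipeline(object):
  """Hypothetical stand-in for a pipeline: an ordered list of labels."""

  def __init__(self, transforms=None):
    self.transforms = list(transforms or [])

  def to_proto(self):
    # Stand-in for serialization: an immutable tuple of the labels.
    return tuple(self.transforms)

  @classmethod
  def from_proto(cls, proto):
    # Stand-in for rebuilding a pipeline from its serialized form.
    return cls(proto)


class SnapshotInstrument(object):
  """Takes a snapshot on construction, then mutates the given instance."""

  def __init__(self, pipeline):
    self._pipeline = pipeline
    # Round trip through the serialized form, so the snapshot shares no
    # mutable state with the instance that is about to be modified.
    self._snapshot = ToyPipeline.from_proto(pipeline.to_proto())

  def instrument(self):
    # Mutate the received instance in place, e.g. append a cache write.
    self._pipeline.transforms.append('_WriteCache_')

  def original(self):
    # Hand out a fresh copy so callers cannot alter the snapshot.
    return ToyPipeline.from_proto(self._snapshot.to_proto())
```

In this sketch the caller keeps working with the instrumented instance, while `original()` always returns the pre-instrumentation state.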

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-10-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=321404&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-321404
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 01/Oct/19 17:58
Start Date: 01/Oct/19 17:58
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9619: [BEAM-7760] 
Added pipeline_instrument module
URL: https://github.com/apache/beam/pull/9619#discussion_r330193612
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/pipeline_instrument.py
 ##
 @@ -0,0 +1,470 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module to instrument interactivity to the given pipeline.
+
+For internal use only; no backwards-compatibility guarantees.
+This module accesses current interactive environment and analyzes given pipeline
+to transform original pipeline into a one-shot pipeline with interactivity.
+"""
+from __future__ import absolute_import
+
+import logging
+
+import apache_beam as beam
+from apache_beam.pipeline import PipelineVisitor
+from apache_beam.runners.interactive import cache_manager as cache
+from apache_beam.runners.interactive import interactive_environment as ie
+
+READ_CACHE = "_ReadCache_"
+WRITE_CACHE = "_WriteCache_"
+
+
+class PipelineInstrument(object):
+  """A pipeline instrument for pipeline to be executed by interactive runner.
+
+  This module should never depend on underlying runner that interactive runner
+  delegates. It instruments the original instance of pipeline directly by
+  appending or replacing transforms with help of cache. It provides
+  interfaces to recover states of original pipeline. It's the interactive
+  runner's responsibility to coordinate supported underlying runners to run
+  the pipeline instrumented and recover the original pipeline states if needed.
+  """
+
+  def __init__(self, pipeline, options=None):
+    self._pipeline = pipeline
+    # The cache manager should be initiated outside of this module and outside
+    # of run_pipeline() from interactive runner so that its lifespan could cover
+    # multiple runs in the interactive environment. Owned by
+    # interactive_environment module. Not owned by this module.
+    # TODO(BEAM-7760): change the scope of cache to be owned by runner or
+    # pipeline result instances because a pipeline is not 1:1 correlated to a
+    # running job. Only complete and read-only cache is valid across multiple
+    # jobs. Other cache instances should have their own scopes. Some design
+    # change should support only runner.run(pipeline) pattern rather than
+    # pipeline.run([runner]) and a runner can only run at most one pipeline at a
+    # time. Otherwise, result returned by run() is the only 1:1 anchor.
+    self._cache_manager = ie.current_env().cache_manager()
 
 Review comment:
   There is some work from Alexey to decouple the caching from a centralized
   cache manager instance, so a cache manager might not exist anymore. This is
   done to avoid exposing it in the constructor. When the new caching modules
   are checked in, this can be swapped out.
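
The design choice in this comment (looking the cache manager up from the environment instead of taking it as a constructor argument) can be sketched as below. All names here are hypothetical toy classes; only the `current_env().cache_manager()` call shape is borrowed from the diff above.

```python
class ToyCacheManager(object):
  """Hypothetical cache: a dict from cache key to cached value."""

  def __init__(self):
    self._store = {}

  def write(self, key, value):
    self._store[key] = value

  def exists(self, key):
    return key in self._store


class ToyEnvironment(object):
  """Hypothetical environment that owns the cache manager."""

  def __init__(self):
    self._cache_manager = ToyCacheManager()

  def cache_manager(self):
    return self._cache_manager


_CURRENT_ENV = ToyEnvironment()


def current_env():
  # Module-level accessor, loosely mirroring ie.current_env() in the diff.
  return _CURRENT_ENV


class EnvBackedInstrument(object):
  """Obtains its cache from the environment, not from its caller."""

  def __init__(self, pipeline):
    self._pipeline = pipeline
    # Looked up here rather than passed in, so replacing the caching
    # implementation later does not change this constructor's signature.
    self._cache_manager = current_env().cache_manager()
```

Because the cache lives in the environment, two instruments created for different runs see the same cached entries in this sketch.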
 



Issue Time Tracking
---

Worklog Id: (was: 321404)
Time Spent: 11h 50m  (was: 11h 40m)

> Interactive Beam Caching PCollections bound to user defined vars in notebook
> 
>
> Key: BEAM-7760
> URL: https://issues.apache.org/jira/browse/BEAM-7760
> Project: Beam
>  Issue Type: New Feature
>  Components: examples-python
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time Spent: 11h 50m
>  Remaining Estimate: 0h
>
> Cache only PCollections bound to user defined variables in a pipeline 

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-10-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=321403&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-321403
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 01/Oct/19 17:58
Start Date: 01/Oct/19 17:58
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9619: [BEAM-7760] 
Added pipeline_instrument module
URL: https://github.com/apache/beam/pull/9619#discussion_r330193612
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/pipeline_instrument.py
 ##
 @@ -0,0 +1,470 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module to instrument interactivity to the given pipeline.
+
+For internal use only; no backwards-compatibility guarantees.
+This module accesses current interactive environment and analyzes given pipeline
+to transform original pipeline into a one-shot pipeline with interactivity.
+"""
+from __future__ import absolute_import
+
+import logging
+
+import apache_beam as beam
+from apache_beam.pipeline import PipelineVisitor
+from apache_beam.runners.interactive import cache_manager as cache
+from apache_beam.runners.interactive import interactive_environment as ie
+
+READ_CACHE = "_ReadCache_"
+WRITE_CACHE = "_WriteCache_"
+
+
+class PipelineInstrument(object):
+  """A pipeline instrument for pipeline to be executed by interactive runner.
+
+  This module should never depend on underlying runner that interactive runner
+  delegates. It instruments the original instance of pipeline directly by
+  appending or replacing transforms with help of cache. It provides
+  interfaces to recover states of original pipeline. It's the interactive
+  runner's responsibility to coordinate supported underlying runners to run
+  the pipeline instrumented and recover the original pipeline states if needed.
+  """
+
+  def __init__(self, pipeline, options=None):
+    self._pipeline = pipeline
+    # The cache manager should be initiated outside of this module and outside
+    # of run_pipeline() from interactive runner so that its lifespan could cover
+    # multiple runs in the interactive environment. Owned by
+    # interactive_environment module. Not owned by this module.
+    # TODO(BEAM-7760): change the scope of cache to be owned by runner or
+    # pipeline result instances because a pipeline is not 1:1 correlated to a
+    # running job. Only complete and read-only cache is valid across multiple
+    # jobs. Other cache instances should have their own scopes. Some design
+    # change should support only runner.run(pipeline) pattern rather than
+    # pipeline.run([runner]) and a runner can only run at most one pipeline at a
+    # time. Otherwise, result returned by run() is the only 1:1 anchor.
+    self._cache_manager = ie.current_env().cache_manager()
 
 Review comment:
   There is some work from Alexey to decouple the caching from a centralized
   cache manager instance, so a cache manager might not exist anymore. This is
   done to avoid exposing it in the constructor. When the new caching modules
   are checked in, swap this implicitly.
 



Issue Time Tracking
---

Worklog Id: (was: 321403)
Time Spent: 11h 40m  (was: 11.5h)

> Interactive Beam Caching PCollections bound to user defined vars in notebook
> 
>
> Key: BEAM-7760
> URL: https://issues.apache.org/jira/browse/BEAM-7760
> Project: Beam
>  Issue Type: New Feature
>  Components: examples-python
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time Spent: 11h 40m
>  Remaining Estimate: 0h
>
> Cache only PCollections bound to user defined variables in a pipeline 

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-10-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=321402&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-321402
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 01/Oct/19 17:57
Start Date: 01/Oct/19 17:57
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9619: [BEAM-7760] 
Added pipeline_instrument module
URL: https://github.com/apache/beam/pull/9619#discussion_r330193612
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/pipeline_instrument.py
 ##
 @@ -0,0 +1,470 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module to instrument interactivity to the given pipeline.
+
+For internal use only; no backwards-compatibility guarantees.
+This module accesses current interactive environment and analyzes given pipeline
+to transform original pipeline into a one-shot pipeline with interactivity.
+"""
+from __future__ import absolute_import
+
+import logging
+
+import apache_beam as beam
+from apache_beam.pipeline import PipelineVisitor
+from apache_beam.runners.interactive import cache_manager as cache
+from apache_beam.runners.interactive import interactive_environment as ie
+
+READ_CACHE = "_ReadCache_"
+WRITE_CACHE = "_WriteCache_"
+
+
+class PipelineInstrument(object):
+  """A pipeline instrument for pipeline to be executed by interactive runner.
+
+  This module should never depend on underlying runner that interactive runner
+  delegates. It instruments the original instance of pipeline directly by
+  appending or replacing transforms with help of cache. It provides
+  interfaces to recover states of original pipeline. It's the interactive
+  runner's responsibility to coordinate supported underlying runners to run
+  the pipeline instrumented and recover the original pipeline states if needed.
+  """
+
+  def __init__(self, pipeline, options=None):
+    self._pipeline = pipeline
+    # The cache manager should be initiated outside of this module and outside
+    # of run_pipeline() from interactive runner so that its lifespan could cover
+    # multiple runs in the interactive environment. Owned by
+    # interactive_environment module. Not owned by this module.
+    # TODO(BEAM-7760): change the scope of cache to be owned by runner or
+    # pipeline result instances because a pipeline is not 1:1 correlated to a
+    # running job. Only complete and read-only cache is valid across multiple
+    # jobs. Other cache instances should have their own scopes. Some design
+    # change should support only runner.run(pipeline) pattern rather than
+    # pipeline.run([runner]) and a runner can only run at most one pipeline at a
+    # time. Otherwise, result returned by run() is the only 1:1 anchor.
+    self._cache_manager = ie.current_env().cache_manager()
 
 Review comment:
   There is some work from Alexey to decouple the caching from a centralized
   cache manager instance, so a cache manager might not exist anymore. This is
   done to avoid exposing it in the constructor.
 



Issue Time Tracking
---

Worklog Id: (was: 321402)
Time Spent: 11.5h  (was: 11h 20m)

> Interactive Beam Caching PCollections bound to user defined vars in notebook
> 
>
> Key: BEAM-7760
> URL: https://issues.apache.org/jira/browse/BEAM-7760
> Project: Beam
>  Issue Type: New Feature
>  Components: examples-python
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time Spent: 11.5h
>  Remaining Estimate: 0h
>
> Cache only PCollections bound to user defined variables in a pipeline when 
> running pipeline with interactive runner in jupyter 

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-09-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=319068&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319068
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 26/Sep/19 16:34
Start Date: 26/Sep/19 16:34
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on issue #9619: [BEAM-7760] Added 
pipeline_instrument module
URL: https://github.com/apache/beam/pull/9619#issuecomment-535585079
 
 
   @aaltay Thanks for the quick response! I'm currently on call until next
Tuesday. I'll get back to the PR as soon as I'm off duty. Thank you very much
for the detailed review!
 



Issue Time Tracking
---

Worklog Id: (was: 319068)
Time Spent: 11h 20m  (was: 11h 10m)

> Interactive Beam Caching PCollections bound to user defined vars in notebook
> 
>
> Key: BEAM-7760
> URL: https://issues.apache.org/jira/browse/BEAM-7760
> Project: Beam
>  Issue Type: New Feature
>  Components: examples-python
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time Spent: 11h 20m
>  Remaining Estimate: 0h
>
> Cache only PCollections bound to user defined variables in a pipeline when 
> running pipeline with interactive runner in jupyter notebooks.
> [Interactive 
> Beam|[https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive]]
>  has been caching and using caches of "leaf" PCollections for interactive 
> execution in jupyter notebooks.
> The interactive execution is currently supported so that when appending new 
> transforms to existing pipeline for a new run, executed part of the pipeline 
> doesn't need to be re-executed. 
> A PCollection is "leaf" when it is never used as input in any PTransform in 
> the pipeline.
> The problem with building caches and pipeline to execute around "leaf" is 
> that when a PCollection is consumed by a sink with no output, the pipeline to 
> execute built will miss the subgraph generating and consuming that 
> PCollection.
> For example, "ReadFromPubSub --> WriteToPubSub" will result in an empty 
> pipeline.
> Caching around PCollections bound to user defined variables and replacing 
> transforms with source and sink of caches could resolve the pipeline to 
> execute properly under the interactive execution scenario. Also, cached 
> PCollection now can trace back to user code and can be used for user data 
> visualization if user wants to do it.
> E.g.,
> {code:python}
> # ...
> p = beam.Pipeline(interactive_runner.InteractiveRunner(),
>                   options=pipeline_options)
> messages = p | "Read" >> beam.io.ReadFromPubSub(subscription='...')
> messages | "Write" >> beam.io.WriteToPubSub(topic_path)
> result = p.run()
> # ...
> visualize(messages){code}
>  The interactive runner automatically figures out that PCollection
> {code:python}
> messages{code}
> created by
> {code:python}
> p | "Read" >> beam.io.ReadFromPubSub(subscription='...'){code}
> should be cached and reused if the notebook user appends more transforms.
>  And once the pipeline gets executed, the user could use any 
> visualize(PCollection) module to visualize the data statically (batch) or 
> dynamically (stream)
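
The "leaf" definition quoted above can be illustrated with a toy model. This is only a sketch: the `Transform` tuple and `leaf_pcollections` helper are hypothetical names invented for illustration and are not part of Beam or of the PR. It shows why a read followed by a sink with no output leaves no leaf PCollection to build a cache around.

```python
from collections import namedtuple

# Hypothetical stand-in for a pipeline step; not a Beam class.
Transform = namedtuple("Transform", ["label", "inputs", "outputs"])


def leaf_pcollections(transforms):
  """Returns names of PCollections that are produced but never consumed."""
  produced = set()
  consumed = set()
  for transform in transforms:
    produced.update(transform.outputs)
    consumed.update(transform.inputs)
  return produced - consumed


# "Read --> Write": the sink consumes `messages` and outputs nothing.
read_write = [
    Transform("Read", inputs=(), outputs=("messages",)),
    Transform("Write", inputs=("messages",), outputs=()),
]

# "Read --> Upper": the last transform still produces an output.
read_map = [
    Transform("Read", inputs=(), outputs=("messages",)),
    Transform("Upper", inputs=("messages",), outputs=("upper",)),
]

no_leaves = leaf_pcollections(read_write)
one_leaf = leaf_pcollections(read_map)
print(sorted(no_leaves))
print(sorted(one_leaf))
```

In the first case the leaf set is empty, which matches the empty pipeline described in the issue; caching around variables such as `messages` avoids depending on leaves at all.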



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-09-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=318694=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-318694
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 26/Sep/19 01:22
Start Date: 26/Sep/19 01:22
Worklog Time Spent: 10m 
  Work Description: aaltay commented on pull request #9619: [BEAM-7760] 
Added pipeline_instrument module
URL: https://github.com/apache/beam/pull/9619#discussion_r328399385
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/pipeline_instrument.py
 ##
 @@ -0,0 +1,470 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module to instrument interactivity to the given pipeline.
+
+For internal use only; no backwards-compatibility guarantees.
+This module accesses current interactive environment and analyzes given pipeline
+to transform original pipeline into a one-shot pipeline with interactivity.
+"""
+from __future__ import absolute_import
+
+import logging
+
+import apache_beam as beam
+from apache_beam.pipeline import PipelineVisitor
+from apache_beam.runners.interactive import cache_manager as cache
+from apache_beam.runners.interactive import interactive_environment as ie
+
+READ_CACHE = "_ReadCache_"
+WRITE_CACHE = "_WriteCache_"
+
+
+class PipelineInstrument(object):
+  """A pipeline instrument for pipeline to be executed by interactive runner.
+
+  This module should never depend on underlying runner that interactive runner
+  delegates. It instruments the original instance of pipeline directly by
+  appending or replacing transforms with help of cache. It provides
+  interfaces to recover states of original pipeline. It's the interactive
+  runner's responsibility to coordinate supported underlying runners to run
+  the pipeline instrumented and recover the original pipeline states if needed.
+  """
+
+  def __init__(self, pipeline, options=None):
+    self._pipeline = pipeline
+    # The cache manager should be initiated outside of this module and outside
+    # of run_pipeline() from interactive runner so that its lifespan could cover
+    # multiple runs in the interactive environment. Owned by
+    # interactive_environment module. Not owned by this module.
+    # TODO(BEAM-7760): change the scope of cache to be owned by runner or
+    # pipeline result instances because a pipeline is not 1:1 correlated to a
+    # running job. Only complete and read-only cache is valid across multiple
+    # jobs. Other cache instances should have their own scopes. Some design
+    # change should support only runner.run(pipeline) pattern rather than
+    # pipeline.run([runner]) and a runner can only run at most one pipeline at a
+    # time. Otherwise, result returned by run() is the only 1:1 anchor.
+    self._cache_manager = ie.current_env().cache_manager()
+
+    # Invoke a round trip through the runner API. This makes sure the Pipeline
+    # proto is stable. The snapshot of pipeline will not be mutated within this
+    # module and can be used to recover original pipeline if needed.
+    self._pipeline_snap = beam.pipeline.Pipeline.from_runner_api(
+        pipeline.to_runner_api(use_fake_coders=True),
+        pipeline.runner,
+        options)
+    # Snapshot of original pipeline information.
+    (self._original_pipeline_proto,
+     self._original_context) = self._pipeline_snap.to_runner_api(
+         return_context=True, use_fake_coders=True)
+
+    # All compute-once-against-original-pipeline fields.
+    self._has_unbounded_source = has_unbounded_source(self._pipeline_snap)
+    # TODO(BEAM-7760): once cache scope changed, this is not needed to manage
+    # relationships across pipelines, runners, and jobs.
+    self._pcolls_to_pcoll_id = pcolls_to_pcoll_id(self._pipeline_snap,
+                                                  self._original_context)
+
+    # A mapping from PCollection id to python id() value in user defined
+    # pipeline instance.
+    (self._pcoll_version_map,
+     self._cacheables) = cacheables(self.pcolls_to_pcoll_id())
+
+    # A dict from cache key to PCollection that is read from cache.
+    # If 

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-09-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=318689=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-318689
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 26/Sep/19 01:22
Start Date: 26/Sep/19 01:22
Worklog Time Spent: 10m 
  Work Description: aaltay commented on pull request #9619: [BEAM-7760] 
Added pipeline_instrument module
URL: https://github.com/apache/beam/pull/9619#discussion_r328400374
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/pipeline_instrument.py
 ##
 @@ -0,0 +1,470 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module to instrument interactivity to the given pipeline.
+
+For internal use only; no backwards-compatibility guarantees.
+This module accesses current interactive environment and analyzes given pipeline
+to transform original pipeline into a one-shot pipeline with interactivity.
+"""
+from __future__ import absolute_import
+
+import logging
+
+import apache_beam as beam
+from apache_beam.pipeline import PipelineVisitor
+from apache_beam.runners.interactive import cache_manager as cache
+from apache_beam.runners.interactive import interactive_environment as ie
+
+READ_CACHE = "_ReadCache_"
+WRITE_CACHE = "_WriteCache_"
+
+
+class PipelineInstrument(object):
+  """A pipeline instrument for pipeline to be executed by interactive runner.
+
+  This module should never depend on underlying runner that interactive runner
+  delegates. It instruments the original instance of pipeline directly by
+  appending or replacing transforms with help of cache. It provides
+  interfaces to recover states of original pipeline. It's the interactive
+  runner's responsibility to coordinate supported underlying runners to run
+  the pipeline instrumented and recover the original pipeline states if needed.
+  """
+
+  def __init__(self, pipeline, options=None):
+    self._pipeline = pipeline
+    # The cache manager should be initiated outside of this module and outside
+    # of run_pipeline() from interactive runner so that its lifespan could cover
+    # multiple runs in the interactive environment. Owned by
+    # interactive_environment module. Not owned by this module.
+    # TODO(BEAM-7760): change the scope of cache to be owned by runner or
+    # pipeline result instances because a pipeline is not 1:1 correlated to a
+    # running job. Only complete and read-only cache is valid across multiple
+    # jobs. Other cache instances should have their own scopes. Some design
+    # change should support only runner.run(pipeline) pattern rather than
+    # pipeline.run([runner]) and a runner can only run at most one pipeline at a
+    # time. Otherwise, result returned by run() is the only 1:1 anchor.
+    self._cache_manager = ie.current_env().cache_manager()
+
+    # Invoke a round trip through the runner API. This makes sure the Pipeline
+    # proto is stable. The snapshot of pipeline will not be mutated within this
+    # module and can be used to recover original pipeline if needed.
+    self._pipeline_snap = beam.pipeline.Pipeline.from_runner_api(
+        pipeline.to_runner_api(use_fake_coders=True),
+        pipeline.runner,
+        options)
+    # Snapshot of original pipeline information.
+    (self._original_pipeline_proto,
+     self._original_context) = self._pipeline_snap.to_runner_api(
+         return_context=True, use_fake_coders=True)
+
+    # All compute-once-against-original-pipeline fields.
+    self._has_unbounded_source = has_unbounded_source(self._pipeline_snap)
+    # TODO(BEAM-7760): once cache scope changed, this is not needed to manage
+    # relationships across pipelines, runners, and jobs.
+    self._pcolls_to_pcoll_id = pcolls_to_pcoll_id(self._pipeline_snap,
+                                                  self._original_context)
+
+    # A mapping from PCollection id to python id() value in user defined
+    # pipeline instance.
+    (self._pcoll_version_map,
+     self._cacheables) = cacheables(self.pcolls_to_pcoll_id())
+
+    # A dict from cache key to PCollection that is read from cache.
+    # If 

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-09-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=318698=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-318698
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 26/Sep/19 01:22
Start Date: 26/Sep/19 01:22
Worklog Time Spent: 10m 
  Work Description: aaltay commented on pull request #9619: [BEAM-7760] 
Added pipeline_instrument module
URL: https://github.com/apache/beam/pull/9619#discussion_r328399807
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/pipeline_instrument.py
 ##
 @@ -0,0 +1,470 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module to instrument interactivity to the given pipeline.
+
+For internal use only; no backwards-compatibility guarantees.
+This module accesses current interactive environment and analyzes given pipeline
+to transform original pipeline into a one-shot pipeline with interactivity.
+"""
+from __future__ import absolute_import
+
+import logging
+
+import apache_beam as beam
+from apache_beam.pipeline import PipelineVisitor
+from apache_beam.runners.interactive import cache_manager as cache
+from apache_beam.runners.interactive import interactive_environment as ie
+
+READ_CACHE = "_ReadCache_"
+WRITE_CACHE = "_WriteCache_"
+
+
+class PipelineInstrument(object):
+  """A pipeline instrument for pipeline to be executed by interactive runner.
+
+  This module should never depend on underlying runner that interactive runner
+  delegates. It instruments the original instance of pipeline directly by
+  appending or replacing transforms with help of cache. It provides
+  interfaces to recover states of original pipeline. It's the interactive
+  runner's responsibility to coordinate supported underlying runners to run
+  the pipeline instrumented and recover the original pipeline states if needed.
+  """
+
+  def __init__(self, pipeline, options=None):
+    self._pipeline = pipeline
+    # The cache manager should be initiated outside of this module and outside
+    # of run_pipeline() from interactive runner so that its lifespan could cover
+    # multiple runs in the interactive environment. Owned by
+    # interactive_environment module. Not owned by this module.
+    # TODO(BEAM-7760): change the scope of cache to be owned by runner or
+    # pipeline result instances because a pipeline is not 1:1 correlated to a
+    # running job. Only complete and read-only cache is valid across multiple
+    # jobs. Other cache instances should have their own scopes. Some design
+    # change should support only runner.run(pipeline) pattern rather than
+    # pipeline.run([runner]) and a runner can only run at most one pipeline at a
+    # time. Otherwise, result returned by run() is the only 1:1 anchor.
+    self._cache_manager = ie.current_env().cache_manager()
+
+    # Invoke a round trip through the runner API. This makes sure the Pipeline
+    # proto is stable. The snapshot of pipeline will not be mutated within this
+    # module and can be used to recover original pipeline if needed.
+    self._pipeline_snap = beam.pipeline.Pipeline.from_runner_api(
+        pipeline.to_runner_api(use_fake_coders=True),
+        pipeline.runner,
+        options)
+    # Snapshot of original pipeline information.
+    (self._original_pipeline_proto,
+     self._original_context) = self._pipeline_snap.to_runner_api(
+         return_context=True, use_fake_coders=True)
+
+    # All compute-once-against-original-pipeline fields.
+    self._has_unbounded_source = has_unbounded_source(self._pipeline_snap)
+    # TODO(BEAM-7760): once cache scope changed, this is not needed to manage
+    # relationships across pipelines, runners, and jobs.
+    self._pcolls_to_pcoll_id = pcolls_to_pcoll_id(self._pipeline_snap,
+                                                  self._original_context)
+
+    # A mapping from PCollection id to python id() value in user defined
+    # pipeline instance.
+    (self._pcoll_version_map,
+     self._cacheables) = cacheables(self.pcolls_to_pcoll_id())
+
+    # A dict from cache key to PCollection that is read from cache.
+    # If 

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-09-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=318692=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-318692
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 26/Sep/19 01:22
Start Date: 26/Sep/19 01:22
Worklog Time Spent: 10m 
  Work Description: aaltay commented on pull request #9619: [BEAM-7760] 
Added pipeline_instrument module
URL: https://github.com/apache/beam/pull/9619#discussion_r328400222
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/pipeline_instrument.py
 ##
 @@ -0,0 +1,470 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module to instrument interactivity to the given pipeline.
+
+For internal use only; no backwards-compatibility guarantees.
+This module accesses current interactive environment and analyzes given pipeline
+to transform original pipeline into a one-shot pipeline with interactivity.
+"""
+from __future__ import absolute_import
+
+import logging
+
+import apache_beam as beam
+from apache_beam.pipeline import PipelineVisitor
+from apache_beam.runners.interactive import cache_manager as cache
+from apache_beam.runners.interactive import interactive_environment as ie
+
+READ_CACHE = "_ReadCache_"
+WRITE_CACHE = "_WriteCache_"
+
+
+class PipelineInstrument(object):
+  """A pipeline instrument for pipeline to be executed by interactive runner.
+
+  This module should never depend on underlying runner that interactive runner
+  delegates. It instruments the original instance of pipeline directly by
+  appending or replacing transforms with help of cache. It provides
+  interfaces to recover states of original pipeline. It's the interactive
+  runner's responsibility to coordinate supported underlying runners to run
+  the pipeline instrumented and recover the original pipeline states if needed.
+  """
+
+  def __init__(self, pipeline, options=None):
+    self._pipeline = pipeline
+    # The cache manager should be initiated outside of this module and outside
+    # of run_pipeline() from interactive runner so that its lifespan could cover
+    # multiple runs in the interactive environment. Owned by
+    # interactive_environment module. Not owned by this module.
+    # TODO(BEAM-7760): change the scope of cache to be owned by runner or
+    # pipeline result instances because a pipeline is not 1:1 correlated to a
+    # running job. Only complete and read-only cache is valid across multiple
+    # jobs. Other cache instances should have their own scopes. Some design
+    # change should support only runner.run(pipeline) pattern rather than
+    # pipeline.run([runner]) and a runner can only run at most one pipeline at a
+    # time. Otherwise, result returned by run() is the only 1:1 anchor.
+    self._cache_manager = ie.current_env().cache_manager()
+
+    # Invoke a round trip through the runner API. This makes sure the Pipeline
+    # proto is stable. The snapshot of pipeline will not be mutated within this
+    # module and can be used to recover original pipeline if needed.
+    self._pipeline_snap = beam.pipeline.Pipeline.from_runner_api(
+        pipeline.to_runner_api(use_fake_coders=True),
+        pipeline.runner,
+        options)
+    # Snapshot of original pipeline information.
+    (self._original_pipeline_proto,
+     self._original_context) = self._pipeline_snap.to_runner_api(
+         return_context=True, use_fake_coders=True)
+
+    # All compute-once-against-original-pipeline fields.
+    self._has_unbounded_source = has_unbounded_source(self._pipeline_snap)
+    # TODO(BEAM-7760): once cache scope changed, this is not needed to manage
+    # relationships across pipelines, runners, and jobs.
+    self._pcolls_to_pcoll_id = pcolls_to_pcoll_id(self._pipeline_snap,
+                                                  self._original_context)
+
+    # A mapping from PCollection id to python id() value in user defined
+    # pipeline instance.
+    (self._pcoll_version_map,
+     self._cacheables) = cacheables(self.pcolls_to_pcoll_id())
+
+    # A dict from cache key to PCollection that is read from cache.
+    # If 

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-09-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=318690=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-318690
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 26/Sep/19 01:22
Start Date: 26/Sep/19 01:22
Worklog Time Spent: 10m 
  Work Description: aaltay commented on pull request #9619: [BEAM-7760] 
Added pipeline_instrument module
URL: https://github.com/apache/beam/pull/9619#discussion_r328398936
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/pipeline_instrument.py
 ##
 @@ -0,0 +1,470 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module to instrument interactivity to the given pipeline.
+
+For internal use only; no backwards-compatibility guarantees.
+This module accesses current interactive environment and analyzes given pipeline
+to transform original pipeline into a one-shot pipeline with interactivity.
+"""
+from __future__ import absolute_import
+
+import logging
+
+import apache_beam as beam
+from apache_beam.pipeline import PipelineVisitor
+from apache_beam.runners.interactive import cache_manager as cache
+from apache_beam.runners.interactive import interactive_environment as ie
+
+READ_CACHE = "_ReadCache_"
+WRITE_CACHE = "_WriteCache_"
+
+
+class PipelineInstrument(object):
+  """A pipeline instrument for pipeline to be executed by interactive runner.
+
+  This module should never depend on underlying runner that interactive runner
+  delegates. It instruments the original instance of pipeline directly by
+  appending or replacing transforms with help of cache. It provides
+  interfaces to recover states of original pipeline. It's the interactive
+  runner's responsibility to coordinate supported underlying runners to run
+  the pipeline instrumented and recover the original pipeline states if needed.
+  """
+
+  def __init__(self, pipeline, options=None):
+    self._pipeline = pipeline
+    # The cache manager should be initiated outside of this module and outside
+    # of run_pipeline() from interactive runner so that its lifespan could cover
+    # multiple runs in the interactive environment. Owned by
+    # interactive_environment module. Not owned by this module.
+    # TODO(BEAM-7760): change the scope of cache to be owned by runner or
+    # pipeline result instances because a pipeline is not 1:1 correlated to a
+    # running job. Only complete and read-only cache is valid across multiple
+    # jobs. Other cache instances should have their own scopes. Some design
+    # change should support only runner.run(pipeline) pattern rather than
+    # pipeline.run([runner]) and a runner can only run at most one pipeline at a
+    # time. Otherwise, result returned by run() is the only 1:1 anchor.
+    self._cache_manager = ie.current_env().cache_manager()
 
 Review comment:
   Curious, why not pass this information through the constructor?
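
One way to read the reviewer's question is as a suggestion for constructor injection. The sketch below is an assumption about what that could look like, not the PR's code: `CacheManager`, `Environment`, `current_env` and `Instrument` are hypothetical stand-ins written for illustration, and the `cache_manager` keyword argument does not appear in the diff above.

```python
class CacheManager(object):
  """Hypothetical stand-in for the interactive cache manager."""


class Environment(object):
  """Hypothetical stand-in for the interactive environment singleton."""

  def __init__(self):
    self._cache_manager = CacheManager()

  def cache_manager(self):
    return self._cache_manager


_ENV = Environment()


def current_env():
  """Returns the shared environment, mimicking ie.current_env()."""
  return _ENV


class Instrument(object):
  """Toy instrument that accepts an optional, injected cache manager."""

  def __init__(self, pipeline, options=None, cache_manager=None):
    self._pipeline = pipeline
    self._options = options
    # Prefer the injected dependency; fall back to the shared environment
    # so existing callers keep working without passing anything.
    if cache_manager is None:
      cache_manager = current_env().cache_manager()
    self._cache_manager = cache_manager


default = Instrument(pipeline="p")
injected_cm = CacheManager()
injected = Instrument(pipeline="p", cache_manager=injected_cm)
```

Injection makes the dependency explicit and easier to replace in tests, while the fallback keeps the cache's lifespan tied to the environment as the code comment in the diff requires.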
 



Issue Time Tracking
---

Worklog Id: (was: 318690)
Time Spent: 10h  (was: 9h 50m)


[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-09-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=318695=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-318695
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 26/Sep/19 01:22
Start Date: 26/Sep/19 01:22
Worklog Time Spent: 10m 
  Work Description: aaltay commented on pull request #9619: [BEAM-7760] 
Added pipeline_instrument module
URL: https://github.com/apache/beam/pull/9619#discussion_r328401080
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/pipeline_instrument.py
 ##
 @@ -0,0 +1,470 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module to instrument interactivity to the given pipeline.
+
+For internal use only; no backwards-compatibility guarantees.
+This module accesses current interactive environment and analyzes given pipeline
+to transform original pipeline into a one-shot pipeline with interactivity.
+"""
+from __future__ import absolute_import
+
+import logging
+
+import apache_beam as beam
+from apache_beam.pipeline import PipelineVisitor
+from apache_beam.runners.interactive import cache_manager as cache
+from apache_beam.runners.interactive import interactive_environment as ie
+
+READ_CACHE = "_ReadCache_"
+WRITE_CACHE = "_WriteCache_"
+
+
+class PipelineInstrument(object):
+  """A pipeline instrument for pipeline to be executed by interactive runner.
+
+  This module should never depend on underlying runner that interactive runner
+  delegates. It instruments the original instance of pipeline directly by
+  appending or replacing transforms with help of cache. It provides
+  interfaces to recover states of original pipeline. It's the interactive
+  runner's responsibility to coordinate supported underlying runners to run
+  the pipeline instrumented and recover the original pipeline states if needed.
+  """
+
+  def __init__(self, pipeline, options=None):
+    self._pipeline = pipeline
+    # The cache manager should be initiated outside of this module and outside
+    # of run_pipeline() from interactive runner so that its lifespan could cover
+    # multiple runs in the interactive environment. Owned by
+    # interactive_environment module. Not owned by this module.
+    # TODO(BEAM-7760): change the scope of cache to be owned by runner or
+    # pipeline result instances because a pipeline is not 1:1 correlated to a
+    # running job. Only complete and read-only cache is valid across multiple
+    # jobs. Other cache instances should have their own scopes. Some design
+    # change should support only runner.run(pipeline) pattern rather than
+    # pipeline.run([runner]) and a runner can only run at most one pipeline at a
+    # time. Otherwise, result returned by run() is the only 1:1 anchor.
+    self._cache_manager = ie.current_env().cache_manager()
+
+    # Invoke a round trip through the runner API. This makes sure the Pipeline
+    # proto is stable. The snapshot of pipeline will not be mutated within this
+    # module and can be used to recover original pipeline if needed.
+    self._pipeline_snap = beam.pipeline.Pipeline.from_runner_api(
+        pipeline.to_runner_api(use_fake_coders=True),
+        pipeline.runner,
+        options)
+    # Snapshot of original pipeline information.
+    (self._original_pipeline_proto,
+     self._original_context) = self._pipeline_snap.to_runner_api(
+         return_context=True, use_fake_coders=True)
+
+    # All compute-once-against-original-pipeline fields.
+    self._has_unbounded_source = has_unbounded_source(self._pipeline_snap)
+    # TODO(BEAM-7760): once cache scope changed, this is not needed to manage
+    # relationships across pipelines, runners, and jobs.
+    self._pcolls_to_pcoll_id = pcolls_to_pcoll_id(self._pipeline_snap,
+                                                  self._original_context)
+
+    # A mapping from PCollection id to python id() value in user defined
+    # pipeline instance.
+    (self._pcoll_version_map,
+     self._cacheables) = cacheables(self.pcolls_to_pcoll_id())
+
+    # A dict from cache key to PCollection that is read from cache.
+    # If 

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-09-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=318696=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-318696
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 26/Sep/19 01:22
Start Date: 26/Sep/19 01:22
Worklog Time Spent: 10m 
  Work Description: aaltay commented on pull request #9619: [BEAM-7760] 
Added pipeline_instrument module
URL: https://github.com/apache/beam/pull/9619#discussion_r328399971
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/pipeline_instrument.py
 ##
 @@ -0,0 +1,470 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module to instrument interactivity to the given pipeline.
+
+For internal use only; no backwards-compatibility guarantees.
+This module accesses current interactive environment and analyzes given pipeline
+to transform original pipeline into a one-shot pipeline with interactivity.
+"""
+from __future__ import absolute_import
+
+import logging
+
+import apache_beam as beam
+from apache_beam.pipeline import PipelineVisitor
+from apache_beam.runners.interactive import cache_manager as cache
+from apache_beam.runners.interactive import interactive_environment as ie
+
+READ_CACHE = "_ReadCache_"
+WRITE_CACHE = "_WriteCache_"
+
+
+class PipelineInstrument(object):
+  """A pipeline instrument for pipeline to be executed by interactive runner.
+
+  This module should never depend on underlying runner that interactive runner
+  delegates. It instruments the original instance of pipeline directly by
+  appending or replacing transforms with help of cache. It provides
+  interfaces to recover states of original pipeline. It's the interactive
+  runner's responsibility to coordinate supported underlying runners to run
+  the pipeline instrumented and recover the original pipeline states if needed.
+  """
+
+  def __init__(self, pipeline, options=None):
+    self._pipeline = pipeline
+    # The cache manager should be initiated outside of this module and outside
+    # of run_pipeline() from interactive runner so that its lifespan could cover
+    # multiple runs in the interactive environment. Owned by
+    # interactive_environment module. Not owned by this module.
+    # TODO(BEAM-7760): change the scope of cache to be owned by runner or
+    # pipeline result instances because a pipeline is not 1:1 correlated to a
+    # running job. Only complete and read-only cache is valid across multiple
+    # jobs. Other cache instances should have their own scopes. Some design
+    # change should support only runner.run(pipeline) pattern rather than
+    # pipeline.run([runner]) and a runner can only run at most one pipeline at a
+    # time. Otherwise, result returned by run() is the only 1:1 anchor.
+    self._cache_manager = ie.current_env().cache_manager()
+
+    # Invoke a round trip through the runner API. This makes sure the Pipeline
+    # proto is stable. The snapshot of pipeline will not be mutated within this
+    # module and can be used to recover original pipeline if needed.
+    self._pipeline_snap = beam.pipeline.Pipeline.from_runner_api(
+        pipeline.to_runner_api(use_fake_coders=True),
+        pipeline.runner,
+        options)
+    # Snapshot of original pipeline information.
+    (self._original_pipeline_proto,
+     self._original_context) = self._pipeline_snap.to_runner_api(
+         return_context=True, use_fake_coders=True)
+
+    # All compute-once-against-original-pipeline fields.
+    self._has_unbounded_source = has_unbounded_source(self._pipeline_snap)
+    # TODO(BEAM-7760): once cache scope changed, this is not needed to manage
+    # relationships across pipelines, runners, and jobs.
+    self._pcolls_to_pcoll_id = pcolls_to_pcoll_id(self._pipeline_snap,
+                                                  self._original_context)
+
+    # A mapping from PCollection id to python id() value in user defined
+    # pipeline instance.
+    (self._pcoll_version_map,
+     self._cacheables) = cacheables(self.pcolls_to_pcoll_id())
+
+    # A dict from cache key to PCollection that is read from cache.
+    # If 

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-09-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=318687&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-318687
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 26/Sep/19 01:22
Start Date: 26/Sep/19 01:22
Worklog Time Spent: 10m 
  Work Description: aaltay commented on pull request #9619: [BEAM-7760] 
Added pipeline_instrument module
URL: https://github.com/apache/beam/pull/9619#discussion_r328399145
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/pipeline_instrument.py
 ##
 @@ -0,0 +1,470 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module to instrument interactivity to the given pipeline.
+
+For internal use only; no backwards-compatibility guarantees.
+This module accesses current interactive environment and analyzes given pipeline
+to transform original pipeline into a one-shot pipeline with interactivity.
+"""
+from __future__ import absolute_import
+
+import logging
+
+import apache_beam as beam
+from apache_beam.pipeline import PipelineVisitor
+from apache_beam.runners.interactive import cache_manager as cache
+from apache_beam.runners.interactive import interactive_environment as ie
+
+READ_CACHE = "_ReadCache_"
+WRITE_CACHE = "_WriteCache_"
+
+
+class PipelineInstrument(object):
+  """A pipeline instrument for pipeline to be executed by interactive runner.
+
+  This module should never depend on underlying runner that interactive runner
+  delegates. It instruments the original instance of pipeline directly by
+  appending or replacing transforms with help of cache. It provides
+  interfaces to recover states of original pipeline. It's the interactive
+  runner's responsibility to coordinate supported underlying runners to run
+  the pipeline instrumented and recover the original pipeline states if needed.
+  """
+
+  def __init__(self, pipeline, options=None):
+    self._pipeline = pipeline
+    # The cache manager should be initiated outside of this module and outside
+    # of run_pipeline() from interactive runner so that its lifespan could cover
+    # multiple runs in the interactive environment. Owned by
+    # interactive_environment module. Not owned by this module.
+    # TODO(BEAM-7760): change the scope of cache to be owned by runner or
+    # pipeline result instances because a pipeline is not 1:1 correlated to a
+    # running job. Only complete and read-only cache is valid across multiple
+    # jobs. Other cache instances should have their own scopes. Some design
+    # change should support only runner.run(pipeline) pattern rather than
+    # pipeline.run([runner]) and a runner can only run at most one pipeline at a
+    # time. Otherwise, result returned by run() is the only 1:1 anchor.
+    self._cache_manager = ie.current_env().cache_manager()
+
+    # Invoke a round trip through the runner API. This makes sure the Pipeline
+    # proto is stable. The snapshot of pipeline will not be mutated within this
+    # module and can be used to recover original pipeline if needed.
+    self._pipeline_snap = beam.pipeline.Pipeline.from_runner_api(
+        pipeline.to_runner_api(use_fake_coders=True),
+        pipeline.runner,
+        options)
+    # Snapshot of original pipeline information.
+    (self._original_pipeline_proto,
+     self._original_context) = self._pipeline_snap.to_runner_api(
 
 Review comment:
   Would not the original be the result of `pipeline.to_runner_api(use_fake_coders=True)`?
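
   To make the question concrete, here is a toy sketch in plain Python. It is
   not Beam code: `ToyPipeline`, `to_proto` and `from_proto` are hypothetical
   stand-ins for `Pipeline`, `to_runner_api` and `from_runner_api`. It only
   illustrates the case where serialization is deterministic, in which the
   first proto and the re-serialized snapshot coincide, while the snapshot
   stays independent of later changes to the original. Whether the same holds
   for the real runner API is what the comment asks and is not settled here.

```python
class ToyPipeline(object):
  """Hypothetical stand-in for a pipeline: an ordered list of labels."""

  def __init__(self, labels=None):
    self.labels = list(labels or [])

  def apply(self, label):
    self.labels.append(label)
    return self

  def to_proto(self):
    # A tuple plays the role of an immutable serialized form.
    return tuple(self.labels)

  @classmethod
  def from_proto(cls, proto):
    return cls(proto)


pipeline = ToyPipeline().apply("Read").apply("Write")

# Round trip: serialize, rebuild a snapshot, serialize the snapshot again.
first_proto = pipeline.to_proto()
snapshot = ToyPipeline.from_proto(first_proto)
second_proto = snapshot.to_proto()

# Instrumenting the original instance does not touch the snapshot.
pipeline.apply("_WriteCache_")

protos_match = first_proto == second_proto
snapshot_unchanged = snapshot.to_proto() == first_proto
original_changed = pipeline.to_proto() != first_proto
print(protos_match, snapshot_unchanged, original_changed)
```

   In this toy the snapshot's proto and the first proto are interchangeable,
   so keeping either one would be enough to recover the original pipeline.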
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 318687)
Time Spent: 9.5h  (was: 9h 20m)

> Interactive Beam Caching PCollections bound to user defined vars in notebook
> 
>
> Key: BEAM-7760
>  

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-09-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=318688&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-318688
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 26/Sep/19 01:22
Start Date: 26/Sep/19 01:22
Worklog Time Spent: 10m 
  Work Description: aaltay commented on pull request #9619: [BEAM-7760] 
Added pipeline_instrument module
URL: https://github.com/apache/beam/pull/9619#discussion_r328400449
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/pipeline_instrument.py
 ##
 @@ -0,0 +1,470 @@

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-09-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=318697&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-318697
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 26/Sep/19 01:22
Start Date: 26/Sep/19 01:22
Worklog Time Spent: 10m 
  Work Description: aaltay commented on pull request #9619: [BEAM-7760] 
Added pipeline_instrument module
URL: https://github.com/apache/beam/pull/9619#discussion_r328401171
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/pipeline_instrument.py
 ##
 @@ -0,0 +1,470 @@

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-09-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=318693&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-318693
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 26/Sep/19 01:22
Start Date: 26/Sep/19 01:22
Worklog Time Spent: 10m 
  Work Description: aaltay commented on pull request #9619: [BEAM-7760] 
Added pipeline_instrument module
URL: https://github.com/apache/beam/pull/9619#discussion_r328400678
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/pipeline_instrument.py
 ##
 @@ -0,0 +1,470 @@

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-09-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=318691&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-318691
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 26/Sep/19 01:22
Start Date: 26/Sep/19 01:22
Worklog Time Spent: 10m 
  Work Description: aaltay commented on pull request #9619: [BEAM-7760] 
Added pipeline_instrument module
URL: https://github.com/apache/beam/pull/9619#discussion_r328401245
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/pipeline_instrument.py
 ##
 @@ -0,0 +1,470 @@

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-09-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=315981&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-315981
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 20/Sep/19 22:50
Start Date: 20/Sep/19 22:50
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on issue #9619: [BEAM-7760] Added 
pipeline_instrument module
URL: https://github.com/apache/beam/pull/9619#issuecomment-533735184
 
 
   R:@aaltay
   PTAL
   I'll fix the PyLint checks in the PreCommit. Many warning-level (exit code 4) 
lint reports, related to Python 2 and to Beam pipeline definitions in the unit 
test code, are failing some Gradle tasks.
 



Issue Time Tracking
---

Worklog Id: (was: 315981)
Time Spent: 9h 20m  (was: 9h 10m)

> Interactive Beam Caching PCollections bound to user defined vars in notebook
> 
>
> Key: BEAM-7760
> URL: https://issues.apache.org/jira/browse/BEAM-7760
> Project: Beam
>  Issue Type: New Feature
>  Components: examples-python
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time Spent: 9h 20m
>  Remaining Estimate: 0h
>
> Cache only PCollections bound to user-defined variables in a pipeline when 
> running the pipeline with the interactive runner in Jupyter notebooks.
> [Interactive 
> Beam|https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive]
>  has been caching and using caches of "leaf" PCollections for interactive 
> execution in Jupyter notebooks.
> Interactive execution is currently supported so that, when new transforms are 
> appended to an existing pipeline for a new run, the already-executed part of 
> the pipeline doesn't need to be re-executed. 
> A PCollection is a "leaf" when it is never used as input to any PTransform in 
> the pipeline.
> The problem with building caches and the pipeline to execute around "leaf" 
> PCollections is that when a PCollection is consumed by a sink with no output, 
> the pipeline built for execution will miss the subgraph generating and 
> consuming that PCollection.
> For example, "ReadFromPubSub --> WriteToPubSub" will result in an empty 
> pipeline.
> Caching around PCollections bound to user-defined variables and replacing 
> transforms with sources and sinks of caches could build the pipeline to 
> execute properly under the interactive execution scenario. Also, a cached 
> PCollection can now be traced back to user code and can be used for data 
> visualization if the user wants it.
> E.g.,
> {code:python}
> # ...
> p = beam.Pipeline(interactive_runner.InteractiveRunner(),
>                   options=pipeline_options)
> messages = p | "Read" >> beam.io.ReadFromPubSub(subscription='...')
> messages | "Write" >> beam.io.WriteToPubSub(topic_path)
> result = p.run()
> # ...
> visualize(messages){code}
>  The interactive runner automatically figures out that the PCollection
> {code:python}
> messages{code}
> created by
> {code:python}
> p | "Read" >> beam.io.ReadFromPubSub(subscription='...'){code}
> should be cached and reused if the notebook user appends more transforms.
>  Once the pipeline has been executed, the user can use any 
> visualize(PCollection) module to visualize the data statically (batch) or 
> dynamically (stream).
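
The "leaf" problem described in the issue can be sketched with a toy graph.
This is a minimal illustration in plain Python, not the Beam implementation:
`Transform`, `leaf_pcollections` and `variable_bound_pcollections` are
hypothetical names, and PCollections are represented by plain strings. It
shows that a "Read --> Write" pipeline has no leaf PCollection, while the
PCollection bound to the user variable `messages` is still found.

```python
from collections import namedtuple

# Hypothetical, simplified stand-in for a pipeline graph: each transform has
# a label, the PCollections it consumes, and the PCollections it produces.
Transform = namedtuple("Transform", ["label", "inputs", "outputs"])


def leaf_pcollections(transforms):
  """Returns PCollections that no transform in the pipeline consumes."""
  produced = {out for t in transforms for out in t.outputs}
  consumed = {inp for t in transforms for inp in t.inputs}
  return produced - consumed


def variable_bound_pcollections(transforms, user_vars):
  """Returns {variable name: PCollection} for PCollections in the pipeline."""
  produced = {out for t in transforms for out in t.outputs}
  return {name: pcoll for name, pcoll in user_vars.items()
          if pcoll in produced}


# "ReadFromPubSub --> WriteToPubSub": the sink consumes the only PCollection
# and produces no output.
read_write = [
    Transform("Read", inputs=[], outputs=["pcoll_1"]),
    Transform("Write", inputs=["pcoll_1"], outputs=[]),
]
# Stand-in for the notebook's user-defined variables.
user_vars = {"messages": "pcoll_1"}

leaves = leaf_pcollections(read_write)
cacheable = variable_bound_pcollections(read_write, user_vars)
print(leaves)
print(cacheable)
```

With leaf-based caching there is nothing to anchor the pipeline on in this
case, whereas the variable-based lookup still returns `messages`.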



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-09-19 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=315332&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-315332
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 19/Sep/19 22:31
Start Date: 19/Sep/19 22:31
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on issue #9619: [BEAM-7760] Added 
pipeline_instrument module
URL: https://github.com/apache/beam/pull/9619#issuecomment-53028
 
 
   retest this please
 



Issue Time Tracking
---

Worklog Id: (was: 315332)
Time Spent: 9h 10m  (was: 9h)

> Interactive Beam Caching PCollections bound to user defined vars in notebook
> 
>
> Key: BEAM-7760
> URL: https://issues.apache.org/jira/browse/BEAM-7760
> Project: Beam
>  Issue Type: New Feature
>  Components: examples-python
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time Spent: 9h 10m
>  Remaining Estimate: 0h
>
> Cache only PCollections bound to user-defined variables in a pipeline when
> running the pipeline with the interactive runner in Jupyter notebooks.
> [Interactive 
> Beam|[https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive]]
>  has been caching and reusing caches of "leaf" PCollections for interactive
> execution in Jupyter notebooks.
> Interactive execution is currently supported so that, when new transforms
> are appended to an existing pipeline for a new run, the already executed
> part of the pipeline doesn't need to be re-executed.
> A PCollection is a "leaf" when it is never used as input to any PTransform in
> the pipeline.
> The problem with building caches and the pipeline to execute around "leaf"
> PCollections is that when a PCollection is consumed by a sink with no output,
> the pipeline built for execution will miss the subgraph generating and
> consuming that PCollection.
> For example, "ReadFromPubSub --> WriteToPubSub" will result in an empty
> pipeline.
> Caching around PCollections bound to user-defined variables and replacing
> transforms with sources and sinks of caches could build the pipeline to
> execute properly under the interactive execution scenario. Also, a cached
> PCollection can now be traced back to user code and can be used for user data
> visualization if the user wants to do it.
> E.g.,
> {code:java}
> # ...
> p = beam.Pipeline(interactive_runner.InteractiveRunner(),
>                   options=pipeline_options)
> messages = p | "Read" >> beam.io.ReadFromPubSub(subscription='...')
> messages | "Write" >> beam.io.WriteToPubSub(topic_path)
> result = p.run()
> # ...
> visualize(messages){code}
>  The interactive runner automatically figures out that the PCollection
> {code:java}
> messages{code}
> created by
> {code:java}
> p | "Read" >> beam.io.ReadFromPubSub(subscription='...'){code}
> should be cached and reused if the notebook user appends more transforms.
>  Once the pipeline has been executed, the user can use any
> visualize(PCollection) module to visualize the data statically (batch) or
> dynamically (streaming).
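
The difference between leaf-based caching and caching by user-defined variables can be illustrated with a small stand-alone sketch. It does not use Beam: the `Transform` tuple, the `leaf_pcollections` helper, and the `user_vars` mapping are hypothetical names introduced only for illustration, and the real interactive runner may compute these sets differently.

```python
from collections import namedtuple

# Hypothetical stand-in for a PTransform: a label plus the names of the
# PCollections it consumes and produces. This is not a Beam API.
Transform = namedtuple("Transform", ["label", "inputs", "outputs"])


def leaf_pcollections(transforms):
    """Return the PCollections that no transform consumes as input."""
    produced = {pc for t in transforms for pc in t.outputs}
    consumed = {pc for t in transforms for pc in t.inputs}
    return produced - consumed


# "ReadFromPubSub --> WriteToPubSub": the sink has no output.
pipeline = [
    Transform("Read", inputs=(), outputs=("messages",)),
    Transform("Write", inputs=("messages",), outputs=()),
]

# Leaf-based selection finds nothing to cache, because the only
# PCollection is consumed by the sink.
leaves = leaf_pcollections(pipeline)

# Selection by user-defined variables still finds `messages`.
user_vars = {"messages": "messages"}  # variable name -> PCollection name
produced = {pc for t in pipeline for pc in t.outputs}
cacheable = {pc for pc in user_vars.values() if pc in produced}

print(sorted(leaves))     # []
print(sorted(cacheable))  # ['messages']
```

In this toy graph the leaf set is empty, which mirrors the "empty pipeline" case from the description, while the variable-bound set still contains `messages`.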





[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-09-19 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=315312=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-315312
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 19/Sep/19 21:19
Start Date: 19/Sep/19 21:19
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9619: [BEAM-7760] 
Added pipeline_instrument module
URL: https://github.com/apache/beam/pull/9619
 
 
   1. Added the pipeline_instrument module, which automatically instruments
   a given Interactive Beam pipeline by mutating it with additional
   cache-based PTransforms where available, so that within an interactive
   environment each pipeline run can affect future runs and provide an
   interactive experience when executing Beam pipelines.
   2. The pipeline_instrument module will replace the pipeline_analyzer
   module once the integration with the rewritten display module interfaces
   is done, since the interactivity instrumentation (i.e., the parameters
   passed and used) has changed. Most of the display logic won't change.
   3. An optional pruning step is marked as TODO: once implemented, any
   subgraph that doesn't generate new state should not be re-executed when
   an instrumented pipeline is run.
   4. Tests included.
   5. The philosophy is to keep the pipeline instance defined by user code
   intact and mutate only a copied pipeline instance; to always convert the
   instrumented pipeline to a portable pipeline and pass it to the runner
   for execution; and to maintain the mapping from the original user-defined
   pipeline to the instrumented copies and the jobs executed by runners.
   6. Additional complexity arises when there are multiple pipeline
   instances defined in user code, multiple runners instantiated, and
   multiple jobs running from those pipeline instances on those runner
   instances. Currently, the only guarantee is that a pipeline result
   bound to a job must be the return value of a run by a runner and must
   originate from a pipeline instance. Some design change proposals around
   cache scoping are marked as TODO to solve the context problem in an
   interactive environment: when and what to instrument when this
   additional complexity arises.
   
   
   Thank you for your contribution! Follow this checklist to help us 
incorporate your contribution quickly and easily:
   
- [] [**Choose 
reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and 
mention them in a comment (`R: @username`).
- [x] Format the pull request title like `[BEAM-XXX] Fixes bug in 
ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA 
issue, if applicable. This will automatically link the pull request to the 
issue.
- [x] If this contribution is large, please file an Apache [Individual 
Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
   

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-09-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=314758=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-314758
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 19/Sep/19 01:16
Start Date: 19/Sep/19 01:16
Worklog Time Spent: 10m 
  Work Description: aaltay commented on pull request #9278: [BEAM-7760] 
Added Interactive Beam module
URL: https://github.com/apache/beam/pull/9278
 
 
   
 



Issue Time Tracking
---

Worklog Id: (was: 314758)
Time Spent: 8h 50m  (was: 8h 40m)

> Interactive Beam Caching PCollections bound to user defined vars in notebook
> 
>
> Key: BEAM-7760
> URL: https://issues.apache.org/jira/browse/BEAM-7760
> Project: Beam
>  Issue Type: New Feature
>  Components: examples-python
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time Spent: 8h 50m
>  Remaining Estimate: 0h
>


[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-09-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=314059=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-314059
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 17/Sep/19 23:47
Start Date: 17/Sep/19 23:47
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on issue #9278: [BEAM-7760] Added 
Interactive Beam module
URL: https://github.com/apache/beam/pull/9278#issuecomment-532443889
 
 
   > Can you check the test failures?
   
   Yes, added a blank line between the differently styled imports to pass
   the lint check. Squashed all commits and rebased against upstream.
 



Issue Time Tracking
---

Worklog Id: (was: 314059)
Time Spent: 8.5h  (was: 8h 20m)

> Interactive Beam Caching PCollections bound to user defined vars in notebook
> 
>
> Key: BEAM-7760
> URL: https://issues.apache.org/jira/browse/BEAM-7760
> Project: Beam
>  Issue Type: New Feature
>  Components: examples-python
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time Spent: 8.5h
>  Remaining Estimate: 0h
>


[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-09-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=313356=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-313356
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 17/Sep/19 00:04
Start Date: 17/Sep/19 00:04
Worklog Time Spent: 10m 
  Work Description: aaltay commented on issue #9278: [BEAM-7760] Added 
Interactive Beam module
URL: https://github.com/apache/beam/pull/9278#issuecomment-532003840
 
 
   Can you check the test failures?
 



Issue Time Tracking
---

Worklog Id: (was: 313356)
Time Spent: 8h 20m  (was: 8h 10m)

> Interactive Beam Caching PCollections bound to user defined vars in notebook
> 
>
> Key: BEAM-7760
> URL: https://issues.apache.org/jira/browse/BEAM-7760
> Project: Beam
>  Issue Type: New Feature
>  Components: examples-python
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time Spent: 8h 20m
>  Remaining Estimate: 0h
>


[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-09-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=313222=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-313222
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 16/Sep/19 18:01
Start Date: 16/Sep/19 18:01
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on issue #9278: [BEAM-7760] Added 
Interactive Beam module
URL: https://github.com/apache/beam/pull/9278#issuecomment-531889317
 
 
   Rebased against upstream. The doc and check fixes are squashed into a 2nd 
commit. Will squash all when ready to merge.
 



Issue Time Tracking
---

Worklog Id: (was: 313222)
Time Spent: 8h 10m  (was: 8h)

> Interactive Beam Caching PCollections bound to user defined vars in notebook
> 
>
> Key: BEAM-7760
> URL: https://issues.apache.org/jira/browse/BEAM-7760
> Project: Beam
>  Issue Type: New Feature
>  Components: examples-python
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time Spent: 8h 10m
>  Remaining Estimate: 0h
>


[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-09-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=312457=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-312457
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 14/Sep/19 00:40
Start Date: 14/Sep/19 00:40
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on issue #9278: [BEAM-7760] Added 
Interactive Beam module
URL: https://github.com/apache/beam/pull/9278#issuecomment-531430272
 
 
   Resolved the failed checks.
   
   - :sdks:python:test-suites:tox:py2:docs
   
   Example code blocks in Python docstrings need to conform to a specific
   format so that they can be parsed when generating documentation. Applied
   [Google docstring style](https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html).
   
   
   - :sdks:python:test-suites:tox:py2:lintPy27
   - :sdks:python:test-suites:tox:py2:lintPy27_3
   
   The watch() tests define local variables without using them directly, so
   I silenced the pylint warnings. The unused variables are intentional in
   these tests.
 



Issue Time Tracking
---

Worklog Id: (was: 312457)
Time Spent: 8h  (was: 7h 50m)

> Interactive Beam Caching PCollections bound to user defined vars in notebook
> 
>
> Key: BEAM-7760
> URL: https://issues.apache.org/jira/browse/BEAM-7760
> Project: Beam
>  Issue Type: New Feature
>  Components: examples-python
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time Spent: 8h
>  Remaining Estimate: 0h
>


[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-09-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=312441=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-312441
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 13/Sep/19 23:36
Start Date: 13/Sep/19 23:36
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9278: [BEAM-7760] 
Added Interactive Beam module
URL: https://github.com/apache/beam/pull/9278#discussion_r324398846
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/interactive_beam.py
 ##
 @@ -0,0 +1,88 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module of Interactive Beam features that can be used in notebook.
+
+The purpose of the module is to reduce the learning curve of Interactive Beam
+users, provide a single place for importing and add sugar syntax for all
+Interactive Beam components. It gives users capability to interact with existing
+environment/session/context for Interactive Beam and visualize PCollections as
+bounded dataset. In the meantime, it hides the interactivity implementation
+from users so that users can focus on developing Beam pipeline without worrying
+about how hidden states in the interactive session are managed.
+
+Note: Backward-compatibility of Interactive Beam is only guaranteed within this
+module. Please only invoke interfaces provided by this module in your notebook
+or application code if you want backward-compatibility.
+"""
+
+from apache_beam.runners.interactive import interactive_environment as ie
+
+
+def watch(watchable):
+  """Watches a watchable so that Interactive Beam understands your pipeline.
+
+  If you write Beam pipeline in a notebook or __main__ module directly, since
+  __main__ module is always watched by default, you don't have to instruct
+  Interactive Beam. However, if your Beam pipeline is defined in some module
+  other than __main__, e.g., inside a class function or a unit test, you can
 
 Review comment:
   Applied.
 



Issue Time Tracking
---

Worklog Id: (was: 312441)
Time Spent: 7h 50m  (was: 7h 40m)

> Interactive Beam Caching PCollections bound to user defined vars in notebook
> 
>
> Key: BEAM-7760
> URL: https://issues.apache.org/jira/browse/BEAM-7760
> Project: Beam
>  Issue Type: New Feature
>  Components: examples-python
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time Spent: 7h 50m
>  Remaining Estimate: 0h
>

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-09-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=312438=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-312438
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 13/Sep/19 23:26
Start Date: 13/Sep/19 23:26
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9278: [BEAM-7760] 
Added Interactive Beam module
URL: https://github.com/apache/beam/pull/9278#discussion_r324397584
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/interactive_beam.py
 ##
 @@ -0,0 +1,88 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module of Interactive Beam features that can be used in notebook.
+
+The purpose of the module is to reduce the learning curve of Interactive Beam
+users, provide a single place for importing and add sugar syntax for all
+Interactive Beam components. It gives users capability to interact with existing
+environment/session/context for Interactive Beam and visualize PCollections as
+bounded dataset. In the meantime, it hides the interactivity implementation
+from users so that users can focus on developing Beam pipeline without worrying
+about how hidden states in the interactive session are managed.
+
+Note: Backward-compatibility of Interactive Beam is only guaranteed within this
+module. Please only invoke interfaces provided by this module in your notebook
+or application code if you want backward-compatibility.
+"""
+
+from apache_beam.runners.interactive import interactive_environment as ie
+
+
+def watch(watchable):
+  """Watches a watchable so that Interactive Beam understands your pipeline.
+
+  If you write Beam pipeline in a notebook or __main__ module directly, since
+  __main__ module is always watched by default, you don't have to instruct
+  Interactive Beam. However, if your Beam pipeline is defined in some module
+  other than __main__, e.g., inside a class function or a unit test, you can
+  watch() the scope to instruct the whereabouts of your pipeline definition so
+  Interactive Beam could apply interactivity to your pipeline when running it.
+
+    For example:
+
+      class Foo(object):
+        def build_pipeline(self):
+          p = beam.Pipeline()
+          init_pcoll = p | 'Init Create' >> beam.Create(range(10))
+          watch(locals())
+          return p
+      Foo().build_pipeline().run()
+
+    Interactive Beam will cache init_pcoll for the first run. You can use:
+
+      visualize(init_pcoll)
 
 Review comment:
   I'll remove this part. It's not relevant to `watch()`. The interactivity 
happens when the user incrementally adds cells to develop the pipeline further 
from an existing pipeline object without re-executing previous cells. The 
pipeline run is optimized so that previously executed parts of the pipeline are 
not re-executed.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 312438)
Time Spent: 7.5h  (was: 7h 20m)

> Interactive Beam Caching PCollections bound to user defined vars in notebook
> 
>
> Key: BEAM-7760
> URL: https://issues.apache.org/jira/browse/BEAM-7760
> Project: Beam
>  Issue Type: New Feature
>  Components: examples-python
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time Spent: 7.5h
>  Remaining Estimate: 0h
>
> Cache only PCollections bound to user defined variables in a pipeline when 
> running pipeline with interactive runner in jupyter notebooks.
> [Interactive 
> Beam|[https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive]]
>  has been caching and using caches of "leaf" PCollections for interactive 
> execution in jupyter 

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-09-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=312437&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-312437
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 13/Sep/19 23:22
Start Date: 13/Sep/19 23:22
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9278: [BEAM-7760] 
Added Interactive Beam module
URL: https://github.com/apache/beam/pull/9278#discussion_r324397014
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/interactive_beam.py
 ##
 @@ -0,0 +1,88 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module of Interactive Beam features that can be used in notebook.
+
+The purpose of the module is to reduce the learning curve of Interactive Beam
+users, provide a single place for importing and add sugar syntax for all
+Interactive Beam components. It gives users capability to interact with existing
+environment/session/context for Interactive Beam and visualize PCollections as
+bounded dataset. In the meantime, it hides the interactivity implementation
+from users so that users can focus on developing Beam pipeline without worrying
+about how hidden states in the interactive session are managed.
+
+Note: Backward-compatibility of Interactive Beam is only guaranteed within this
+module. Please only invoke interfaces provided by this module in your notebook
+or application code if you want backward-compatibility.
+"""
+
+from apache_beam.runners.interactive import interactive_environment as ie
+
+
+def watch(watchable):
+  """Watches a watchable so that Interactive Beam understands your pipeline.
+
+  If you write Beam pipeline in a notebook or __main__ module directly, since
 
 Review comment:
   I'll remove "a notebook or " to avoid the confusion. It doesn't matter which 
notebook the user uses. I think Python users know where the __main__ module is, 
even if they are using a notebook product such as Jupyter.
 



Issue Time Tracking
---

Worklog Id: (was: 312437)
Time Spent: 7h 20m  (was: 7h 10m)

> Interactive Beam Caching PCollections bound to user defined vars in notebook
> 
>
> Key: BEAM-7760
> URL: https://issues.apache.org/jira/browse/BEAM-7760
> Project: Beam
>  Issue Type: New Feature
>  Components: examples-python
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time Spent: 7h 20m
>  Remaining Estimate: 0h
>
> Cache only PCollections bound to user defined variables in a pipeline when 
> running pipeline with interactive runner in jupyter notebooks.
> [Interactive 
> Beam|[https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive]]
>  has been caching and using caches of "leaf" PCollections for interactive 
> execution in jupyter notebooks.
> The interactive execution is currently supported so that when appending new 
> transforms to existing pipeline for a new run, executed part of the pipeline 
> doesn't need to be re-executed. 
> A PCollection is "leaf" when it is never used as input in any PTransform in 
> the pipeline.
> The problem with building caches and pipeline to execute around "leaf" is 
> that when a PCollection is consumed by a sink with no output, the pipeline to 
> execute built will miss the subgraph generating and consuming that 
> PCollection.
> An example, "ReadFromPubSub --> WriteToPubSub" will result in an empty 
> pipeline.
> Caching around PCollections bound to user defined variables and replacing 
> transforms with source and sink of caches could resolve the pipeline to 
> execute properly under the interactive execution scenario. 
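
The "leaf" problem described in the issue text above can be reproduced with a
toy graph. The sketch below is only an illustration under stated assumptions:
the Transform tuple, leaf_pcollections(), and the PCollection name 'messages'
are hypothetical and are not part of Beam's API.

```python
from collections import namedtuple

# Hypothetical stand-in for a pipeline step: a name plus the names of the
# PCollections it consumes and produces. This is not a Beam class.
Transform = namedtuple('Transform', ['name', 'inputs', 'outputs'])


def leaf_pcollections(transforms):
  """Returns PCollections that no transform in the graph takes as input."""
  produced = set()
  consumed = set()
  for t in transforms:
    produced.update(t.outputs)
    consumed.update(t.inputs)
  return produced - consumed


# ReadFromPubSub --> WriteToPubSub: the sink produces no output PCollection.
pubsub_graph = [
    Transform('ReadFromPubSub', inputs=[], outputs=['messages']),
    Transform('WriteToPubSub', inputs=['messages'], outputs=[]),
]

# 'messages' is consumed by the sink, so the graph has no leaf at all.
leaves = leaf_pcollections(pubsub_graph)

# Assumed alternative: cache what the user bound to a variable instead.
user_bound = {'messages'}
all_outputs = {name for t in pubsub_graph for name in t.outputs}
cache_targets = user_bound & all_outputs
```

With leaf-based caching there is nothing to build the pipeline around, while
caching by user-bound variable still identifies 'messages' as a cache target.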

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-09-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=312436&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-312436
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 13/Sep/19 23:19
Start Date: 13/Sep/19 23:19
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9278: [BEAM-7760] 
Added Interactive Beam module
URL: https://github.com/apache/beam/pull/9278#discussion_r324396546
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/interactive_beam.py
 ##
 @@ -0,0 +1,88 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module of Interactive Beam features that can be used in notebook.
+
+The purpose of the module is to reduce the learning curve of Interactive Beam
+users, provide a single place for importing and add sugar syntax for all
+Interactive Beam components. It gives users capability to interact with existing
+environment/session/context for Interactive Beam and visualize PCollections as
+bounded dataset. In the meantime, it hides the interactivity implementation
+from users so that users can focus on developing Beam pipeline without worrying
+about how hidden states in the interactive session are managed.
+
+Note: Backward-compatibility of Interactive Beam is only guaranteed within this
+module. Please only invoke interfaces provided by this module in your notebook
+or application code if you want backward-compatibility.
+"""
+
+from apache_beam.runners.interactive import interactive_environment as ie
+
+
+def watch(watchable):
+  """Watches a watchable so that Interactive Beam understands your pipeline.
+
+  If you write Beam pipeline in a notebook or __main__ module directly, since
+  __main__ module is always watched by default, you don't have to instruct
+  Interactive Beam. However, if your Beam pipeline is defined in some module
+  other than __main__, e.g., inside a class function or a unit test, you can
+  watch() the scope to instruct the whereabouts of your pipeline definition so
+  Interactive Beam could apply interactivity to your pipeline when running it.
+
+    For example:
+
+      class Foo(object):
+        def build_pipeline(self):
+          p = beam.Pipeline()
+          init_pcoll = p | 'Init Create' >> beam.Create(range(10))
+          watch(locals())
+          return p
+      Foo().build_pipeline().run()
+
+    Interactive Beam will cache init_pcoll for the first run. You can use:
 
 Review comment:
   Changed. And I will apply it in the future. Thanks!
 



Issue Time Tracking
---

Worklog Id: (was: 312436)
Time Spent: 7h 10m  (was: 7h)

> Interactive Beam Caching PCollections bound to user defined vars in notebook
> 
>
> Key: BEAM-7760
> URL: https://issues.apache.org/jira/browse/BEAM-7760
> Project: Beam
>  Issue Type: New Feature
>  Components: examples-python
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time Spent: 7h 10m
>  Remaining Estimate: 0h
>
> Cache only PCollections bound to user defined variables in a pipeline when 
> running pipeline with interactive runner in jupyter notebooks.
> [Interactive 
> Beam|[https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive]]
>  has been caching and using caches of "leaf" PCollections for interactive 
> execution in jupyter notebooks.
> The interactive execution is currently supported so that when appending new 
> transforms to existing pipeline for a new run, executed part of the pipeline 
> doesn't need to be re-executed. 
> A PCollection is "leaf" when it is never used as input in any PTransform in 
> the pipeline.

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-09-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=312435&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-312435
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 13/Sep/19 23:18
Start Date: 13/Sep/19 23:18
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9278: [BEAM-7760] 
Added Interactive Beam module
URL: https://github.com/apache/beam/pull/9278#discussion_r324396505
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/interactive_beam.py
 ##
 @@ -0,0 +1,88 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module of Interactive Beam features that can be used in notebook.
+
+The purpose of the module is to reduce the learning curve of Interactive Beam
+users, provide a single place for importing and add sugar syntax for all
+Interactive Beam components. It gives users capability to interact with existing
+environment/session/context for Interactive Beam and visualize PCollections as
+bounded dataset. In the meantime, it hides the interactivity implementation
+from users so that users can focus on developing Beam pipeline without worrying
+about how hidden states in the interactive session are managed.
+
+Note: Backward-compatibility of Interactive Beam is only guaranteed within this
+module. Please only invoke interfaces provided by this module in your notebook
+or application code if you want backward-compatibility.
+"""
+
+from apache_beam.runners.interactive import interactive_environment as ie
+
+
+def watch(watchable):
+  """Watches a watchable so that Interactive Beam understands your pipeline.
+
+  If you write Beam pipeline in a notebook or __main__ module directly, since
+  __main__ module is always watched by default, you don't have to instruct
 
 Review comment:
   Done.
 



Issue Time Tracking
---

Worklog Id: (was: 312435)
Time Spent: 7h  (was: 6h 50m)

> Interactive Beam Caching PCollections bound to user defined vars in notebook
> 
>
> Key: BEAM-7760
> URL: https://issues.apache.org/jira/browse/BEAM-7760
> Project: Beam
>  Issue Type: New Feature
>  Components: examples-python
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time Spent: 7h
>  Remaining Estimate: 0h
>
> Cache only PCollections bound to user defined variables in a pipeline when 
> running pipeline with interactive runner in jupyter notebooks.
> [Interactive 
> Beam|[https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive]]
>  has been caching and using caches of "leaf" PCollections for interactive 
> execution in jupyter notebooks.
> The interactive execution is currently supported so that when appending new 
> transforms to existing pipeline for a new run, executed part of the pipeline 
> doesn't need to be re-executed. 
> A PCollection is "leaf" when it is never used as input in any PTransform in 
> the pipeline.
> The problem with building caches and pipeline to execute around "leaf" is 
> that when a PCollection is consumed by a sink with no output, the pipeline to 
> execute built will miss the subgraph generating and consuming that 
> PCollection.
> An example, "ReadFromPubSub --> WriteToPubSub" will result in an empty 
> pipeline.
> Caching around PCollections bound to user defined variables and replacing 
> transforms with source and sink of caches could resolve the pipeline to 
> execute properly under the interactive execution scenario. Also, cached 
> PCollection now can trace back to user code and can be used for user data 
> visualization if user wants to do it.
> E.g.,
> {code:java}

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-09-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=312434&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-312434
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 13/Sep/19 23:17
Start Date: 13/Sep/19 23:17
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9278: [BEAM-7760] 
Added Interactive Beam module
URL: https://github.com/apache/beam/pull/9278#discussion_r324396351
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/interactive_beam.py
 ##
 @@ -0,0 +1,88 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module of Interactive Beam features that can be used in notebook.
+
+The purpose of the module is to reduce the learning curve of Interactive Beam
+users, provide a single place for importing and add sugar syntax for all
+Interactive Beam components. It gives users capability to interact with existing
+environment/session/context for Interactive Beam and visualize PCollections as
+bounded dataset. In the meantime, it hides the interactivity implementation
+from users so that users can focus on developing Beam pipeline without worrying
+about how hidden states in the interactive session are managed.
+
+Note: Backward-compatibility of Interactive Beam is only guaranteed within this
+module. Please only invoke interfaces provided by this module in your notebook
+or application code if you want backward-compatibility.
+"""
+
+from apache_beam.runners.interactive import interactive_environment as ie
+
+
+def watch(watchable):
+  """Watches a watchable so that Interactive Beam understands your pipeline.
+
+  If you write Beam pipeline in a notebook or __main__ module directly, since
+  __main__ module is always watched by default, you don't have to instruct
+  Interactive Beam. However, if your Beam pipeline is defined in some module
+  other than __main__, e.g., inside a class function or a unit test, you can
+  watch() the scope to instruct the whereabouts of your pipeline definition so
+  Interactive Beam could apply interactivity to your pipeline when running it.
 
 Review comment:
   Added "that".
 



Issue Time Tracking
---

Worklog Id: (was: 312434)
Time Spent: 6h 50m  (was: 6h 40m)

> Interactive Beam Caching PCollections bound to user defined vars in notebook
> 
>
> Key: BEAM-7760
> URL: https://issues.apache.org/jira/browse/BEAM-7760
> Project: Beam
>  Issue Type: New Feature
>  Components: examples-python
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time Spent: 6h 50m
>  Remaining Estimate: 0h
>
> Cache only PCollections bound to user defined variables in a pipeline when 
> running pipeline with interactive runner in jupyter notebooks.
> [Interactive 
> Beam|[https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive]]
>  has been caching and using caches of "leaf" PCollections for interactive 
> execution in jupyter notebooks.
> The interactive execution is currently supported so that when appending new 
> transforms to existing pipeline for a new run, executed part of the pipeline 
> doesn't need to be re-executed. 
> A PCollection is "leaf" when it is never used as input in any PTransform in 
> the pipeline.
> The problem with building caches and pipeline to execute around "leaf" is 
> that when a PCollection is consumed by a sink with no output, the pipeline to 
> execute built will miss the subgraph generating and consuming that 
> PCollection.
> An example, "ReadFromPubSub --> WriteToPubSub" will result in an empty 
> pipeline.
> Caching around PCollections 

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-09-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=312432&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-312432
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 13/Sep/19 23:16
Start Date: 13/Sep/19 23:16
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9278: [BEAM-7760] 
Added Interactive Beam module
URL: https://github.com/apache/beam/pull/9278#discussion_r324396147
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/interactive_beam.py
 ##
 @@ -0,0 +1,88 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module of Interactive Beam features that can be used in notebook.
+
+The purpose of the module is to reduce the learning curve of Interactive Beam
+users, provide a single place for importing and add sugar syntax for all
+Interactive Beam components. It gives users capability to interact with existing
+environment/session/context for Interactive Beam and visualize PCollections as
+bounded dataset. In the meantime, it hides the interactivity implementation
+from users so that users can focus on developing Beam pipeline without worrying
+about how hidden states in the interactive session are managed.
+
+Note: Backward-compatibility of Interactive Beam is only guaranteed within this
+module. Please only invoke interfaces provided by this module in your notebook
+or application code if you want backward-compatibility.
+"""
+
+from apache_beam.runners.interactive import interactive_environment as ie
+
+
+def watch(watchable):
+  """Watches a watchable so that Interactive Beam understands your pipeline.
+
+  If you write Beam pipeline in a notebook or __main__ module directly, since
 
 Review comment:
   Applied.
 



Issue Time Tracking
---

Worklog Id: (was: 312432)
Time Spent: 6h 40m  (was: 6.5h)

> Interactive Beam Caching PCollections bound to user defined vars in notebook
> 
>
> Key: BEAM-7760
> URL: https://issues.apache.org/jira/browse/BEAM-7760
> Project: Beam
>  Issue Type: New Feature
>  Components: examples-python
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time Spent: 6h 40m
>  Remaining Estimate: 0h
>
> Cache only PCollections bound to user defined variables in a pipeline when 
> running pipeline with interactive runner in jupyter notebooks.
> [Interactive 
> Beam|[https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive]]
>  has been caching and using caches of "leaf" PCollections for interactive 
> execution in jupyter notebooks.
> The interactive execution is currently supported so that when appending new 
> transforms to existing pipeline for a new run, executed part of the pipeline 
> doesn't need to be re-executed. 
> A PCollection is "leaf" when it is never used as input in any PTransform in 
> the pipeline.
> The problem with building caches and pipeline to execute around "leaf" is 
> that when a PCollection is consumed by a sink with no output, the pipeline to 
> execute built will miss the subgraph generating and consuming that 
> PCollection.
> An example, "ReadFromPubSub --> WriteToPubSub" will result in an empty 
> pipeline.
> Caching around PCollections bound to user defined variables and replacing 
> transforms with source and sink of caches could resolve the pipeline to 
> execute properly under the interactive execution scenario. Also, cached 
> PCollection now can trace back to user code and can be used for user data 
> visualization if user wants to do it.
> E.g.,
> {code:java}
> // ...
> p = 

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-09-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=312431&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-312431
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 13/Sep/19 23:11
Start Date: 13/Sep/19 23:11
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9278: [BEAM-7760] 
Added Interactive Beam module
URL: https://github.com/apache/beam/pull/9278#discussion_r324395429
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/interactive_beam.py
 ##
 @@ -0,0 +1,88 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module of Interactive Beam features that can be used in notebook.
+
+The purpose of the module is to reduce the learning curve of Interactive Beam
+users, provide a single place for importing and add sugar syntax for all
+Interactive Beam components. It gives users capability to interact with existing
+environment/session/context for Interactive Beam and visualize PCollections as
+bounded dataset. In the meantime, it hides the interactivity implementation
+from users so that users can focus on developing Beam pipeline without worrying
+about how hidden states in the interactive session are managed.
+
+Note: Backward-compatibility of Interactive Beam is only guaranteed within this
+module. Please only invoke interfaces provided by this module in your notebook
+or application code if you want backward-compatibility.
+"""
+
+from apache_beam.runners.interactive import interactive_environment as ie
+
+
+def watch(watchable):
+  """Watches a watchable so that Interactive Beam understands your pipeline.
 
 Review comment:
   Yes, it means "so that Interactive Beam has information on where your 
pipelines are defined".
 



Issue Time Tracking
---

Worklog Id: (was: 312431)
Time Spent: 6.5h  (was: 6h 20m)

> Interactive Beam Caching PCollections bound to user defined vars in notebook
> 
>
> Key: BEAM-7760
> URL: https://issues.apache.org/jira/browse/BEAM-7760
> Project: Beam
>  Issue Type: New Feature
>  Components: examples-python
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time Spent: 6.5h
>  Remaining Estimate: 0h
>
> Cache only PCollections bound to user defined variables in a pipeline when 
> running pipeline with interactive runner in jupyter notebooks.
> [Interactive 
> Beam|[https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive]]
>  has been caching and using caches of "leaf" PCollections for interactive 
> execution in jupyter notebooks.
> The interactive execution is currently supported so that when appending new 
> transforms to existing pipeline for a new run, executed part of the pipeline 
> doesn't need to be re-executed. 
> A PCollection is "leaf" when it is never used as input in any PTransform in 
> the pipeline.
> The problem with building caches and pipeline to execute around "leaf" is 
> that when a PCollection is consumed by a sink with no output, the pipeline to 
> execute built will miss the subgraph generating and consuming that 
> PCollection.
> An example, "ReadFromPubSub --> WriteToPubSub" will result in an empty 
> pipeline.
> Caching around PCollections bound to user defined variables and replacing 
> transforms with source and sink of caches could resolve the pipeline to 
> execute properly under the interactive execution scenario. Also, cached 
> PCollection now can trace back to user code and can be used for user data 
> visualization if user wants to do it.
> E.g.,
> {code:java}
> // ...
> p = 

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-09-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=312430&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-312430
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 13/Sep/19 23:11
Start Date: 13/Sep/19 23:11
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9278: [BEAM-7760] 
Added Interactive Beam module
URL: https://github.com/apache/beam/pull/9278#discussion_r324395424
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/interactive_beam.py
 ##
 @@ -0,0 +1,88 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module of Interactive Beam features that can be used in notebook.
+
+The purpose of the module is to reduce the learning curve of Interactive Beam
+users, provide a single place for importing and add sugar syntax for all
+Interactive Beam components. It gives users capability to interact with existing
+environment/session/context for Interactive Beam and visualize PCollections as
+bounded dataset. In the meantime, it hides the interactivity implementation
+from users so that users can focus on developing Beam pipeline without worrying
+about how hidden states in the interactive session are managed.
+
+Note: Backward-compatibility of Interactive Beam is only guaranteed within this
+module. Please only invoke interfaces provided by this module in your notebook
+or application code if you want backward-compatibility.
+"""
+
+from apache_beam.runners.interactive import interactive_environment as ie
+
+
+def watch(watchable):
+  """Watches a watchable so that Interactive Beam understands your pipeline.
 
 Review comment:
   I'll move the later explanation of watchable to the front just beneath this 
line.
 



Issue Time Tracking
---

Worklog Id: (was: 312430)
Time Spent: 6h 20m  (was: 6h 10m)

> Interactive Beam Caching PCollections bound to user defined vars in notebook
> 
>
> Key: BEAM-7760
> URL: https://issues.apache.org/jira/browse/BEAM-7760
> Project: Beam
>  Issue Type: New Feature
>  Components: examples-python
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time Spent: 6h 20m
>  Remaining Estimate: 0h
>

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-09-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=312429=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-312429
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 13/Sep/19 23:03
Start Date: 13/Sep/19 23:03
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9278: [BEAM-7760] 
Added Interactive Beam module
URL: https://github.com/apache/beam/pull/9278#discussion_r324394268
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/interactive_beam.py
 ##
 @@ -0,0 +1,88 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module of Interactive Beam features that can be used in notebook.
+
+The purpose of the module is to reduce the learning curve of Interactive Beam
+users, provide a single place for importing and add sugar syntax for all
+Interactive Beam components. It gives users capability to interact with existing
+environment/session/context for Interactive Beam and visualize PCollections as
+bounded dataset. In the meantime, it hides the interactivity implementation
+from users so that users can focus on developing Beam pipeline without worrying
+about how hidden states in the interactive session are managed.
+
+Note: Backward-compatibility of Interactive Beam is only guaranteed within this
+module. Please only invoke interfaces provided by this module in your notebook
+or application code if you want backward-compatibility.
+"""
+
+from apache_beam.runners.interactive import interactive_environment as ie
+
+
+def watch(watchable):
+  """Watches a watchable so that Interactive Beam understands your pipeline.
 
 Review comment:
   Sure!
 



Issue Time Tracking
---

Worklog Id: (was: 312429)
Time Spent: 6h 10m  (was: 6h)

> Interactive Beam Caching PCollections bound to user defined vars in notebook
> 
>
> Key: BEAM-7760
> URL: https://issues.apache.org/jira/browse/BEAM-7760
> Project: Beam
>  Issue Type: New Feature
>  Components: examples-python
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time Spent: 6h 10m
>  Remaining Estimate: 0h
>

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-09-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=312427=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-312427
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 13/Sep/19 23:02
Start Date: 13/Sep/19 23:02
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9278: [BEAM-7760] 
Added Interactive Beam module
URL: https://github.com/apache/beam/pull/9278#discussion_r324394137
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/interactive_beam.py
 ##
 @@ -0,0 +1,88 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module of Interactive Beam features that can be used in notebook.
+
+The purpose of the module is to reduce the learning curve of Interactive Beam
+users, provide a single place for importing and add sugar syntax for all
+Interactive Beam components. It gives users capability to interact with existing
+environment/session/context for Interactive Beam and visualize PCollections as
+bounded dataset. In the meantime, it hides the interactivity implementation
+from users so that users can focus on developing Beam pipeline without worrying
+about how hidden states in the interactive session are managed.
+
+Note: Backward-compatibility of Interactive Beam is only guaranteed within this
+module. Please only invoke interfaces provided by this module in your notebook
 
 Review comment:
   Thanks! I'll apply
   > "Note: If you want backward-compatibility, only invoke interfaces provided 
by this module in your notebook or application code."
   
   
 



Issue Time Tracking
---

Worklog Id: (was: 312427)
Time Spent: 6h  (was: 5h 50m)

> Interactive Beam Caching PCollections bound to user defined vars in notebook
> 
>
> Key: BEAM-7760
> URL: https://issues.apache.org/jira/browse/BEAM-7760
> Project: Beam
>  Issue Type: New Feature
>  Components: examples-python
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time Spent: 6h
>  Remaining Estimate: 0h
>

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-09-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=312426=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-312426
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 13/Sep/19 23:01
Start Date: 13/Sep/19 23:01
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on issue #9278: [BEAM-7760] Added 
Interactive Beam module
URL: https://github.com/apache/beam/pull/9278#issuecomment-531416355
 
 
   > Could you check the failing test please?
   
   Thanks! Sure.
   Those are build failures from 3 tasks:
   - :sdks:python:test-suites:tox:py2:docs
   - :sdks:python:test-suites:tox:py2:lintPy27
   - :sdks:python:test-suites:tox:py2:lintPy27_3
   I'll update the docstring and comments following rosetn@ 's suggestions and 
format everything to pass these tasks.
 



Issue Time Tracking
---

Worklog Id: (was: 312426)
Time Spent: 5h 50m  (was: 5h 40m)

> Interactive Beam Caching PCollections bound to user defined vars in notebook
> 
>
> Key: BEAM-7760
> URL: https://issues.apache.org/jira/browse/BEAM-7760
> Project: Beam
>  Issue Type: New Feature
>  Components: examples-python
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time Spent: 5h 50m
>  Remaining Estimate: 0h
>
> Cache only PCollections bound to user defined variables in a pipeline when 
> running pipeline with interactive runner in jupyter notebooks.
> [Interactive 
> Beam|[https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive]]
>  has been caching and using caches of "leaf" PCollections for interactive 
> execution in jupyter notebooks.
> The interactive execution is currently supported so that when appending new 
> transforms to existing pipeline for a new run, executed part of the pipeline 
> doesn't need to be re-executed. 
> A PCollection is "leaf" when it is never used as input in any PTransform in 
> the pipeline.
> The problem with building caches and the pipeline to execute around "leaf" 
> PCollections is that when a PCollection is consumed by a sink with no output, 
> the pipeline built for execution will miss the subgraph generating and 
> consuming that PCollection.
> For example, "ReadFromPubSub --> WriteToPubSub" will result in an empty 
> pipeline.
> Caching around PCollections bound to user-defined variables, and replacing 
> transforms with the sources and sinks of caches, makes it possible to build 
> the pipeline to execute correctly under the interactive execution scenario. 
> Also, a cached PCollection can now be traced back to user code and can be 
> used for data visualization if the user wants it.
> E.g.,
> {code:python}
> # ...
> p = beam.Pipeline(interactive_runner.InteractiveRunner(),
>                   options=pipeline_options)
> messages = p | "Read" >> beam.io.ReadFromPubSub(subscription='...')
> messages | "Write" >> beam.io.WriteToPubSub(topic_path)
> result = p.run()
> # ...
> visualize(messages){code}
>  The interactive runner automatically figures out that the PCollection
> {code:python}
> messages{code}
> created by
> {code:python}
> p | "Read" >> beam.io.ReadFromPubSub(subscription='...'){code}
> should be cached and reused if the notebook user appends more transforms.
>  Once the pipeline has been executed, the user could use any 
> visualize(PCollection) module to visualize the data statically (batch) or 
> dynamically (stream).
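
The "leaf" problem described in the issue can be illustrated with a small
sketch that does not use Beam at all. It is only a toy model under stated
assumptions: the Transform tuple, leaf_pcollections() and
pcollections_to_cache() are hypothetical names invented for this example and
are not part of the Interactive Beam API. A pipeline is modeled as a list of
transforms that name the PCollections they consume and produce.

```python
from collections import namedtuple

# Hypothetical, simplified stand-in for a pipeline graph: each transform
# lists the names of the PCollections it consumes and produces.
Transform = namedtuple("Transform", ["label", "inputs", "outputs"])


def leaf_pcollections(transforms):
    """Returns the PCollections that no transform takes as input."""
    produced = {pc for t in transforms for pc in t.outputs}
    consumed = {pc for t in transforms for pc in t.inputs}
    return produced - consumed


def pcollections_to_cache(transforms, user_vars):
    """Returns the produced PCollections that are bound to user variables.

    user_vars maps a variable name to the name of the PCollection it holds.
    """
    produced = {pc for t in transforms for pc in t.outputs}
    return {var: pc for var, pc in user_vars.items() if pc in produced}


# "ReadFromPubSub --> WriteToPubSub": the sink produces no output.
pubsub = [
    Transform("Read", inputs=(), outputs=("messages",)),
    Transform("Write", inputs=("messages",), outputs=()),
]

leaves = leaf_pcollections(pubsub)
cached = pcollections_to_cache(pubsub, {"messages": "messages"})
print(leaves)
print(cached)
```

In this toy model the only PCollection is consumed by the sink, so there is no
leaf to build a pipeline around, while the variable-based selection still
finds "messages" as something to cache.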



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-09-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=312396=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-312396
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 13/Sep/19 22:07
Start Date: 13/Sep/19 22:07
Worklog Time Spent: 10m 
  Work Description: rosetn commented on pull request #9278: [BEAM-7760] 
Added Interactive Beam module
URL: https://github.com/apache/beam/pull/9278#discussion_r324373828
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/interactive_beam.py
 ##
 @@ -0,0 +1,88 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module of Interactive Beam features that can be used in notebook.
+
+The purpose of the module is to reduce the learning curve of Interactive Beam
+users, provide a single place for importing and add sugar syntax for all
+Interactive Beam components. It gives users capability to interact with existing
+environment/session/context for Interactive Beam and visualize PCollections as
+bounded dataset. In the meantime, it hides the interactivity implementation
+from users so that users can focus on developing Beam pipeline without worrying
+about how hidden states in the interactive session are managed.
+
+Note: Backward-compatibility of Interactive Beam is only guaranteed within this
+module. Please only invoke interfaces provided by this module in your notebook
+or application code if you want backward-compatibility.
+"""
+
+from apache_beam.runners.interactive import interactive_environment as ie
+
+
+def watch(watchable):
+  """Watches a watchable so that Interactive Beam understands your pipeline.
+
+  If you write Beam pipeline in a notebook or __main__ module directly, since
+  __main__ module is always watched by default, you don't have to instruct
+  Interactive Beam. However, if your Beam pipeline is defined in some module
+  other than __main__, e.g., inside a class function or a unit test, you can
+  watch() the scope to instruct the whereabouts of your pipeline definition so
+  Interactive Beam could apply interactivity to your pipeline when running it.
 
 Review comment:
   missing article "so that Interactive Beam"
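
   For readers following the thread, here is a rough sketch of what "watching"
   a scope could amount to. It is an assumption made for illustration, not the
   code under review: the Watcher class, names_of() and FakePCollection are
   hypothetical, and the real watch() delegates to the interactive_environment
   module, whose internals are not shown here. The idea is that a watched scope
   is a mapping of variable names to values, so that an object can later be
   traced back to the variable name the user bound it to.

```python
class Watcher:
    """Hypothetical registry of scopes to inspect for pipeline variables."""

    def __init__(self):
        self._watching = []

    def watch(self, watchable):
        """Registers a dict of variables (e.g. locals()) or any object
        that has a __dict__ (e.g. a module or a class instance)."""
        scope = watchable if isinstance(watchable, dict) else vars(watchable)
        self._watching.append(scope)

    def names_of(self, obj):
        """Returns the variable names bound to obj in any watched scope."""
        return sorted(
            name
            for scope in self._watching
            for name, value in scope.items()
            if value is obj)


class FakePCollection:
    """Placeholder for a PCollection; carries no data."""


def build_pipeline(watcher):
    # The pipeline is defined inside a function rather than in __main__,
    # so its local scope has to be watched explicitly.
    messages = FakePCollection()
    watcher.watch(locals())
    return messages


watcher = Watcher()
pcoll = build_pipeline(watcher)
names = watcher.names_of(pcoll)
print(names)
```

   Without the watch(locals()) call, the lookup would return an empty list,
   because the function's local variables are invisible from the outside.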
 



Issue Time Tracking
---

Worklog Id: (was: 312396)

> Interactive Beam Caching PCollections bound to user defined vars in notebook
> 
>
> Key: BEAM-7760
> URL: https://issues.apache.org/jira/browse/BEAM-7760
> Project: Beam
>  Issue Type: New Feature
>  Components: examples-python
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time Spent: 5.5h
>  Remaining Estimate: 0h
>

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-09-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=312393=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-312393
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 13/Sep/19 22:07
Start Date: 13/Sep/19 22:07
Worklog Time Spent: 10m 
  Work Description: rosetn commented on pull request #9278: [BEAM-7760] 
Added Interactive Beam module
URL: https://github.com/apache/beam/pull/9278#discussion_r324372834
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/interactive_beam.py
 ##
 @@ -0,0 +1,88 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module of Interactive Beam features that can be used in notebook.
+
+The purpose of the module is to reduce the learning curve of Interactive Beam
+users, provide a single place for importing and add sugar syntax for all
+Interactive Beam components. It gives users capability to interact with existing
+environment/session/context for Interactive Beam and visualize PCollections as
+bounded dataset. In the meantime, it hides the interactivity implementation
+from users so that users can focus on developing Beam pipeline without worrying
+about how hidden states in the interactive session are managed.
+
+Note: Backward-compatibility of Interactive Beam is only guaranteed within this
+module. Please only invoke interfaces provided by this module in your notebook
+or application code if you want backward-compatibility.
+"""
+
+from apache_beam.runners.interactive import interactive_environment as ie
+
+
+def watch(watchable):
+  """Watches a watchable so that Interactive Beam understands your pipeline.
 
 Review comment:
   What do you mean by "understands"? Do you mean "so that Interactive Beam has 
information on where your pipelines are defined"?
 



Issue Time Tracking
---

Worklog Id: (was: 312393)
Time Spent: 5.5h  (was: 5h 20m)

> Interactive Beam Caching PCollections bound to user defined vars in notebook
> 
>
> Key: BEAM-7760
> URL: https://issues.apache.org/jira/browse/BEAM-7760
> Project: Beam
>  Issue Type: New Feature
>  Components: examples-python
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time Spent: 5.5h
>  Remaining Estimate: 0h
>

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-09-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=312390=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-312390
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 13/Sep/19 22:07
Start Date: 13/Sep/19 22:07
Worklog Time Spent: 10m 
  Work Description: rosetn commented on pull request #9278: [BEAM-7760] 
Added Interactive Beam module
URL: https://github.com/apache/beam/pull/9278#discussion_r324372970
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/interactive_beam.py
 ##
 @@ -0,0 +1,88 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module of Interactive Beam features that can be used in notebook.
+
+The purpose of the module is to reduce the learning curve of Interactive Beam
+users, provide a single place for importing and add sugar syntax for all
+Interactive Beam components. It gives users capability to interact with existing
+environment/session/context for Interactive Beam and visualize PCollections as
+bounded dataset. In the meantime, it hides the interactivity implementation
+from users so that users can focus on developing Beam pipeline without worrying
+about how hidden states in the interactive session are managed.
+
+Note: Backward-compatibility of Interactive Beam is only guaranteed within this
+module. Please only invoke interfaces provided by this module in your notebook
+or application code if you want backward-compatibility.
+"""
+
+from apache_beam.runners.interactive import interactive_environment as ie
+
+
+def watch(watchable):
+  """Watches a watchable so that Interactive Beam understands your pipeline.
+
+  If you write Beam pipeline in a notebook or __main__ module directly, since
 
 Review comment:
   Missing article
   
   "If you write a Beam pipeline"
 



Issue Time Tracking
---

Worklog Id: (was: 312390)
Time Spent: 5h 20m  (was: 5h 10m)

> Interactive Beam Caching PCollections bound to user defined vars in notebook
> 
>
> Key: BEAM-7760
> URL: https://issues.apache.org/jira/browse/BEAM-7760
> Project: Beam
>  Issue Type: New Feature
>  Components: examples-python
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-09-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=312392=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-312392
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 13/Sep/19 22:07
Start Date: 13/Sep/19 22:07
Worklog Time Spent: 10m 
  Work Description: rosetn commented on pull request #9278: [BEAM-7760] 
Added Interactive Beam module
URL: https://github.com/apache/beam/pull/9278#discussion_r324374455
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/interactive_beam.py
 ##
 @@ -0,0 +1,88 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module of Interactive Beam features that can be used in notebook.
+
+The purpose of the module is to reduce the learning curve of Interactive Beam
+users, provide a single place for importing and add sugar syntax for all
+Interactive Beam components. It gives users capability to interact with existing
+environment/session/context for Interactive Beam and visualize PCollections as
+bounded dataset. In the meantime, it hides the interactivity implementation
+from users so that users can focus on developing Beam pipeline without worrying
+about how hidden states in the interactive session are managed.
+
+Note: Backward-compatibility of Interactive Beam is only guaranteed within this
+module. Please only invoke interfaces provided by this module in your notebook
+or application code if you want backward-compatibility.
+"""
+
+from apache_beam.runners.interactive import interactive_environment as ie
+
+
+def watch(watchable):
+  """Watches a watchable so that Interactive Beam understands your pipeline.
+
+  If you write Beam pipeline in a notebook or __main__ module directly, since
+  __main__ module is always watched by default, you don't have to instruct
 
 Review comment:
   "since the \_\_main\_\_ module is always watched"
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 312392)

> Interactive Beam Caching PCollections bound to user defined vars in notebook
> 
>
> Key: BEAM-7760
> URL: https://issues.apache.org/jira/browse/BEAM-7760
> Project: Beam
>  Issue Type: New Feature
>  Components: examples-python
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> Cache only PCollections bound to user defined variables in a pipeline when 
> running pipeline with interactive runner in jupyter notebooks.
> [Interactive 
> Beam|[https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive]]
>  has been caching and using caches of "leaf" PCollections for interactive 
> execution in jupyter notebooks.
> The interactive execution is currently supported so that when appending new 
> transforms to existing pipeline for a new run, executed part of the pipeline 
> doesn't need to be re-executed. 
> A PCollection is "leaf" when it is never used as input in any PTransform in 
> the pipeline.
> The problem with building caches and pipeline to execute around "leaf" is 
> that when a PCollection is consumed by a sink with no output, the pipeline to 
> execute built will miss the subgraph generating and consuming that 
> PCollection.
> For example, "ReadFromPubSub --> WriteToPubSub" will result in an empty 
> pipeline.
> Caching around PCollections bound to user defined variables and replacing 
> transforms with source and sink of caches could resolve the pipeline to 
> execute properly under the interactive execution scenario. Also, cached 
> PCollection now can trace back to user code and can be used for user data 
> visualization if user wants to do it.
> E.g.,
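
As a reading aid for the "leaf" problem described in the quoted issue, here is a minimal sketch that is separate from the issue's own example. It assumes a hypothetical, simplified graph in which each transform name maps to its input and output PCollection names; the function `leaf_pcollections` and both sample pipelines are illustrative and are not part of the Beam API.

```python
# Hypothetical, simplified pipeline graph: each transform name maps to
# (input PCollection names, output PCollection names). Not Beam API.
def leaf_pcollections(transforms):
    """Return PCollections that no transform consumes as input."""
    produced = set()
    consumed = set()
    for inputs, outputs in transforms.values():
        consumed.update(inputs)
        produced.update(outputs)
    return produced - consumed


# A source feeding a sink that produces no output PCollection.
pubsub_pipeline = {
    'ReadFromPubSub': ([], ['messages']),
    'WriteToPubSub': (['messages'], []),
}

# A pipeline whose last transform does produce an output.
square_pipeline = {
    'Create': ([], ['init_pcoll']),
    'Square': (['init_pcoll'], ['squares']),
}

pubsub_leaves = leaf_pcollections(pubsub_pipeline)
square_leaves = leaf_pcollections(square_pipeline)

# The only PCollection in the Pub/Sub pipeline is consumed by the sink,
# so there is no leaf to build a cache (or a pipeline to execute) around.
print(pubsub_leaves)
print(square_leaves)
```

Under these assumptions the Pub/Sub pipeline has no leaf at all, which is the situation the issue says leads to an empty pipeline. Keying the cache on user-defined variables avoids depending on leaves.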

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-09-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=312394&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-312394
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 13/Sep/19 22:07
Start Date: 13/Sep/19 22:07
Worklog Time Spent: 10m 
  Work Description: rosetn commented on pull request #9278: [BEAM-7760] 
Added Interactive Beam module
URL: https://github.com/apache/beam/pull/9278#discussion_r324383800
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/interactive_beam.py
 ##
 @@ -0,0 +1,88 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module of Interactive Beam features that can be used in notebook.
+
+The purpose of the module is to reduce the learning curve of Interactive Beam
+users, provide a single place for importing and add sugar syntax for all
+Interactive Beam components. It gives users capability to interact with existing
+environment/session/context for Interactive Beam and visualize PCollections as
+bounded dataset. In the meantime, it hides the interactivity implementation
+from users so that users can focus on developing Beam pipeline without worrying
+about how hidden states in the interactive session are managed.
+
+Note: Backward-compatibility of Interactive Beam is only guaranteed within this
+module. Please only invoke interfaces provided by this module in your notebook
+or application code if you want backward-compatibility.
+"""
+
+from apache_beam.runners.interactive import interactive_environment as ie
+
+
+def watch(watchable):
+  """Watches a watchable so that Interactive Beam understands your pipeline.
+
+  If you write Beam pipeline in a notebook or __main__ module directly, since
+  __main__ module is always watched by default, you don't have to instruct
+  Interactive Beam. However, if your Beam pipeline is defined in some module
+  other than __main__, e.g., inside a class function or a unit test, you can
+  watch() the scope to instruct the whereabouts of your pipeline definition so
+  Interactive Beam could apply interactivity to your pipeline when running it.
+
+    For example:
+
+      class Foo(object):
+        def build_pipeline(self):
+          p = beam.Pipeline()
+          init_pcoll = p | 'Init Create' >> beam.Create(range(10))
+          watch(locals())
+          return p
+      Foo().build_pipeline().run()
+
+    Interactive Beam will cache init_pcoll for the first run. You can use:
+
+      visualize(init_pcoll)
+
+    To visualize data from init_pcoll once the pipeline is executed. And if you
+    make change to the original pipeline by adding:
+
+      squares = init_pcoll | 'Square' >> beam.Map(lambda x: x*x)
+
+    When you re-run the pipeline from the line you just added, squares will
+    use the init_pcoll data cached so you can have an interactive experience.
+
+  Currently the implementation mainly watches for PCollection variables defined
 
 Review comment:
   I think this information needs to happen first--what do you think about 
moving lines 68-80 to line 38? If so, you can disregard my comments above about 
"understands"
 



Issue Time Tracking
---

Worklog Id: (was: 312394)
Time Spent: 5.5h  (was: 5h 20m)

> Interactive Beam Caching PCollections bound to user defined vars in notebook
> 
>
> Key: BEAM-7760
> URL: https://issues.apache.org/jira/browse/BEAM-7760
> Project: Beam
>  Issue Type: New Feature
>  Components: examples-python
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time Spent: 5.5h
>  Remaining Estimate: 0h
>
> Cache only PCollections bound to user defined variables in a 

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-09-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=312397&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-312397
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 13/Sep/19 22:07
Start Date: 13/Sep/19 22:07
Worklog Time Spent: 10m 
  Work Description: rosetn commented on pull request #9278: [BEAM-7760] 
Added Interactive Beam module
URL: https://github.com/apache/beam/pull/9278#discussion_r324373045
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/interactive_beam.py
 ##
 @@ -0,0 +1,88 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module of Interactive Beam features that can be used in notebook.
+
+The purpose of the module is to reduce the learning curve of Interactive Beam
+users, provide a single place for importing and add sugar syntax for all
+Interactive Beam components. It gives users capability to interact with existing
+environment/session/context for Interactive Beam and visualize PCollections as
+bounded dataset. In the meantime, it hides the interactivity implementation
+from users so that users can focus on developing Beam pipeline without worrying
+about how hidden states in the interactive session are managed.
+
+Note: Backward-compatibility of Interactive Beam is only guaranteed within this
+module. Please only invoke interfaces provided by this module in your notebook
+or application code if you want backward-compatibility.
+"""
+
+from apache_beam.runners.interactive import interactive_environment as ie
+
+
+def watch(watchable):
+  """Watches a watchable so that Interactive Beam understands your pipeline.
+
+  If you write Beam pipeline in a notebook or __main__ module directly, since
 
 Review comment:
   "or the  \_\_main\_\_ "
   
 



Issue Time Tracking
---

Worklog Id: (was: 312397)
Time Spent: 5h 40m  (was: 5.5h)

> Interactive Beam Caching PCollections bound to user defined vars in notebook
> 
>
> Key: BEAM-7760
> URL: https://issues.apache.org/jira/browse/BEAM-7760
> Project: Beam
>  Issue Type: New Feature
>  Components: examples-python
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time Spent: 5h 40m
>  Remaining Estimate: 0h
>
> Cache only PCollections bound to user defined variables in a pipeline when 
> running pipeline with interactive runner in jupyter notebooks.
> [Interactive 
> Beam|[https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive]]
>  has been caching and using caches of "leaf" PCollections for interactive 
> execution in jupyter notebooks.
> The interactive execution is currently supported so that when appending new 
> transforms to existing pipeline for a new run, executed part of the pipeline 
> doesn't need to be re-executed. 
> A PCollection is "leaf" when it is never used as input in any PTransform in 
> the pipeline.
> The problem with building caches and pipeline to execute around "leaf" is 
> that when a PCollection is consumed by a sink with no output, the pipeline to 
> execute built will miss the subgraph generating and consuming that 
> PCollection.
> For example, "ReadFromPubSub --> WriteToPubSub" will result in an empty 
> pipeline.
> Caching around PCollections bound to user defined variables and replacing 
> transforms with source and sink of caches could resolve the pipeline to 
> execute properly under the interactive execution scenario. Also, cached 
> PCollection now can trace back to user code and can be used for user data 
> visualization if user wants to do it.
> E.g.,
> {code:java}
> // ...
> p = 

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-09-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=312391&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-312391
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 13/Sep/19 22:07
Start Date: 13/Sep/19 22:07
Worklog Time Spent: 10m 
  Work Description: rosetn commented on pull request #9278: [BEAM-7760] 
Added Interactive Beam module
URL: https://github.com/apache/beam/pull/9278#discussion_r324376318
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/interactive_beam.py
 ##
 @@ -0,0 +1,88 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module of Interactive Beam features that can be used in notebook.
+
+The purpose of the module is to reduce the learning curve of Interactive Beam
+users, provide a single place for importing and add sugar syntax for all
+Interactive Beam components. It gives users capability to interact with existing
+environment/session/context for Interactive Beam and visualize PCollections as
+bounded dataset. In the meantime, it hides the interactivity implementation
+from users so that users can focus on developing Beam pipeline without worrying
+about how hidden states in the interactive session are managed.
+
+Note: Backward-compatibility of Interactive Beam is only guaranteed within this
+module. Please only invoke interfaces provided by this module in your notebook
+or application code if you want backward-compatibility.
+"""
+
+from apache_beam.runners.interactive import interactive_environment as ie
+
+
+def watch(watchable):
+  """Watches a watchable so that Interactive Beam understands your pipeline.
+
+  If you write Beam pipeline in a notebook or __main__ module directly, since
+  __main__ module is always watched by default, you don't have to instruct
+  Interactive Beam. However, if your Beam pipeline is defined in some module
+  other than __main__, e.g., inside a class function or a unit test, you can
 
 Review comment:
   The object of the verb is a little confusing and some of the pronouns are 
ambiguous, maybe we can break it into two sentences? Check if my rewrite is 
what you actually meant:
   
   So replace:
   "However, if your Beam pipeline is defined in some module
 other than \_\_main\_\_, e.g., inside a class function or a unit test, you can
 watch() the scope to instruct the whereabouts of your pipeline definition so
 Interactive Beam could apply interactivity to your pipeline when running it."
   
   with
   
   "If your Beam pipeline is defined in some module
 other than \_\_main\_\_, such as inside a class function or a unit test, you can
 watch() the scope. This allows Interactive Beam to implicitly pass on the
 information about the location of your pipeline definition."
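
To make the "watch() the scope" wording concrete, the following is a minimal sketch of how watching a local scope could expose a pipeline's variables. It is not the implementation under review. It assumes a watchable is a dict such as `locals()`; the `PCollection` stand-in class, the `_watched` registry and `watched_pcollections` are hypothetical names introduced only for this illustration.

```python
class PCollection(object):
    """Stand-in for a Beam PCollection; hypothetical, for illustration."""

    def __init__(self, label):
        self.label = label


# Hypothetical registry of watched scopes (not the real Interactive Beam state).
_watched = []


def watch(watchable):
    """Record a snapshot of a scope, assumed here to be a dict like locals()."""
    _watched.append(dict(watchable))


def watched_pcollections():
    """Map variable names to the PCollections found in all watched scopes."""
    found = {}
    for scope in _watched:
        for name, value in scope.items():
            if isinstance(value, PCollection):
                found[name] = value
    return found


class Foo(object):
    def build_pipeline(self):
        # Defined inside a method, so it is invisible from __main__
        # unless the local scope is watched explicitly.
        init_pcoll = PCollection('Init Create')
        watch(locals())
        return init_pcoll


Foo().build_pipeline()
names = sorted(watched_pcollections())
print(names)
```

In this sketch only `init_pcoll` is picked up, because `self` is in the watched scope but is not a PCollection. That matches the docstring's point that variables defined outside `__main__` need an explicit `watch()` call.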
 



Issue Time Tracking
---

Worklog Id: (was: 312391)

> Interactive Beam Caching PCollections bound to user defined vars in notebook
> 
>
> Key: BEAM-7760
> URL: https://issues.apache.org/jira/browse/BEAM-7760
> Project: Beam
>  Issue Type: New Feature
>  Components: examples-python
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> Cache only PCollections bound to user defined variables in a pipeline when 
> running pipeline with interactive runner in jupyter notebooks.
> [Interactive 
> Beam|[https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive]]
>  has been caching and using caches of "leaf" PCollections for interactive 
> execution in jupyter notebooks.
> The interactive execution is currently supported so that 

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-09-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=312387&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-312387
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 13/Sep/19 22:07
Start Date: 13/Sep/19 22:07
Worklog Time Spent: 10m 
  Work Description: rosetn commented on pull request #9278: [BEAM-7760] 
Added Interactive Beam module
URL: https://github.com/apache/beam/pull/9278#discussion_r324371613
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/interactive_beam.py
 ##
 @@ -0,0 +1,88 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module of Interactive Beam features that can be used in notebook.
+
+The purpose of the module is to reduce the learning curve of Interactive Beam
+users, provide a single place for importing and add sugar syntax for all
+Interactive Beam components. It gives users capability to interact with existing
+environment/session/context for Interactive Beam and visualize PCollections as
+bounded dataset. In the meantime, it hides the interactivity implementation
+from users so that users can focus on developing Beam pipeline without worrying
+about how hidden states in the interactive session are managed.
+
+Note: Backward-compatibility of Interactive Beam is only guaranteed within this
+module. Please only invoke interfaces provided by this module in your notebook
+or application code if you want backward-compatibility.
+"""
+
+from apache_beam.runners.interactive import interactive_environment as ie
+
+
+def watch(watchable):
+  """Watches a watchable so that Interactive Beam understands your pipeline.
 
 Review comment:
   Is "watchable" a common term? I see that you define it in the test file, but 
it doesn't make sense to me without context. 
 



Issue Time Tracking
---

Worklog Id: (was: 312387)
Time Spent: 5h 10m  (was: 5h)

> Interactive Beam Caching PCollections bound to user defined vars in notebook
> 
>
> Key: BEAM-7760
> URL: https://issues.apache.org/jira/browse/BEAM-7760
> Project: Beam
>  Issue Type: New Feature
>  Components: examples-python
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time Spent: 5h 10m
>  Remaining Estimate: 0h
>
> Cache only PCollections bound to user defined variables in a pipeline when 
> running pipeline with interactive runner in jupyter notebooks.
> [Interactive 
> Beam|[https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive]]
>  has been caching and using caches of "leaf" PCollections for interactive 
> execution in jupyter notebooks.
> The interactive execution is currently supported so that when appending new 
> transforms to existing pipeline for a new run, executed part of the pipeline 
> doesn't need to be re-executed. 
> A PCollection is "leaf" when it is never used as input in any PTransform in 
> the pipeline.
> The problem with building caches and pipeline to execute around "leaf" is 
> that when a PCollection is consumed by a sink with no output, the pipeline to 
> execute built will miss the subgraph generating and consuming that 
> PCollection.
> For example, "ReadFromPubSub --> WriteToPubSub" will result in an empty 
> pipeline.
> Caching around PCollections bound to user defined variables and replacing 
> transforms with source and sink of caches could resolve the pipeline to 
> execute properly under the interactive execution scenario. Also, cached 
> PCollection now can trace back to user code and can be used for user data 
> visualization if user wants to do it.
> E.g.,
> {code:java}
> // ...
> p = 

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-09-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=312386&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-312386
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 13/Sep/19 22:07
Start Date: 13/Sep/19 22:07
Worklog Time Spent: 10m 
  Work Description: rosetn commented on pull request #9278: [BEAM-7760] 
Added Interactive Beam module
URL: https://github.com/apache/beam/pull/9278#discussion_r324370787
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/interactive_beam.py
 ##
 @@ -0,0 +1,88 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module of Interactive Beam features that can be used in notebook.
+
+The purpose of the module is to reduce the learning curve of Interactive Beam
+users, provide a single place for importing and add sugar syntax for all
+Interactive Beam components. It gives users capability to interact with existing
+environment/session/context for Interactive Beam and visualize PCollections as
+bounded dataset. In the meantime, it hides the interactivity implementation
+from users so that users can focus on developing Beam pipeline without worrying
+about how hidden states in the interactive session are managed.
+
+Note: Backward-compatibility of Interactive Beam is only guaranteed within this
+module. Please only invoke interfaces provided by this module in your notebook
+or application code if you want backward-compatibility.
+"""
+
+from apache_beam.runners.interactive import interactive_environment as ie
+
+
+def watch(watchable):
+  """Watches a watchable so that Interactive Beam understands your pipeline.
 
 Review comment:
   If it means the same thing, you can replace "watches" with "monitors" to 
make the sentence clearer
 



Issue Time Tracking
---

Worklog Id: (was: 312386)
Time Spent: 5h  (was: 4h 50m)

> Interactive Beam Caching PCollections bound to user defined vars in notebook
> 
>
> Key: BEAM-7760
> URL: https://issues.apache.org/jira/browse/BEAM-7760
> Project: Beam
>  Issue Type: New Feature
>  Components: examples-python
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time Spent: 5h
>  Remaining Estimate: 0h
>
> Cache only PCollections bound to user defined variables in a pipeline when 
> running pipeline with interactive runner in jupyter notebooks.
> [Interactive 
> Beam|[https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive]]
>  has been caching and using caches of "leaf" PCollections for interactive 
> execution in jupyter notebooks.
> The interactive execution is currently supported so that when appending new 
> transforms to existing pipeline for a new run, executed part of the pipeline 
> doesn't need to be re-executed. 
> A PCollection is "leaf" when it is never used as input in any PTransform in 
> the pipeline.
> The problem with building caches and pipeline to execute around "leaf" is 
> that when a PCollection is consumed by a sink with no output, the pipeline to 
> execute built will miss the subgraph generating and consuming that 
> PCollection.
> For example, "ReadFromPubSub --> WriteToPubSub" will result in an empty 
> pipeline.
> Caching around PCollections bound to user defined variables and replacing 
> transforms with source and sink of caches could resolve the pipeline to 
> execute properly under the interactive execution scenario. Also, cached 
> PCollection now can trace back to user code and can be used for user data 
> visualization if user wants to do it.
> E.g.,
> {code:java}
> // ...
> p = 

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-09-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=312385&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-312385
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 13/Sep/19 22:07
Start Date: 13/Sep/19 22:07
Worklog Time Spent: 10m 
  Work Description: rosetn commented on pull request #9278: [BEAM-7760] 
Added Interactive Beam module
URL: https://github.com/apache/beam/pull/9278#discussion_r324369391
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/interactive_beam.py
 ##
 @@ -0,0 +1,88 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module of Interactive Beam features that can be used in notebook.
+
+The purpose of the module is to reduce the learning curve of Interactive Beam
+users, provide a single place for importing and add sugar syntax for all
+Interactive Beam components. It gives users capability to interact with existing
+environment/session/context for Interactive Beam and visualize PCollections as
+bounded dataset. In the meantime, it hides the interactivity implementation
+from users so that users can focus on developing Beam pipeline without worrying
+about how hidden states in the interactive session are managed.
+
+Note: Backward-compatibility of Interactive Beam is only guaranteed within this
+module. Please only invoke interfaces provided by this module in your notebook
 
 Review comment:
   Is this note stating something different than the other interactive Beam 
modules? If yes, then let's use the same disclaimer for consistency:
   
   "This module is experimental. No backwards-compatibility guarantees."
   
   Else, if there's extra information to share, I think you can cut down on 
words by replacing the note with:
   
   "Note: If you want backward-compatibility, only invoke interfaces provided 
by this module in your notebook or application code."
 



Issue Time Tracking
---

Worklog Id: (was: 312385)
Time Spent: 5h  (was: 4h 50m)

> Interactive Beam Caching PCollections bound to user defined vars in notebook
> 
>
> Key: BEAM-7760
> URL: https://issues.apache.org/jira/browse/BEAM-7760
> Project: Beam
>  Issue Type: New Feature
>  Components: examples-python
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time Spent: 5h
>  Remaining Estimate: 0h
>
> Cache only PCollections bound to user defined variables in a pipeline when 
> running pipeline with interactive runner in jupyter notebooks.
> [Interactive 
> Beam|[https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive]]
>  has been caching and using caches of "leaf" PCollections for interactive 
> execution in jupyter notebooks.
> The interactive execution is currently supported so that when appending new 
> transforms to existing pipeline for a new run, executed part of the pipeline 
> doesn't need to be re-executed. 
> A PCollection is "leaf" when it is never used as input in any PTransform in 
> the pipeline.
> The problem with building caches and pipeline to execute around "leaf" is 
> that when a PCollection is consumed by a sink with no output, the pipeline to 
> execute built will miss the subgraph generating and consuming that 
> PCollection.
> For example, "ReadFromPubSub --> WriteToPubSub" will result in an empty 
> pipeline.
> Caching around PCollections bound to user defined variables and replacing 
> transforms with source and sink of caches could resolve the pipeline to 
> execute properly under the interactive execution scenario. Also, cached 
> PCollection now can trace back to user code and can be used for user data 

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-09-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=312389=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-312389
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 13/Sep/19 22:07
Start Date: 13/Sep/19 22:07
Worklog Time Spent: 10m 
  Work Description: rosetn commented on pull request #9278: [BEAM-7760] 
Added Interactive Beam module
URL: https://github.com/apache/beam/pull/9278#discussion_r324374036
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/interactive_beam.py
 ##
 @@ -0,0 +1,88 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module of Interactive Beam features that can be used in notebook.
+
+The purpose of the module is to reduce the learning curve of Interactive Beam
+users, provide a single place for importing and add sugar syntax for all
+Interactive Beam components. It gives users capability to interact with existing
+environment/session/context for Interactive Beam and visualize PCollections as
+bounded dataset. In the meantime, it hides the interactivity implementation
+from users so that users can focus on developing Beam pipeline without worrying
+about how hidden states in the interactive session are managed.
+
+Note: Backward-compatibility of Interactive Beam is only guaranteed within this
+module. Please only invoke interfaces provided by this module in your notebook
+or application code if you want backward-compatibility.
+"""
+
+from apache_beam.runners.interactive import interactive_environment as ie
+
+
+def watch(watchable):
+  """Watches a watchable so that Interactive Beam understands your pipeline.
+
+  If you write Beam pipeline in a notebook or __main__ module directly, since
+  __main__ module is always watched by default, you don't have to instruct
+  Interactive Beam. However, if your Beam pipeline is defined in some module
+  other than __main__, e.g., inside a class function or a unit test, you can
+  watch() the scope to instruct the whereabouts of your pipeline definition so
+  Interactive Beam could apply interactivity to your pipeline when running it.
+
+    For example:
+
+    class Foo(object):
+      def build_pipeline(self):
+        p = beam.Pipeline()
+        init_pcoll = p |  'Init Create' >> beam.Create(range(10))
+        watch(locals())
+        return p
+    Foo().build_pipeline().run()
+
+    Interactive Beam will cache init_pcoll for the first run. You can use:
 
 Review comment:
   "will cache"->"caches"
   
   https://developers.google.com/style/tense
 



Issue Time Tracking
---

Worklog Id: (was: 312389)
Time Spent: 5h 10m  (was: 5h)

> Interactive Beam Caching PCollections bound to user defined vars in notebook
> 
>
> Key: BEAM-7760
> URL: https://issues.apache.org/jira/browse/BEAM-7760
> Project: Beam
>  Issue Type: New Feature
>  Components: examples-python
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time Spent: 5h 10m
>  Remaining Estimate: 0h

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-09-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=312388=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-312388
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 13/Sep/19 22:07
Start Date: 13/Sep/19 22:07
Worklog Time Spent: 10m 
  Work Description: rosetn commented on pull request #9278: [BEAM-7760] 
Added Interactive Beam module
URL: https://github.com/apache/beam/pull/9278#discussion_r324383327
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/interactive_beam.py
 ##
 @@ -0,0 +1,88 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module of Interactive Beam features that can be used in notebook.
+
+The purpose of the module is to reduce the learning curve of Interactive Beam
+users, provide a single place for importing and add sugar syntax for all
+Interactive Beam components. It gives users capability to interact with existing
+environment/session/context for Interactive Beam and visualize PCollections as
+bounded dataset. In the meantime, it hides the interactivity implementation
+from users so that users can focus on developing Beam pipeline without worrying
+about how hidden states in the interactive session are managed.
+
+Note: Backward-compatibility of Interactive Beam is only guaranteed within this
+module. Please only invoke interfaces provided by this module in your notebook
+or application code if you want backward-compatibility.
+"""
+
+from apache_beam.runners.interactive import interactive_environment as ie
+
+
+def watch(watchable):
+  """Watches a watchable so that Interactive Beam understands your pipeline.
+
+  If you write Beam pipeline in a notebook or __main__ module directly, since
+  __main__ module is always watched by default, you don't have to instruct
+  Interactive Beam. However, if your Beam pipeline is defined in some module
+  other than __main__, e.g., inside a class function or a unit test, you can
+  watch() the scope to instruct the whereabouts of your pipeline definition so
+  Interactive Beam could apply interactivity to your pipeline when running it.
+
+    For example:
+
+    class Foo(object):
+      def build_pipeline(self):
+        p = beam.Pipeline()
+        init_pcoll = p |  'Init Create' >> beam.Create(range(10))
+        watch(locals())
+        return p
+    Foo().build_pipeline().run()
+
+    Interactive Beam will cache init_pcoll for the first run. You can use:
+
+    visualize(init_pcoll)
 
 Review comment:
   It might be easier to see related ideas if you add some transitions:
   
   "Interactive Beam caches init_pcoll for the first run. Once the pipeline is 
executed, you can use visualize(init_pcoll) to visualize data from init_pcoll. 
   
   You can then make the following change to your original pipeline to add 
squares using the init_pcoll data:
squares = init_pcoll | 'Square' >> beam.Map(lambda x: x*x)
   "
   
   I'm not sure why adding squares will add an "interactive experience" though, 
could you elaborate a bit more?
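
   For readers following the docstring under review, here is a minimal sketch 
of why passing locals() to watch() is enough to expose init_pcoll. It does not 
use Beam: FakePCollection and find_pcolls are hypothetical stand-ins invented 
for illustration, and the real watch() implementation may differ.

```python
class FakePCollection(object):
  """Hypothetical stand-in for a Beam PCollection (not a Beam class)."""

  def __init__(self, label):
    self.label = label


def find_pcolls(watchable):
  """Returns {variable name: value} for stand-in PCollections in a scope dict."""
  return {name: value for name, value in watchable.items()
          if isinstance(value, FakePCollection)}


def build_pipeline():
  init_pcoll = FakePCollection('Init Create')
  unrelated = 42  # Not a PCollection, so it is ignored.
  # locals() is a dict mapping this function's variable names to their values.
  return find_pcolls(locals())


watched = build_pipeline()
print(sorted(watched))
```

   The point of the sketch is only that a scope dictionary carries the 
variable names, which is what lets a cached PCollection be traced back to user 
code.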
 



Issue Time Tracking
---

Worklog Id: (was: 312388)
Time Spent: 5h 10m  (was: 5h)

> Interactive Beam Caching PCollections bound to user defined vars in notebook
> 
>
> Key: BEAM-7760
> URL: https://issues.apache.org/jira/browse/BEAM-7760
> Project: Beam
>  Issue Type: New Feature
>  Components: examples-python
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time Spent: 5h 10m
>  Remaining Estimate: 0h

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-09-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=312395=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-312395
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 13/Sep/19 22:07
Start Date: 13/Sep/19 22:07
Worklog Time Spent: 10m 
  Work Description: rosetn commented on pull request #9278: [BEAM-7760] 
Added Interactive Beam module
URL: https://github.com/apache/beam/pull/9278#discussion_r324374632
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/interactive_beam.py
 ##
 @@ -0,0 +1,88 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module of Interactive Beam features that can be used in notebook.
+
+The purpose of the module is to reduce the learning curve of Interactive Beam
+users, provide a single place for importing and add sugar syntax for all
+Interactive Beam components. It gives users capability to interact with existing
+environment/session/context for Interactive Beam and visualize PCollections as
+bounded dataset. In the meantime, it hides the interactivity implementation
+from users so that users can focus on developing Beam pipeline without worrying
+about how hidden states in the interactive session are managed.
+
+Note: Backward-compatibility of Interactive Beam is only guaranteed within this
+module. Please only invoke interfaces provided by this module in your notebook
+or application code if you want backward-compatibility.
+"""
+
+from apache_beam.runners.interactive import interactive_environment as ie
+
+
+def watch(watchable):
+  """Watches a watchable so that Interactive Beam understands your pipeline.
+
+  If you write Beam pipeline in a notebook or __main__ module directly, since
 
 Review comment:
   Do you only mean a Jupyter notebook?
 



Issue Time Tracking
---

Worklog Id: (was: 312395)
Time Spent: 5.5h  (was: 5h 20m)

> Interactive Beam Caching PCollections bound to user defined vars in notebook
> 
>
> Key: BEAM-7760
> URL: https://issues.apache.org/jira/browse/BEAM-7760
> Project: Beam
>  Issue Type: New Feature
>  Components: examples-python
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time Spent: 5.5h
>  Remaining Estimate: 0h

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-09-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=310859=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-310859
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 11/Sep/19 17:17
Start Date: 11/Sep/19 17:17
Worklog Time Spent: 10m 
  Work Description: aaltay commented on issue #9278: [BEAM-7760] Added 
Interactive Beam module
URL: https://github.com/apache/beam/pull/9278#issuecomment-530477700
 
 
   Could you check the failing test please?
 



Issue Time Tracking
---

Worklog Id: (was: 310859)
Time Spent: 4h 50m  (was: 4h 40m)

> Interactive Beam Caching PCollections bound to user defined vars in notebook
> 
>
> Key: BEAM-7760
> URL: https://issues.apache.org/jira/browse/BEAM-7760
> Project: Beam
>  Issue Type: New Feature
>  Components: examples-python
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time Spent: 4h 50m
>  Remaining Estimate: 0h
>
> Cache only PCollections bound to user defined variables in a pipeline when 
> running pipeline with interactive runner in jupyter notebooks.
> [Interactive 
> Beam|[https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive]]
>  has been caching and using caches of "leaf" PCollections for interactive 
> execution in jupyter notebooks.
> The interactive execution is currently supported so that when appending new 
> transforms to existing pipeline for a new run, executed part of the pipeline 
> doesn't need to be re-executed. 
> A PCollection is "leaf" when it is never used as input in any PTransform in 
> the pipeline.
> The problem with building the caches and the pipeline to execute around 
> "leaf" PCollections is that when a PCollection is consumed by a sink with no 
> output, the pipeline built for execution misses the subgraph that generates 
> and consumes that PCollection.
> For example, "ReadFromPubSub --> WriteToPubSub" results in an empty 
> pipeline.
> Caching around PCollections bound to user defined variables and replacing 
> transforms with source and sink of caches could resolve the pipeline to 
> execute properly under the interactive execution scenario. Also, cached 
> PCollection now can trace back to user code and can be used for user data 
> visualization if user wants to do it.
> E.g.,
> {code:python}
> # ...
> p = beam.Pipeline(interactive_runner.InteractiveRunner(),
>                   options=pipeline_options)
> messages = p | "Read" >> beam.io.ReadFromPubSub(subscription='...')
> messages | "Write" >> beam.io.WriteToPubSub(topic_path)
> result = p.run()
> # ...
> visualize(messages){code}
>  The interactive runner automatically figures out that PCollection
> {code:python}
> messages{code}
> created by
> {code:python}
> p | "Read" >> beam.io.ReadFromPubSub(subscription='...'){code}
> should be cached and reused if the notebook user appends more transforms.
>  Once the pipeline is executed, the user can use any 
> visualize(PCollection) module to visualize the data statically (batch) or 
> dynamically (stream).
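
The difference between caching around "leaf" PCollections and caching around 
variable-bound PCollections can be sketched without Beam. The model below is 
hypothetical and heavily simplified: a pipeline is a dict of transforms with 
named inputs and outputs, and a PCollection is assumed to carry the same name 
as the user variable it is bound to. It is not Beam's graph API.

```python
# Hypothetical, simplified model of a pipeline graph: each transform lists the
# names of the PCollections it consumes and produces. Not Beam's real graph API.
TRANSFORMS = {
    'Read': {'inputs': [], 'outputs': ['messages']},
    'Write': {'inputs': ['messages'], 'outputs': []},
}


def produced_pcolls(transforms):
  """All PCollections produced by any transform."""
  return {out for t in transforms.values() for out in t['outputs']}


def leaf_pcolls(transforms):
  """PCollections never used as input by any transform."""
  consumed = {inp for t in transforms.values() for inp in t['inputs']}
  return produced_pcolls(transforms) - consumed


def variable_bound_pcolls(transforms, user_var_names):
  """PCollections that the user bound to a variable (names assumed to match)."""
  return produced_pcolls(transforms) & set(user_var_names)


leaves = leaf_pcolls(TRANSFORMS)
bound = variable_bound_pcolls(TRANSFORMS, ['p', 'messages', 'result'])
print(leaves, bound)
```

In this model the Read --> Write pipeline has no leaf at all, because 
messages is consumed by the sink, while the variable-bound rule still selects 
messages as something to cache.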



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-09-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=310094=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-310094
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 10/Sep/19 20:44
Start Date: 10/Sep/19 20:44
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9278: [BEAM-7760] 
Added Interactive Beam module
URL: https://github.com/apache/beam/pull/9278#discussion_r322953360
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/interactive_beam.py
 ##
 @@ -0,0 +1,88 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module of Interactive Beam features that can be used in notebook.
+
+The purpose of the module is to reduce the learning curve of Interactive Beam
+users, provide a single place for importing and add sugar syntax for all
+Interactive Beam components. It gives users capability to interact with existing
+environment/session/context for Interactive Beam and visualize PCollections as
+bounded dataset. In the meantime, it hides the interactivity implementation
+from users so that users can focus on developing Beam pipeline without worrying
+about how hidden states in the interactive session are managed.
+
+Note: Backward-compatibility of Interactive Beam is only guaranteed within this
+module. Please only invoke interfaces provided by this module in your notebook
+or application code if you want backward-compatibility.
+"""
+
+from apache_beam.runners.interactive import interactive_environment as ie
+
+
+def watch(watchable):
+  """Watches a watchable so that Interactive Beam understands your pipeline.
+
+  If you write Beam pipeline in a notebook or __main__ module directly, since
 
 Review comment:
   Thx for sharing the contact!
   Will wait for @rosetn's feedback to improve the comments before merging. 
ETA by this Friday.
   
   
 



Issue Time Tracking
---

Worklog Id: (was: 310094)
Time Spent: 4h 40m  (was: 4.5h)

> Interactive Beam Caching PCollections bound to user defined vars in notebook
> 
>
> Key: BEAM-7760
> URL: https://issues.apache.org/jira/browse/BEAM-7760
> Project: Beam
>  Issue Type: New Feature
>  Components: examples-python
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time Spent: 4h 40m
>  Remaining Estimate: 0h

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-09-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=309212=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309212
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 09/Sep/19 20:40
Start Date: 09/Sep/19 20:40
Worklog Time Spent: 10m 
  Work Description: aaltay commented on pull request #9278: [BEAM-7760] 
Added Interactive Beam module
URL: https://github.com/apache/beam/pull/9278#discussion_r322443957
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/interactive_beam_test.py
 ##
 @@ -0,0 +1,70 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Tests for apache_beam.runners.interactive.interactive_beam."""
+
+import importlib
+import unittest
+
+from apache_beam.runners.interactive import interactive_beam as ib
+from apache_beam.runners.interactive import interactive_environment as ie
+
+# The module name is also a variable in module.
+_module_name = 'apache_beam.runners.interactive.interactive_beam_test'
+
+
+class InteractiveBeamTest(unittest.TestCase):
+
+  def setUp(self):
+    self._var_in_class_instance = 'a var in class instance, not directly used'
+    ie.new_env()
+
+  def test_watch_main_by_default(self):
+    test_env = ie.InteractiveEnvironment()
+    # Current Interactive Beam env fetched and the test env are 2 instances.
+    self.assertNotEqual(id(ie.current_env()), id(test_env))
+    self.assertEqual(ie.current_env().watching(), test_env.watching())
+
+  def test_watch_a_module_by_name(self):
+    test_env = ie.InteractiveEnvironment()
+    ib.watch(_module_name)
+    test_env.watch(_module_name)
+    self.assertEqual(ie.current_env().watching(), test_env.watching())
 
 Review comment:
   Sounds good.
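
   The pattern these tests rely on (two separate environment instances that 
watch the same things should report equal watching() results) can be sketched 
with a hypothetical stand-in. SketchEnvironment below is invented for 
illustration and is not Beam's InteractiveEnvironment; it only assumes, as the 
docstring under review states, that __main__ is watched by default and that a 
module can be watched by name.

```python
import importlib


class SketchEnvironment(object):
  """Hypothetical, simplified environment; not Beam's InteractiveEnvironment."""

  def __init__(self):
    self._watched_modules = set()
    # Mirror the documented default: __main__ is always watched.
    self.watch('__main__')

  def watch(self, module_name):
    """Watches a module given its importable name."""
    self._watched_modules.add(importlib.import_module(module_name))

  def watching(self):
    """Returns the sorted names of all watched modules."""
    return sorted(m.__name__ for m in self._watched_modules)


env_a = SketchEnvironment()
env_b = SketchEnvironment()
env_a.watch('json')
env_b.watch('json')
print(env_a.watching() == env_b.watching())
```

   Comparing watching() rather than the instances themselves is what lets a 
test assert equivalent state across two distinct objects.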
 



Issue Time Tracking
---

Worklog Id: (was: 309212)
Time Spent: 4.5h  (was: 4h 20m)

> Interactive Beam Caching PCollections bound to user defined vars in notebook
> 
>
> Key: BEAM-7760
> URL: https://issues.apache.org/jira/browse/BEAM-7760
> Project: Beam
>  Issue Type: New Feature
>  Components: examples-python
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time Spent: 4.5h
>  Remaining Estimate: 0h

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-09-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=309176=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309176
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 09/Sep/19 20:09
Start Date: 09/Sep/19 20:09
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9278: [BEAM-7760] 
Added Interactive Beam module
URL: https://github.com/apache/beam/pull/9278#discussion_r322430733
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/interactive_environment.py
 ##
 @@ -0,0 +1,106 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module of the current Interactive Beam environment.
+
+Provides interfaces to interact with existing Interactive Beam environment.
+Internally used by Interactive Beam. External Interactive Beam users please use
 
 Review comment:
   Applied the convention.
 



Issue Time Tracking
---

Worklog Id: (was: 309176)
Time Spent: 4h 20m  (was: 4h 10m)

> Interactive Beam Caching PCollections bound to user defined vars in notebook
> 
>
> Key: BEAM-7760
> URL: https://issues.apache.org/jira/browse/BEAM-7760
> Project: Beam
>  Issue Type: New Feature
>  Components: examples-python
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h





[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-09-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=309171=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309171
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 09/Sep/19 20:00
Start Date: 09/Sep/19 20:00
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9278: [BEAM-7760] 
Added Interactive Beam module
URL: https://github.com/apache/beam/pull/9278#discussion_r322427468
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/interactive_beam.py
 ##
 @@ -0,0 +1,88 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module of Interactive Beam features that can be used in notebook.
+
+The purpose of the module is to reduce the learning curve of Interactive Beam
+users, provide a single place for importing and add sugar syntax for all
+Interactive Beam components. It gives users capability to interact with existing
+environment/session/context for Interactive Beam and visualize PCollections as
+bounded dataset. In the meantime, it hides the interactivity implementation
+from users so that users can focus on developing Beam pipeline without worrying
+about how hidden states in the interactive session are managed.
+
+Note: Backward-compatibility of Interactive Beam is only guaranteed within this
+module. Please only invoke interfaces provided by this module in your notebook
+or application code if you want backward-compatibility.
+"""
+
+from apache_beam.runners.interactive import interactive_environment as ie
+
+
+def watch(watchable):
+  """Watches a watchable so that Interactive Beam understands your pipeline.
+
+  If you write Beam pipeline in a notebook or __main__ module directly, since
 
 Review comment:
   Sure, I'll contact @rosetn directly to work on improving the documentation. 
Would you please share their LDAP or work email address with me so that I can 
reach out?
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 309171)
Time Spent: 4h 10m  (was: 4h)

> Interactive Beam Caching PCollections bound to user defined vars in notebook
> 
>
> Key: BEAM-7760
> URL: https://issues.apache.org/jira/browse/BEAM-7760
> Project: Beam
>  Issue Type: New Feature
>  Components: examples-python
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> Cache only PCollections bound to user defined variables in a pipeline when 
> running pipeline with interactive runner in jupyter notebooks.
> [Interactive 
> Beam|[https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive]]
>  has been caching and using caches of "leaf" PCollections for interactive 
> execution in jupyter notebooks.
> The interactive execution is currently supported so that when appending new 
> transforms to existing pipeline for a new run, executed part of the pipeline 
> doesn't need to be re-executed. 
> A PCollection is "leaf" when it is never used as input in any PTransform in 
> the pipeline.
> The problem with building caches and pipeline to execute around "leaf" is 
> that when a PCollection is consumed by a sink with no output, the pipeline to 
> execute built will miss the subgraph generating and consuming that 
> PCollection.
> For example, "ReadFromPubSub --> WriteToPubSub" will result in an empty 
> pipeline.
> Caching around PCollections bound to user defined variables and replacing 
> transforms with source and sink of caches could resolve the pipeline to 
> execute properly under the interactive execution scenario. Also, cached 
> PCollection now can trace back to user code and can be 

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-09-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=309146=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309146
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 09/Sep/19 18:39
Start Date: 09/Sep/19 18:39
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9278: [BEAM-7760] 
Added Interactive Beam module
URL: https://github.com/apache/beam/pull/9278#discussion_r322394566
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/interactive_beam_test.py
 ##
 @@ -0,0 +1,70 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Tests for apache_beam.runners.interactive.interactive_beam."""
+
+import importlib
+import unittest
+
+from apache_beam.runners.interactive import interactive_beam as ib
+from apache_beam.runners.interactive import interactive_environment as ie
+
+# The module name is also a variable in module.
+_module_name = 'apache_beam.runners.interactive.interactive_beam_test'
+
+
+class InteractiveBeamTest(unittest.TestCase):
+
+  def setUp(self):
+    self._var_in_class_instance = 'a var in class instance, not directly used'
+    ie.new_env()
+
+  def test_watch_main_by_default(self):
+    test_env = ie.InteractiveEnvironment()
+    # Current Interactive Beam env fetched and the test env are 2 instances.
+    self.assertNotEqual(id(ie.current_env()), id(test_env))
+    self.assertEqual(ie.current_env().watching(), test_env.watching())
+
+  def test_watch_a_module_by_name(self):
+    test_env = ie.InteractiveEnvironment()
+    ib.watch(_module_name)
+    test_env.watch(_module_name)
+    self.assertEqual(ie.current_env().watching(), test_env.watching())
 
 Review comment:
   These tests verify that the static watch() function in the interactive_beam 
module eventually produces the same watching() items as the member watch() 
function of the InteractiveEnvironment class in the interactive_environment 
module, provided the correct InteractiveEnvironment object is accessed.
   Whatever future changes are made to the wrapper in the interactive_beam 
module, it should keep this behavior.
   Currently, the tests basically check that "ie.current_env().watch(watchable)" 
works as intended: create the environment if absent, access the current 
environment and invoke the watch() logic for that environment. 
   They don't verify whether "watch(watchable)" generates the expected 
"watching()" items, because that is not the logic of this module.
   
   The unit tests cover the wrapped watch() logic in the interactive_beam module, 
where InteractiveEnvironment objects may be created and accessed. The real 
watch() logic within an InteractiveEnvironment object from the 
interactive_environment module is tested in its own unit test.
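
The delegation described in this comment can be sketched roughly as follows. This is a simplified stand-in, not the code under review: the class body, the `_current_env` global and the assumption that `__main__` is watched by default are illustrative guesses based only on the names used in the tests above.

```python
# Simplified stand-ins; the real classes live in interactive_environment.
class InteractiveEnvironment(object):
  """Toy environment that only records what it has been asked to watch."""

  def __init__(self):
    # Assumed default, suggested by the test name test_watch_main_by_default.
    self._watching = ['__main__']

  def watch(self, watchable):
    self._watching.append(watchable)

  def watching(self):
    return list(self._watching)


_current_env = None


def current_env():
  """Returns the current environment, creating one if absent."""
  global _current_env
  if _current_env is None:
    _current_env = InteractiveEnvironment()
  return _current_env


def watch(watchable):
  """Module-level wrapper: delegates to the current environment."""
  current_env().watch(watchable)


# The wrapper and a direct call on a separate environment end up equivalent.
watch('my_module')
reference_env = InteractiveEnvironment()
reference_env.watch('my_module')
```

Comparing `current_env().watching()` with `reference_env.watching()` then checks only that the wrapper reached an environment and called its `watch()`, which is the scope the comment claims for these tests.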
 



Issue Time Tracking
---

Worklog Id: (was: 309146)
Time Spent: 4h  (was: 3h 50m)

> Interactive Beam Caching PCollections bound to user defined vars in notebook
> 
>
> Key: BEAM-7760
> URL: https://issues.apache.org/jira/browse/BEAM-7760
> Project: Beam
>  Issue Type: New Feature
>  Components: examples-python
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time Spent: 4h
>  Remaining Estimate: 0h
>
> Cache only PCollections bound to user defined variables in a pipeline when 
> running pipeline with interactive runner in jupyter notebooks.
> [Interactive 
> Beam|[https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive]]
>  has been caching and using caches of "leaf" PCollections for interactive 
> execution in jupyter notebooks.
> The interactive execution is currently supported so that when appending new 
> transforms 

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-09-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=309068=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309068
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 09/Sep/19 16:51
Start Date: 09/Sep/19 16:51
Worklog Time Spent: 10m 
  Work Description: aaltay commented on pull request #9278: [BEAM-7760] 
Added Interactive Beam module
URL: https://github.com/apache/beam/pull/9278#discussion_r322344925
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/interactive_environment.py
 ##
 @@ -0,0 +1,106 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module of the current Interactive Beam environment.
+
+Provides interfaces to interact with existing Interactive Beam environment.
+Internally used by Interactive Beam. External Interactive Beam users please use
 
 Review comment:
   For marking internal use, we generally use "For internal use only; no 
backwards-compatibility guarantees." on its own line.
   
   See: 
https://github.com/apache/beam/blob/51a13f0a8e5a1088f8a96942b4787d7da25e54dc/sdks/python/apache_beam/utils/counters.py#L23
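
As an illustration of the suggested convention, the docstring under review could carry the marker on a line of its own. The sketch below holds the docstring in a plain string so it can be checked; `MODULE_DOCSTRING` and the `is_marked_internal` helper are hypothetical names, not existing Beam tooling.

```python
# Marker text quoted from the review comment above.
INTERNAL_MARKER = (
    'For internal use only; no backwards-compatibility guarantees.')

# How the module docstring might read with the marker on its own line.
MODULE_DOCSTRING = """Module of the current Interactive Beam environment.

For internal use only; no backwards-compatibility guarantees.

Provides interfaces to interact with existing Interactive Beam environment.
"""


def is_marked_internal(docstring):
  """Returns True if the marker appears on a line by itself."""
  return any(
      line.strip() == INTERNAL_MARKER for line in docstring.splitlines())
```

Keeping the marker on its own line makes it easy to search for across the codebase.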
 



Issue Time Tracking
---

Worklog Id: (was: 309068)
Time Spent: 3h 50m  (was: 3h 40m)

> Interactive Beam Caching PCollections bound to user defined vars in notebook
> 
>
> Key: BEAM-7760
> URL: https://issues.apache.org/jira/browse/BEAM-7760
> Project: Beam
>  Issue Type: New Feature
>  Components: examples-python
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> Cache only PCollections bound to user defined variables in a pipeline when 
> running pipeline with interactive runner in jupyter notebooks.
> [Interactive 
> Beam|[https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive]]
>  has been caching and using caches of "leaf" PCollections for interactive 
> execution in jupyter notebooks.
> The interactive execution is currently supported so that when appending new 
> transforms to existing pipeline for a new run, executed part of the pipeline 
> doesn't need to be re-executed. 
> A PCollection is "leaf" when it is never used as input in any PTransform in 
> the pipeline.
> The problem with building caches and pipeline to execute around "leaf" is 
> that when a PCollection is consumed by a sink with no output, the pipeline to 
> execute built will miss the subgraph generating and consuming that 
> PCollection.
> For example, "ReadFromPubSub --> WriteToPubSub" will result in an empty 
> pipeline.
> Caching around PCollections bound to user defined variables and replacing 
> transforms with source and sink of caches could resolve the pipeline to 
> execute properly under the interactive execution scenario. Also, cached 
> PCollection now can trace back to user code and can be used for user data 
> visualization if user wants to do it.
> E.g.,
> {code:python}
> # ...
> p = beam.Pipeline(interactive_runner.InteractiveRunner(),
>                   options=pipeline_options)
> messages = p | "Read" >> beam.io.ReadFromPubSub(subscription='...')
> messages | "Write" >> beam.io.WriteToPubSub(topic_path)
> result = p.run()
> # ...
> visualize(messages){code}
>  The interactive runner automatically figures out that PCollection
> {code:python}
> messages{code}
> created by
> {code:python}
> p | "Read" >> beam.io.ReadFromPubSub(subscription='...'){code}
> should be cached and reused if the notebook user appends more transforms.
>  And once the pipeline gets executed, the user could use any 
> visualize(PCollection) module to visualize the data 

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-09-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=309067=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309067
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 09/Sep/19 16:51
Start Date: 09/Sep/19 16:51
Worklog Time Spent: 10m 
  Work Description: aaltay commented on pull request #9278: [BEAM-7760] 
Added Interactive Beam module
URL: https://github.com/apache/beam/pull/9278#discussion_r322344323
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/interactive_beam_test.py
 ##
 @@ -0,0 +1,70 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Tests for apache_beam.runners.interactive.interactive_beam."""
+
+import importlib
+import unittest
+
+from apache_beam.runners.interactive import interactive_beam as ib
+from apache_beam.runners.interactive import interactive_environment as ie
+
+# The module name is also a variable in module.
+_module_name = 'apache_beam.runners.interactive.interactive_beam_test'
+
+
+class InteractiveBeamTest(unittest.TestCase):
+
+  def setUp(self):
+    self._var_in_class_instance = 'a var in class instance, not directly used'
+    ie.new_env()
+
+  def test_watch_main_by_default(self):
+    test_env = ie.InteractiveEnvironment()
+    # Current Interactive Beam env fetched and the test env are 2 instances.
+    self.assertNotEqual(id(ie.current_env()), id(test_env))
+    self.assertEqual(ie.current_env().watching(), test_env.watching())
+
+  def test_watch_a_module_by_name(self):
+    test_env = ie.InteractiveEnvironment()
+    ib.watch(_module_name)
+    test_env.watch(_module_name)
+    self.assertEqual(ie.current_env().watching(), test_env.watching())
 
 Review comment:
   What is really tested here? We call watch on two environments and check that 
they have the same results. It is not checking whether the _module_name is in 
the watch set or not as the test name implies. (Same comment for other 
assertEqual's in this file.)
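
One way to address this concern is to assert on the watched set directly, since a comparison between two environments would still pass if `watch()` did nothing in both. The sketch below uses a hypothetical `FakeEnvironment` whose `watching()` returns a set of names; the real return type of `watching()` is not shown in this thread, so that shape is an assumption.

```python
import unittest


class FakeEnvironment(object):
  """Hypothetical stand-in that records watched module names in a set."""

  def __init__(self):
    self._watched = set()

  def watch(self, name):
    self._watched.add(name)

  def watching(self):
    return set(self._watched)


class WatchTest(unittest.TestCase):

  def test_watch_a_module_by_name(self):
    env = FakeEnvironment()
    env.watch('some.module')
    # Asserts on the observable result, not on agreement with another env.
    self.assertIn('some.module', env.watching())


# Run the test case programmatically so the outcome can be inspected.
suite = unittest.TestLoader().loadTestsFromTestCase(WatchTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```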
 



Issue Time Tracking
---

Worklog Id: (was: 309067)
Time Spent: 3h 40m  (was: 3.5h)

> Interactive Beam Caching PCollections bound to user defined vars in notebook
> 
>
> Key: BEAM-7760
> URL: https://issues.apache.org/jira/browse/BEAM-7760
> Project: Beam
>  Issue Type: New Feature
>  Components: examples-python
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> Cache only PCollections bound to user defined variables in a pipeline when 
> running pipeline with interactive runner in jupyter notebooks.
> [Interactive 
> Beam|[https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive]]
>  has been caching and using caches of "leaf" PCollections for interactive 
> execution in jupyter notebooks.
> The interactive execution is currently supported so that when appending new 
> transforms to existing pipeline for a new run, executed part of the pipeline 
> doesn't need to be re-executed. 
> A PCollection is "leaf" when it is never used as input in any PTransform in 
> the pipeline.
> The problem with building caches and pipeline to execute around "leaf" is 
> that when a PCollection is consumed by a sink with no output, the pipeline to 
> execute built will miss the subgraph generating and consuming that 
> PCollection.
> For example, "ReadFromPubSub --> WriteToPubSub" will result in an empty 
> pipeline.
> Caching around PCollections bound to user defined variables and replacing 
> transforms with source and sink of caches could resolve the pipeline to 
> execute properly under the interactive execution scenario. 

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-09-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=309066=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309066
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 09/Sep/19 16:51
Start Date: 09/Sep/19 16:51
Worklog Time Spent: 10m 
  Work Description: aaltay commented on pull request #9278: [BEAM-7760] 
Added Interactive Beam module
URL: https://github.com/apache/beam/pull/9278#discussion_r322342679
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/interactive_beam.py
 ##
 @@ -0,0 +1,88 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module of Interactive Beam features that can be used in notebook.
+
+The purpose of the module is to reduce the learning curve of Interactive Beam
+users, provide a single place for importing and add sugar syntax for all
+Interactive Beam components. It gives users capability to interact with existing
+environment/session/context for Interactive Beam and visualize PCollections as
+bounded dataset. In the meantime, it hides the interactivity implementation
+from users so that users can focus on developing Beam pipeline without worrying
+about how hidden states in the interactive session are managed.
+
+Note: Backward-compatibility of Interactive Beam is only guaranteed within this
+module. Please only invoke interfaces provided by this module in your notebook
+or application code if you want backward-compatibility.
+"""
+
+from apache_beam.runners.interactive import interactive_environment as ie
+
+
+def watch(watchable):
+  """Watches a watchable so that Interactive Beam understands your pipeline.
+
+  If you write Beam pipeline in a notebook or __main__ module directly, since
 
 Review comment:
   For a follow up PR, I will suggest working with @rosetn to improve this 
comment. This comment will serve as documentation, and it will help to get it 
reviewed by tech writers.
 



Issue Time Tracking
---

Worklog Id: (was: 309066)
Time Spent: 3.5h  (was: 3h 20m)

> Interactive Beam Caching PCollections bound to user defined vars in notebook
> 
>
> Key: BEAM-7760
> URL: https://issues.apache.org/jira/browse/BEAM-7760
> Project: Beam
>  Issue Type: New Feature
>  Components: examples-python
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> Cache only PCollections bound to user defined variables in a pipeline when 
> running pipeline with interactive runner in jupyter notebooks.
> [Interactive 
> Beam|[https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive]]
>  has been caching and using caches of "leaf" PCollections for interactive 
> execution in jupyter notebooks.
> The interactive execution is currently supported so that when appending new 
> transforms to existing pipeline for a new run, executed part of the pipeline 
> doesn't need to be re-executed. 
> A PCollection is "leaf" when it is never used as input in any PTransform in 
> the pipeline.
> The problem with building caches and pipeline to execute around "leaf" is 
> that when a PCollection is consumed by a sink with no output, the pipeline to 
> execute built will miss the subgraph generating and consuming that 
> PCollection.
> For example, "ReadFromPubSub --> WriteToPubSub" will result in an empty 
> pipeline.
> Caching around PCollections bound to user defined variables and replacing 
> transforms with source and sink of caches could resolve the pipeline to 
> execute properly under the interactive execution scenario. Also, cached 
> PCollection now can trace back to 

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-08-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=304599=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-304599
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 30/Aug/19 21:10
Start Date: 30/Aug/19 21:10
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on issue #9278: [BEAM-7760] Added 
Interactive Beam module
URL: https://github.com/apache/beam/pull/9278#issuecomment-526751231
 
 
   Based on our offline discussion, we've decided not to make the 
create_pipeline() and run_pipeline() APIs part of the interactive_beam module, 
but to incorporate these features into the existing beam module. I've updated 
the design doc: 
https://docs.google.com/document/d/1DYWrT6GL_qDCXhRMoxpjinlVAfHeVilK5Mtf8gO6zxQ/edit?usp=sharing
   
   I've also removed the related code and tests from this PR.
   
   In future beam changes, 
   1. the beam.Pipeline() constructor would by default use 
runner=InteractiveRunner() when 
`apache_beam.runners.interactive.interactive_beam` module has been imported.
   2. the pipeline.run() function would be able to take in runner=? and 
options=? to run existing pipeline with selected runner and options. 'runner' 
option would support both string name and runner object.
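
A rough sketch of how the second point might behave, under stated assumptions. Everything here is hypothetical: the toy `Pipeline`, `resolve_runner` and the name registry are invented for illustration, the two runner classes only borrow familiar names, and the default runner is chosen unconditionally rather than depending on whether `interactive_beam` was imported.

```python
# Hypothetical sketch; none of these definitions are Beam's real API.
class DirectRunner(object):
  pass


class InteractiveRunner(object):
  pass


_RUNNERS_BY_NAME = {
    'DirectRunner': DirectRunner,
    'InteractiveRunner': InteractiveRunner,
}


def resolve_runner(runner):
  """Accepts either a registered runner name or a runner object."""
  if isinstance(runner, str):
    try:
      return _RUNNERS_BY_NAME[runner]()
    except KeyError:
      raise ValueError('Unknown runner name: %r' % runner)
  return runner


class Pipeline(object):
  """Toy pipeline whose run() can override the runner per call."""

  def __init__(self, runner=None):
    self._runner = resolve_runner(runner or 'InteractiveRunner')

  def run(self, runner=None, options=None):
    chosen = resolve_runner(runner) if runner is not None else self._runner
    # A real run would execute the graph; here we only report the choice.
    return type(chosen).__name__, options


p = Pipeline()
default_run = p.run()
override_run = p.run(runner='DirectRunner', options={'streaming': False})
```

The pipeline is built once with the interactive default and can then be run elsewhere by passing either a name or an instance to `run()`.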
   
 



Issue Time Tracking
---

Worklog Id: (was: 304599)
Time Spent: 3h 20m  (was: 3h 10m)

> Interactive Beam Caching PCollections bound to user defined vars in notebook
> 
>
> Key: BEAM-7760
> URL: https://issues.apache.org/jira/browse/BEAM-7760
> Project: Beam
>  Issue Type: New Feature
>  Components: examples-python
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> Cache only PCollections bound to user defined variables in a pipeline when 
> running pipeline with interactive runner in jupyter notebooks.
> [Interactive 
> Beam|[https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive]]
>  has been caching and using caches of "leaf" PCollections for interactive 
> execution in jupyter notebooks.
> The interactive execution is currently supported so that when appending new 
> transforms to existing pipeline for a new run, executed part of the pipeline 
> doesn't need to be re-executed. 
> A PCollection is "leaf" when it is never used as input in any PTransform in 
> the pipeline.
> The problem with building caches and pipeline to execute around "leaf" is 
> that when a PCollection is consumed by a sink with no output, the pipeline to 
> execute built will miss the subgraph generating and consuming that 
> PCollection.
> For example, "ReadFromPubSub --> WriteToPubSub" will result in an empty 
> pipeline.
> Caching around PCollections bound to user defined variables and replacing 
> transforms with source and sink of caches could resolve the pipeline to 
> execute properly under the interactive execution scenario. Also, cached 
> PCollection now can trace back to user code and can be used for user data 
> visualization if user wants to do it.
> E.g.,
> {code:python}
> # ...
> p = beam.Pipeline(interactive_runner.InteractiveRunner(),
>                   options=pipeline_options)
> messages = p | "Read" >> beam.io.ReadFromPubSub(subscription='...')
> messages | "Write" >> beam.io.WriteToPubSub(topic_path)
> result = p.run()
> # ...
> visualize(messages){code}
>  The interactive runner automatically figures out that PCollection
> {code:python}
> messages{code}
> created by
> {code:python}
> p | "Read" >> beam.io.ReadFromPubSub(subscription='...'){code}
> should be cached and reused if the notebook user appends more transforms.
>  And once the pipeline gets executed, the user could use any 
> visualize(PCollection) module to visualize the data statically (batch) or 
> dynamically (stream)





[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-08-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=299769=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-299769
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 22/Aug/19 21:15
Start Date: 22/Aug/19 21:15
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9278: [BEAM-7760] 
Added Interactive Beam module
URL: https://github.com/apache/beam/pull/9278
 
 
   1. Added interactive_beam module that provides syntactic sugar and
   shorthand functions to apply interactivity, create an iBeam pipeline,
   visualize PCollection data and execute an iBeam pipeline as a normal
   pipeline with selected Beam runners without interactivity.
   2. This commit implemented the implicitly managed Interactive Beam
   environment to track the definitions of user pipelines. It exposed a
   watch() interface for users to explicitly tell Interactive Beam the
   whereabouts of their pipeline definition when it's not in __main__.
   3. This commit implemented a shorthand function create_pipeline() to
   create a pipeline that is backed by the direct runner with interactivity
   when running.
   4. This commit also implemented a shorthand function run_pipeline() to
   run a pipeline created with interactivity on a different runner and
   pipeline options without interactivity. It's useful when interactivity
   is not needed and a one-shot run in a production-like environment is
   desired.
   5. This commit exposed a PCollection data exploration interface
   visualize(). Implementation is yet to be added.
   6. Added interactive_environment module for internal usage without
   backward-compatibility guarantees. It holds the cache manager and
   watchable metadata for the current interactive
   environment/session/context. Interfaces are provided to interact with
   the environment and its components.
   7. Unit tests included.
   
   
   Thank you for your contribution! Follow this checklist to help us 
incorporate your contribution quickly and easily:
   
- [x] [**Choose 
reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and 
mention them in a comment (`R: @username`).
- [x] Format the pull request title like `[BEAM-XXX] Fixes bug in 
ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA 
issue, if applicable. This will automatically link the pull request to the 
issue.
- [x] If this contribution is large, please file an Apache [Individual 
Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
   

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-08-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=299749=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-299749
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 22/Aug/19 20:46
Start Date: 22/Aug/19 20:46
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9278: [BEAM-7760] 
Added Interactive Beam module
URL: https://github.com/apache/beam/pull/9278
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 299749)
Time Spent: 3h  (was: 2h 50m)

> Interactive Beam Caching PCollections bound to user defined vars in notebook
> 
>
> Key: BEAM-7760
> URL: https://issues.apache.org/jira/browse/BEAM-7760
> Project: Beam
> Issue Type: New Feature
> Components: examples-python
> Reporter: Ning Kang
> Assignee: Ning Kang
> Priority: Major
> Time Spent: 3h
> Remaining Estimate: 0h
>
> Cache only PCollections bound to user-defined variables in a pipeline when 
> running the pipeline with the interactive runner in Jupyter notebooks.
> [Interactive 
> Beam|[https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive]]
>  has been caching and reusing caches of "leaf" PCollections for interactive 
> execution in Jupyter notebooks.
> Interactive execution currently works so that when new transforms are 
> appended to an existing pipeline for a new run, the already-executed part of 
> the pipeline doesn't need to be re-executed. 
> A PCollection is a "leaf" when it is never used as input to any PTransform in 
> the pipeline.
> The problem with building caches and the pipeline to execute around "leaf" 
> PCollections is that when a PCollection is consumed by a sink with no output, 
> the pipeline built for execution will miss the subgraph generating and 
> consuming that PCollection.
> For example, "ReadFromPubSub --> WriteToPubSub" will result in an empty 
> pipeline.
> Caching around PCollections bound to user-defined variables and replacing 
> transforms with sources and sinks of caches could build the pipeline to 
> execute properly under the interactive execution scenario. Also, a cached 
> PCollection can now be traced back to user code and used for data 
> visualization if the user wants it.
> E.g.,
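> {code:python}
> import apache_beam as beam
> from apache_beam.runners.interactive import interactive_runner
> # ...
> p = beam.Pipeline(interactive_runner.InteractiveRunner(),
>                   options=pipeline_options)
> # Bind the PCollection to a user-defined variable so it can be cached.
> messages = p | "Read" >> beam.io.ReadFromPubSub(subscription='...')
> messages | "Write" >> beam.io.WriteToPubSub(topic_path)
> result = p.run()
> # ...
> visualize(messages){code}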
>  The interactive runner automatically figures out that the PCollection 
> {{messages}}, created by 
> {{p | "Read" >> beam.io.ReadFromPubSub(subscription='...')}}, should be 
> cached and reused if the notebook user appends more transforms.
>  Once the pipeline has been executed, the user could use any 
> visualize(PCollection) module to visualize the data statically (batch) or 
> dynamically (stream).
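
The difference between the two caching strategies in the quoted description can be sketched on a toy graph. This is only an illustration under stated assumptions, not Interactive Beam's implementation: `Transform`, `leaf_pcollections`, and `user_bound_pcollections` are hypothetical names, PCollections are modeled as plain strings, and the notebook's variables are modeled as a dict.

```python
from dataclasses import dataclass, field


@dataclass
class Transform:
    """Hypothetical stand-in for a PTransform applied in a pipeline graph."""
    label: str
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)


def leaf_pcollections(transforms):
    """Return PCollections that no transform consumes as input."""
    produced = {pc for t in transforms for pc in t.outputs}
    consumed = {pc for t in transforms for pc in t.inputs}
    return produced - consumed


def user_bound_pcollections(transforms, namespace):
    """Return PCollections that are the value of some variable in `namespace`."""
    produced = {pc for t in transforms for pc in t.outputs}
    return {
        value for value in namespace.values()
        if isinstance(value, str) and value in produced
    }


# "ReadFromPubSub --> WriteToPubSub": the sink consumes the Read output
# and produces no output PCollection of its own.
graph = [
    Transform("Read", inputs=[], outputs=["messages_pcoll"]),
    Transform("Write", inputs=["messages_pcoll"], outputs=[]),
]

# Assumed model of the notebook namespace: `messages` is bound to the Read output.
notebook_vars = {"messages": "messages_pcoll", "topic_path": "projects/p/topics/t"}

leaves = leaf_pcollections(graph)
bound = user_bound_pcollections(graph, notebook_vars)
print(leaves)  # set(): nothing to cache, hence the "empty pipeline"
print(bound)   # {'messages_pcoll'}
```

In this toy model the leaf-based strategy finds nothing to build the pipeline around, while the variable-based strategy still identifies the PCollection the user can refer to.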



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-08-13 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=294288=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-294288
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 13/Aug/19 23:59
Start Date: 13/Aug/19 23:59
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on issue #9278: [BEAM-7760] Added 
iBeam module
URL: https://github.com/apache/beam/pull/9278#issuecomment-521052414
 
 
   Hi all, I'll leave another 3 days for 
[design](https://docs.google.com/document/d/1DYWrT6GL_qDCXhRMoxpjinlVAfHeVilK5Mtf8gO6zxQ/edit?usp=sharing)
 review. Then we can hold a vote if there are no objections.
   
   Thanks!
 



Issue Time Tracking
---

Worklog Id: (was: 294288)
Time Spent: 2h 50m  (was: 2h 40m)




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-08-09 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=292216=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-292216
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 09/Aug/19 18:41
Start Date: 09/Aug/19 18:41
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on issue #9278: [BEAM-7760] Added 
iBeam module
URL: https://github.com/apache/beam/pull/9278#issuecomment-520023376
 
 
   Cool! I won't touch the README yet, since I'm constructing the building 
blocks of a new iBeam without integrating them into (and thus changing) the 
behavior of the existing iBeam. Once I make those integrations and changes, 
I'll update the README as the changes land.
   
   For a broader design document, I composed a globally visible 
[design](https://docs.google.com/document/d/1DYWrT6GL_qDCXhRMoxpjinlVAfHeVilK5Mtf8gO6zxQ/edit?usp=sharing)
 overview describing the changes we are making to components around the 
interactive runner. I'll share the document in our email thread too.
   
   Since the interactive runner is not yet a recognized runner in the Beam SDK 
(it's fundamentally a wrapper around the direct runner), we are not touching 
any Beam SDK components. We won't change any behavior of the existing Beam SDK, 
and we'll try our best to keep it that way in the future.
   
   In the meantime, I'll work on other components orthogonal to Beam, such as 
the Pipeline Display and Data Visualization components I mentioned in the 
design overview.
   
   Thanks!
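
As a rough illustration of the "wrapper around direct runner" idea in the comment above, the sketch below shows a runner that delegates execution to an underlying runner and reuses cached results when transforms are appended and the pipeline is run again. It does not use any Beam API: `ToyDirectRunner`, `ToyInteractiveRunner`, and the label-keyed cache are all hypothetical simplifications.

```python
class ToyDirectRunner:
    """Hypothetical underlying runner: computes every step it is asked to run."""

    def __init__(self):
        self.executed = []  # labels of the steps actually computed

    def run_step(self, label, fn, upstream):
        self.executed.append(label)
        return fn(upstream)


class ToyInteractiveRunner:
    """Hypothetical wrapper: delegates to the underlying runner and caches results."""

    def __init__(self, underlying):
        self._underlying = underlying
        self._cache = {}  # step label -> cached result

    def run(self, steps):
        """Run a linear list of (label, fn) steps, reusing cached results."""
        data = None
        for label, fn in steps:
            if label not in self._cache:
                self._cache[label] = self._underlying.run_step(label, fn, data)
            data = self._cache[label]
        return data


direct = ToyDirectRunner()
runner = ToyInteractiveRunner(direct)

steps = [
    ("Read", lambda _: [1, 2, 3]),
    ("Double", lambda xs: [x * 2 for x in xs]),
]
first = runner.run(steps)

# The notebook user appends one more transform and runs again:
# only the new step is delegated to the underlying runner.
steps.append(("Sum", lambda xs: sum(xs)))
second = runner.run(steps)
```

Keying the cache by step label keeps the sketch short; a cache keyed this way would go stale if a step were redefined under the same label, which a real implementation would need to handle.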
 



Issue Time Tracking
---

Worklog Id: (was: 292216)
Time Spent: 2h 40m  (was: 2.5h)




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

