[jira] [Created] (BEAM-8520) Expansion of _CombinePerKeyWithHotKeyFanout has no typehints

2019-10-30 Thread Yifan Mai (Jira)
Yifan Mai created BEAM-8520:
---

 Summary: Expansion of _CombinePerKeyWithHotKeyFanout has no 
typehints
 Key: BEAM-8520
 URL: https://issues.apache.org/jira/browse/BEAM-8520
 Project: Beam
  Issue Type: Bug
  Components: sdk-py-core
Reporter: Yifan Mai


The expansion of 
[_CombinePerKeyWithHotKeyFanout|https://github.com/apache/beam/blob/bd09a038f4df896ef6d903a4944083fcd2e8f727/sdks/python/apache_beam/transforms/core.py#L1913],
 which is used by 
[CombinePerKey.with_hot_key_fanout()|https://github.com/apache/beam/blob/bd09a038f4df896ef6d903a4944083fcd2e8f727/sdks/python/apache_beam/transforms/core.py#L1734],
 uses DoFns (SplitHotCold) and CombineFns (PreCombineFn, PostCombineFn) and a 
lambda (SplitNonce) that have no typehints. If the user specified a typehints 
on the original CombineFn, the typehints are lost. This means that the fallback 
coder rather than the user specified coder.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-8381) Typehints on DoFn subclass are incorrectly set on superclass

2019-10-30 Thread Yifan Mai (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yifan Mai updated BEAM-8381:

Summary: Typehints on DoFn subclass are incorrectly set on superclass  
(was: Typehints on DoFn subclass are incorrect set on superclass)

> Typehints on DoFn subclass are incorrectly set on superclass
> 
>
> Key: BEAM-8381
> URL: https://issues.apache.org/jira/browse/BEAM-8381
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Yifan Mai
>Priority: Minor
>
> Suppose a parent DoFn with typehints is subclassed into a child DoFn with 
> different typehints. The typehints will be incorrectly set on the parent DoFn 
> class instead of the child DoFn class, causing type checking errors. This 
> only happens if the parent already has typehints; if the parent has no 
> typehints, then the typehints are correctly set on the child.
> Here's a example test case. I would expect this example to run successfully, 
> but instead it fails with {{apache_beam.typehints.decorators.TypeCheckError: 
> Type hint violation for 'AddOneAndStringify': requires  but got 
>  for element}}.
> {code}
> def test_do_fn_pipeline_pipeline_type_check_satisfied_with_subclassing(self):
>   @with_input_types(int)
>   @with_output_types(int)
>   class AddOne(beam.DoFn):
> def add_one(elements):
>   return element + 1
> def process(self, element):
>   return [add_one(element)]
>   @with_input_types(int)
>   @with_output_types(str)
>   class AddOneAndStringify(AddOne):
> def process(self, element):
>   return [str(add_one(element))]
>   d = (self.p
>| 'T' >> beam.Create([1, 2, 3]).with_output_types(int)
>| 'AddOne' >> beam.ParDo(AddOne())
>| 'AddOneAndStringify' >> beam.ParDo(AddOneAndStringify()))
>   assert_that(d, equal_to(['3', '4', '5']))
>   self.p.run()
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-8519) Typehint for Python CombineFn Accumulator

2019-10-30 Thread Yifan Mai (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yifan Mai updated BEAM-8519:

Description: 
Feature request:
Allow users to specify a typehint for the accumulator type of CombineFn, either 
through a typehints decorator, or by using Python type annotations on the 
CombineFn methods, or through inheriting a CombineFn [generic 
type|https://docs.python.org/3/library/typing.html#user-defined-generic-types].

Benefits:
 * Allow the user to [specify a more efficient 
coder|https://beam.apache.org/documentation/programming-guide/#data-encoding-and-type-safety]
 for accumulators if one is available. Currently users are able to do this for 
the input and output elements, but not the accumulators, which [always uses the 
fallback 
coder|https://github.com/apache/beam/blob/d664592ee761d45b0273edb027a18c6ed9340349/sdks/python/apache_beam/transforms/core.py#L859-L860].
 * Allows typechecking of methods for CombineFn e.g. check that 
create_accumulator(), add_input() and merge_accumulators() returns 
accumulators. This provides better developer ergonomics as the CombineFnAPI is 
non-trivial.

  was:
Feature request:
Allow users to specify a typehint for the accumulator type of CombineFn, either 
through a typehints decorator, or by using Python type annotations on the 
CombineFn methods, or through inheriting a CombineFn [generic 
type|https://docs.python.org/3/library/typing.html#user-defined-generic-types].

Benefits:
 * Allow the user to [specify a more efficient 
coder|https://beam.apache.org/documentation/programming-guide/#data-encoding-and-type-safety]
 for accumulators if one is available. Currently users are able to do this for 
the input and output elements, but not the accumulators, which [always uses the 
fallback 
coder|https://github.com/apache/beam/blob/d664592ee761d45b0273edb027a18c6ed9340349/sdks/python/apache_beam/transforms/core.py#L859-L860].
 * Allows typechecking of methods for CombineFn e.g. check that 
create_accumulator(), add_input() and merge_accumulators() returns 
accumulators. This provides better developer ergonomics as the CombineFnAPI is 
non-trivial


> Typehint for Python CombineFn Accumulator
> -
>
> Key: BEAM-8519
> URL: https://issues.apache.org/jira/browse/BEAM-8519
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-py-core
>Reporter: Yifan Mai
>Priority: Minor
>
> Feature request:
> Allow users to specify a typehint for the accumulator type of CombineFn, 
> either through a typehints decorator, or by using Python type annotations on 
> the CombineFn methods, or through inheriting a CombineFn [generic 
> type|https://docs.python.org/3/library/typing.html#user-defined-generic-types].
> Benefits:
>  * Allow the user to [specify a more efficient 
> coder|https://beam.apache.org/documentation/programming-guide/#data-encoding-and-type-safety]
>  for accumulators if one is available. Currently users are able to do this 
> for the input and output elements, but not the accumulators, which [always 
> uses the fallback 
> coder|https://github.com/apache/beam/blob/d664592ee761d45b0273edb027a18c6ed9340349/sdks/python/apache_beam/transforms/core.py#L859-L860].
>  * Allows typechecking of methods for CombineFn e.g. check that 
> create_accumulator(), add_input() and merge_accumulators() returns 
> accumulators. This provides better developer ergonomics as the CombineFnAPI 
> is non-trivial.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-8519) Typehint for Python CombineFn Accumulator

2019-10-30 Thread Yifan Mai (Jira)
Yifan Mai created BEAM-8519:
---

 Summary: Typehint for Python CombineFn Accumulator
 Key: BEAM-8519
 URL: https://issues.apache.org/jira/browse/BEAM-8519
 Project: Beam
  Issue Type: New Feature
  Components: sdk-py-core
Reporter: Yifan Mai


Feature request:
Allow users to specify a typehint for the accumulator type of CombineFn, either 
through a typehints decorator, or by using Python type annotations on the 
CombineFn methods, or through inheriting a CombineFn [generic 
type|https://docs.python.org/3/library/typing.html#user-defined-generic-types].

Benefits:
 * Allow the user to [specify a more efficient 
coder|https://beam.apache.org/documentation/programming-guide/#data-encoding-and-type-safety]
 for accumulators if one is available. Currently users are able to do this for 
the input and output elements, but not the accumulators, which [always uses the 
fallback 
coder|https://github.com/apache/beam/blob/d664592ee761d45b0273edb027a18c6ed9340349/sdks/python/apache_beam/transforms/core.py#L859-L860].
 * Allows typechecking of methods for CombineFn e.g. check that 
create_accumulator(), add_input() and merge_accumulators() returns 
accumulators. This provides better developer ergonomics as the CombineFnAPI is 
non-trivial



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-8381) Typehints on DoFn subclass are incorrect set on superclass

2019-10-10 Thread Yifan Mai (Jira)
Yifan Mai created BEAM-8381:
---

 Summary: Typehints on DoFn subclass are incorrect set on superclass
 Key: BEAM-8381
 URL: https://issues.apache.org/jira/browse/BEAM-8381
 Project: Beam
  Issue Type: Bug
  Components: sdk-py-core
Reporter: Yifan Mai


Suppose a parent DoFn with typehints is subclassed into a child DoFn with 
different typehints. The typehints will be incorrectly set on the parent DoFn 
class instead of the child DoFn class, causing type checking errors. This only 
happens if the parent already has typehints; if the parent has no typehints, 
then the typehints are correctly set on the child.

Here's a example test case. I would expect this example to run successfully, 
but instead it fails with {{apache_beam.typehints.decorators.TypeCheckError: 
Type hint violation for 'AddOneAndStringify': requires  but got 
 for element}}.

{code}
def test_do_fn_pipeline_pipeline_type_check_satisfied_with_subclassing(self):
  @with_input_types(int)
  @with_output_types(int)
  class AddOne(beam.DoFn):
def add_one(elements):
  return element + 1

def process(self, element):
  return [add_one(element)]

  @with_input_types(int)
  @with_output_types(str)
  class AddOneAndStringify(AddOne):
def process(self, element):
  return [str(add_one(element))]

  d = (self.p
   | 'T' >> beam.Create([1, 2, 3]).with_output_types(int)
   | 'AddOne' >> beam.ParDo(AddOne())
   | 'AddOneAndStringify' >> beam.ParDo(AddOneAndStringify()))

  assert_that(d, equal_to(['3', '4', '5']))
  self.p.run()
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (BEAM-3645) Support multi-process execution on the FnApiRunner

2019-09-06 Thread Yifan Mai (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-3645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924585#comment-16924585
 ] 

Yifan Mai edited comment on BEAM-3645 at 9/6/19 8:54 PM:
-

While testing this I noticed that the multi-process runner does not handle 
SIGINT gracefully. I added the repro steps to BEAM-8149.


was (Author: myffi...@gmail.com):
While testing this I noticed that the multi-process runner does not handle 
SIGINT gracefully. To reproduce, run wordcount.py using the "Run with 
multiprocessing mode" instructions from the comment above (in Python 3).

Expected: wordcount terminates gracefully when Ctrl-C is pressed during 
pipeline execution (similarly to default direct runner)
Actual: wordcount hangs forever after printing the following once per worker:

{code}
Exception in thread run_worker:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
  File "/usr/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
  File 
"/usr/local/google/home/yifanmai/venv/wordcount/lib/python3.6/site-packages/apache_beam/runners/portability/local_job_service.py",
 line 216, in run
'Worker subprocess exited with return code %s' % p.returncode)
RuntimeError: Worker subprocess exited with return code 1
{code}

> Support multi-process execution on the FnApiRunner
> --
>
> Key: BEAM-3645
> URL: https://issues.apache.org/jira/browse/BEAM-3645
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py-core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Charles Chen
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: 2.15.0
>
>  Time Spent: 35h 20m
>  Remaining Estimate: 0h
>
> https://issues.apache.org/jira/browse/BEAM-3644 gave us a 15x performance 
> gain over the previous DirectRunner.  We can do even better in multi-core 
> environments by supporting multi-process execution in the FnApiRunner, to 
> scale past Python GIL limitations.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (BEAM-8149) [FnApiRunner]multi-process runner does not terminate cleanly upon receiving SIGINT

2019-09-06 Thread Yifan Mai (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yifan Mai updated BEAM-8149:

Description: 
The multi-process runner does not handle SIGINT gracefully. To reproduce, run 
wordcount.py using the "Run with multiprocessing mode" instructions from the 
first comment in BEAM-3645 (in Python 3).

Expected: wordcount terminates gracefully when Ctrl-C is pressed during 
pipeline execution (similarly to default direct runner)
Actual: wordcount hangs forever after printing the following once per worker:

{code}
Exception in thread run_worker:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
  File "/usr/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
  File 
"/usr/local/google/home/yifanmai/venv/wordcount/lib/python3.6/site-packages/apache_beam/runners/portability/local_job_service.py",
 line 216, in run
'Worker subprocess exited with return code %s' % p.returncode)
RuntimeError: Worker subprocess exited with return code 1
{code}

  was:
The multi-process runner does not handle SIGINT gracefully. To reproduce, run 
wordcount.py using the "Run with multiprocessing mode" instructions from the 
comment above (in Python 3).

Expected: wordcount terminates gracefully when Ctrl-C is pressed during 
pipeline execution (similarly to default direct runner)
Actual: wordcount hangs forever after printing the following once per worker:

{code}
Exception in thread run_worker:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
  File "/usr/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
  File 
"/usr/local/google/home/yifanmai/venv/wordcount/lib/python3.6/site-packages/apache_beam/runners/portability/local_job_service.py",
 line 216, in run
'Worker subprocess exited with return code %s' % p.returncode)
RuntimeError: Worker subprocess exited with return code 1
{code}


> [FnApiRunner]multi-process runner does not terminate cleanly upon receiving 
> SIGINT
> --
>
> Key: BEAM-8149
> URL: https://issues.apache.org/jira/browse/BEAM-8149
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Affects Versions: 2.15.0
>Reporter: Hannah Jiang
>Assignee: Hannah Jiang
>Priority: Major
>
> The multi-process runner does not handle SIGINT gracefully. To reproduce, run 
> wordcount.py using the "Run with multiprocessing mode" instructions from the 
> first comment in BEAM-3645 (in Python 3).
> Expected: wordcount terminates gracefully when Ctrl-C is pressed during 
> pipeline execution (similarly to default direct runner)
> Actual: wordcount hangs forever after printing the following once per worker:
> {code}
> Exception in thread run_worker:
> Traceback (most recent call last):
>   File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
> self.run()
>   File "/usr/lib/python3.6/threading.py", line 864, in run
> self._target(*self._args, **self._kwargs)
>   File 
> "/usr/local/google/home/yifanmai/venv/wordcount/lib/python3.6/site-packages/apache_beam/runners/portability/local_job_service.py",
>  line 216, in run
> 'Worker subprocess exited with return code %s' % p.returncode)
> RuntimeError: Worker subprocess exited with return code 1
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (BEAM-8149) [FnApiRunner]multi-process runner does not terminate cleanly upon receiving SIGINT

2019-09-06 Thread Yifan Mai (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yifan Mai updated BEAM-8149:

Description: 
The multi-process runner does not handle SIGINT gracefully. To reproduce, run 
wordcount.py using the "Run with multiprocessing mode" instructions from the 
comment above (in Python 3).

Expected: wordcount terminates gracefully when Ctrl-C is pressed during 
pipeline execution (similarly to default direct runner)
Actual: wordcount hangs forever after printing the following once per worker:

{code}
Exception in thread run_worker:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
  File "/usr/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
  File 
"/usr/local/google/home/yifanmai/venv/wordcount/lib/python3.6/site-packages/apache_beam/runners/portability/local_job_service.py",
 line 216, in run
'Worker subprocess exited with return code %s' % p.returncode)
RuntimeError: Worker subprocess exited with return code 1
{code}

> [FnApiRunner]multi-process runner does not terminate cleanly upon receiving 
> SIGINT
> --
>
> Key: BEAM-8149
> URL: https://issues.apache.org/jira/browse/BEAM-8149
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Affects Versions: 2.15.0
>Reporter: Hannah Jiang
>Assignee: Hannah Jiang
>Priority: Major
>
> The multi-process runner does not handle SIGINT gracefully. To reproduce, run 
> wordcount.py using the "Run with multiprocessing mode" instructions from the 
> comment above (in Python 3).
> Expected: wordcount terminates gracefully when Ctrl-C is pressed during 
> pipeline execution (similarly to default direct runner)
> Actual: wordcount hangs forever after printing the following once per worker:
> {code}
> Exception in thread run_worker:
> Traceback (most recent call last):
>   File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
> self.run()
>   File "/usr/lib/python3.6/threading.py", line 864, in run
> self._target(*self._args, **self._kwargs)
>   File 
> "/usr/local/google/home/yifanmai/venv/wordcount/lib/python3.6/site-packages/apache_beam/runners/portability/local_job_service.py",
>  line 216, in run
> 'Worker subprocess exited with return code %s' % p.returncode)
> RuntimeError: Worker subprocess exited with return code 1
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (BEAM-3645) Support multi-process execution on the FnApiRunner

2019-09-06 Thread Yifan Mai (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-3645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924585#comment-16924585
 ] 

Yifan Mai commented on BEAM-3645:
-

While testing this I noticed that the multi-process runner does not handle 
SIGINT gracefully. To reproduce, run wordcount.py using the "Run with 
multiprocessing mode" instructions from the comment above (in Python 3).

Expected: wordcount terminates gracefully when Ctrl-C is pressed during 
pipeline execution (similarly to default direct runner)
Actual: wordcount hangs forever after printing the following once per worker:

{code}
Exception in thread run_worker:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
  File "/usr/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
  File 
"/usr/local/google/home/yifanmai/venv/wordcount/lib/python3.6/site-packages/apache_beam/runners/portability/local_job_service.py",
 line 216, in run
'Worker subprocess exited with return code %s' % p.returncode)
RuntimeError: Worker subprocess exited with return code 1
{code}

> Support multi-process execution on the FnApiRunner
> --
>
> Key: BEAM-3645
> URL: https://issues.apache.org/jira/browse/BEAM-3645
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py-core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Charles Chen
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: 2.15.0
>
>  Time Spent: 35h 20m
>  Remaining Estimate: 0h
>
> https://issues.apache.org/jira/browse/BEAM-3644 gave us a 15x performance 
> gain over the previous DirectRunner.  We can do even better in multi-core 
> environments by supporting multi-process execution in the FnApiRunner, to 
> scale past Python GIL limitations.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Comment Edited] (BEAM-529) Check immutability violations in DirectPipelineRunner

2019-06-05 Thread Yifan Mai (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16856777#comment-16856777
 ] 

Yifan Mai edited comment on BEAM-529 at 6/5/19 3:06 PM:


Sorry, I haven't captured the proposal on JIRA yet.

The general idea to have DoFnRunner hash each input element (or some sample of 
input elements) before and after the DoFn is run. If the hashes differ, then 
the input element was mutated and the pipeline should return an error.

The problem is that does not actually have the semantics we want. See 
https://docs.python.org/3/reference/datamodel.html#object.__hash__

# Not all objects are hashable. For instance mutable containers like lists are 
unhashable.
# User defined classes are hashable by default, but the default hash is simply 
the id of the object, rather than its contents.

I've tried some workarounds such as:

# Convert unhashable containers to immutable hashable containers before hashing 
them
# Directly read  the {{__attr__}} of user defined classes and hash the elements

Even so, there are user defined classes that still break under this scheme. For 
instance, pandas DataFrame has properties that, when read, modifies a cache 
that is stored as a parameter. This scheme will treat the cache modification as 
a mutation and incorrectly raise a false positive.

As such, I haven't come up with a way to do this in a way that is robust enough 
to cover all conceivable user code.


was (Author: myffi...@gmail.com):
Sorry, I haven't captured the proposal on JIRA yet.

The general idea to have DoFnRunner hash each input element (or some sample of 
input elements) before and after the DoFn is run. If the hashes differ, then 
the input element was mutated and the pipeline should return an error.

The problem is that does not actually have the semantics we want. See 
https://docs.python.org/3/reference/datamodel.html#object.__hash__

# Not all objects are hashable. For instance mutable containers like lists are 
unhashable.
# User defined classes are hashable by default, but the default hash is simply 
the id of the object, rather than its contents.

I've tried some workarounds such as:

# Convert unhashable containers to immutable hashable containers before hashing 
them
# Traverse into the __attr__ of user defined classes and hash the elements

Even so, there are user defined classes that still break under this scheme. For 
instance, pandas DataFrame has properties that, when read, modifies a cache 
that is stored as a parameter. This scheme will treat the cache modification as 
a mutation and incorrectly raise a false positive.

As such, I haven't come up with a way to do this in a way that is robust enough 
to cover all conceivable user code.

> Check immutability violations in DirectPipelineRunner
> -
>
> Key: BEAM-529
> URL: https://issues.apache.org/jira/browse/BEAM-529
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-py-core
>Reporter: Ahmet Altay
>Priority: Minor
>  Labels: newbie, starter
>
> Users are going to mutate inputs and outputs of DoFn inappropriately. We 
> should help their tests fail to catch such mistakes. (Similar to the 
> DirectPipelineRunner in Java SDK)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-529) Check immutability violations in DirectPipelineRunner

2019-06-05 Thread Yifan Mai (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16856777#comment-16856777
 ] 

Yifan Mai commented on BEAM-529:


Sorry, I haven't captured the proposal on JIRA yet.

The general idea to have DoFnRunner hash each input element (or some sample of 
input elements) before and after the DoFn is run. If the hashes differ, then 
the input element was mutated and the pipeline should return an error.

The problem is that does not actually have the semantics we want. See 
https://docs.python.org/3/reference/datamodel.html#object.__hash__

# Not all objects are hashable. For instance mutable containers like lists are 
unhashable.
# User defined classes are hashable by default, but the default hash is simply 
the id of the object, rather than its contents.

I've tried some workarounds such as:

# Convert unhashable containers to immutable hashable containers before hashing 
them
# Traverse into the __attr__ of user defined classes and hash the elements

Even so, there are user defined classes that still break under this scheme. For 
instance, pandas DataFrame has properties that, when read, modifies a cache 
that is stored as a parameter. This scheme will treat the cache modification as 
a mutation and incorrectly raise a false positive.

As such, I haven't come up with a way to do this in a way that is robust enough 
to cover all conceivable user code.

> Check immutability violations in DirectPipelineRunner
> -
>
> Key: BEAM-529
> URL: https://issues.apache.org/jira/browse/BEAM-529
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-py-core
>Reporter: Ahmet Altay
>Priority: Minor
>  Labels: newbie, starter
>
> Users are going to mutate inputs and outputs of DoFn inappropriately. We 
> should help their tests fail to catch such mistakes. (Similar to the 
> DirectPipelineRunner in Java SDK)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-529) Check immutability violations in DirectPipelineRunner

2019-06-04 Thread Yifan Mai (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16856227#comment-16856227
 ] 

Yifan Mai commented on BEAM-529:


No, not planning to work on this in the near future. As discussed elsewhere, I 
am not confident that my proposal will actually work.

> Check immutability violations in DirectPipelineRunner
> -
>
> Key: BEAM-529
> URL: https://issues.apache.org/jira/browse/BEAM-529
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-py-core
>Reporter: Ahmet Altay
>Priority: Minor
>  Labels: newbie, starter
>
> Users are going to mutate inputs and outputs of DoFn inappropriately. We 
> should help their tests fail to catch such mistakes. (Similar to the 
> DirectPipelineRunner in Java SDK)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (BEAM-7340) DoFn.teardown metrics are lost in Python SDK

2019-05-16 Thread Yifan Mai (JIRA)
Yifan Mai created BEAM-7340:
---

 Summary: DoFn.teardown metrics are lost in Python SDK
 Key: BEAM-7340
 URL: https://issues.apache.org/jira/browse/BEAM-7340
 Project: Beam
  Issue Type: Bug
  Components: sdk-py-harness
Reporter: Yifan Mai


If user code in DoFn.shutdown updates custom user metrics, those updates will 
not get registered e.g. counter increments are not registered.

Context: In 
[FnApiRunner.run_stages|https://github.com/apache/beam/blob/4629e82512ef1606c78cf28a2d66402c3533e23f/sdks/python/apache_beam/runners/portability/fn_api_runner.py#L342-L364],
 DoFn.teardown is called in worker_handler_manager.close_all, but this is 
called outside of the FnApiRunner.run_stage calls, so no metrics / monitoring 
info is retrieved there.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (BEAM-563) DoFn Reuse: Update DirectRunner

2019-05-16 Thread Yifan Mai (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yifan Mai closed BEAM-563.
--
   Resolution: Done
Fix Version/s: Not applicable

This was also done in https://github.com/apache/beam/pull/7994

> DoFn Reuse: Update DirectRunner
> ---
>
> Key: BEAM-563
> URL: https://issues.apache.org/jira/browse/BEAM-563
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-py-core
>Reporter: Ahmet Altay
>Priority: Major
> Fix For: Not applicable
>
>
> https://issues.apache.org/jira/browse/BEAM-562 will add setup and teardown 
> methods to DoFns. Update DirectRunner to add support for these new methods.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (BEAM-562) DoFn Reuse: Add new methods to DoFn

2019-05-16 Thread Yifan Mai (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yifan Mai resolved BEAM-562.

   Resolution: Done
Fix Version/s: Not applicable

> DoFn Reuse: Add new methods to DoFn
> ---
>
> Key: BEAM-562
> URL: https://issues.apache.org/jira/browse/BEAM-562
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-py-core
>Reporter: Ahmet Altay
>Assignee: Yifan Mai
>Priority: Major
>  Labels: sdk-consistency
> Fix For: Not applicable
>
>  Time Spent: 10h 40m
>  Remaining Estimate: 0h
>
> Java SDK added setup and teardown methods to the DoFns. This makes DoFns 
> reusable and provide performance improvements. Python SDK should add support 
> for these new DoFn methods:
> Proposal doc: 
> https://docs.google.com/document/d/1LLQqggSePURt3XavKBGV7SZJYQ4NW8yCu63lBchzMRk/edit?ts=5771458f#



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (BEAM-7121) Provide deterministic version of Python's ProtoCoder

2019-05-16 Thread Yifan Mai (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yifan Mai closed BEAM-7121.
---
   Resolution: Done
Fix Version/s: Not applicable

> Provide deterministic version of Python's ProtoCoder
> 
>
> Key: BEAM-7121
> URL: https://issues.apache.org/jira/browse/BEAM-7121
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-core
>Reporter: Yifan Mai
>Priority: Minor
> Fix For: Not applicable
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> Passing deterministic=true to proto's 
> [SerializeToString|https://github.com/protocolbuffers/protobuf/blob/60b66a119d17f0a2a595a231bea87cd4f4cf2689/python/google/protobuf/message.py#L189-L204]
>  will result in deterministic encoding of maps in protos. This can be used to 
> provide a deterministic version of ProtoCoder.
> This would allow protos to be used as a key for grouping by key.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (BEAM-7121) Provide deterministic version of Python's ProtoCoder

2019-04-22 Thread Yifan Mai (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yifan Mai updated BEAM-7121:

Description: 
Passing deterministic=true to proto's 
[SerializeToString|https://github.com/protocolbuffers/protobuf/blob/60b66a119d17f0a2a595a231bea87cd4f4cf2689/python/google/protobuf/message.py#L189-L204]
 will result in deterministic encoding of maps in protos. This can be used to 
provide a deterministic version of ProtoCoder.

This would allow protos to be used as a key for grouping by key.

  was:Passing deterministic=true to proto's 
[SerializeToString|https://github.com/protocolbuffers/protobuf/blob/60b66a119d17f0a2a595a231bea87cd4f4cf2689/python/google/protobuf/message.py#L189-L204]
 will result in deterministic encoding of maps in protos. This can be used to 
provide a deterministic version of ProtoCoder.


> Provide deterministic version of Python's ProtoCoder
> 
>
> Key: BEAM-7121
> URL: https://issues.apache.org/jira/browse/BEAM-7121
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-core
>Reporter: Yifan Mai
>Priority: Minor
>
> Passing deterministic=true to proto's 
> [SerializeToString|https://github.com/protocolbuffers/protobuf/blob/60b66a119d17f0a2a595a231bea87cd4f4cf2689/python/google/protobuf/message.py#L189-L204]
>  will result in deterministic encoding of maps in protos. This can be used to 
> provide a deterministic version of ProtoCoder.
> This would allow protos to be used as a key for grouping by key.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (BEAM-7121) Provide deterministic version of Python's ProtoCoder

2019-04-19 Thread Yifan Mai (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yifan Mai updated BEAM-7121:

Description: Passing deterministic=true to proto's 
[SerializeToString|https://github.com/protocolbuffers/protobuf/blob/60b66a119d17f0a2a595a231bea87cd4f4cf2689/python/google/protobuf/message.py#L189-L204]
 will result in deterministic encoding of maps in protos. This can be used to 
provide a deterministic version of ProtoCoder.  (was: Passing 
deterministic=true to proto's 
[SerializeToString|[https://github.com/protocolbuffers/protobuf/blob/60b66a119d17f0a2a595a231bea87cd4f4cf2689/python/google/protobuf/message.py#L189-L204]]
 will result in deterministic encoding of maps in protos. This can be used to 
provide a deterministic version of ProtoCoder.)

> Provide deterministic version of Python's ProtoCoder
> 
>
> Key: BEAM-7121
> URL: https://issues.apache.org/jira/browse/BEAM-7121
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-core
>Reporter: Yifan Mai
>Priority: Minor
>
> Passing deterministic=true to proto's 
> [SerializeToString|https://github.com/protocolbuffers/protobuf/blob/60b66a119d17f0a2a595a231bea87cd4f4cf2689/python/google/protobuf/message.py#L189-L204]
>  will result in deterministic encoding of maps in protos. This can be used to 
> provide a deterministic version of ProtoCoder.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (BEAM-7121) Provide deterministic version of Python's ProtoCoder

2019-04-19 Thread Yifan Mai (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yifan Mai updated BEAM-7121:

Description: Passing deterministic=true to proto's 
[SerializeToString|[https://github.com/protocolbuffers/protobuf/blob/60b66a119d17f0a2a595a231bea87cd4f4cf2689/python/google/protobuf/message.py#L189-L204]]
 will result in deterministic encoding of maps in protos. This can be used to 
provide a deterministic version of ProtoCoder.  (was: [Passing 
deterministic=true to proto's 
SerializeToString|[https://github.com/protocolbuffers/protobuf/blob/60b66a119d17f0a2a595a231bea87cd4f4cf2689/python/google/protobuf/message.py#L189-L204]]
 will result in deterministic encoding of maps in protos. This can be used to 
provide a deterministic version of ProtoCoder.)

> Provide deterministic version of Python's ProtoCoder
> 
>
> Key: BEAM-7121
> URL: https://issues.apache.org/jira/browse/BEAM-7121
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-core
>Reporter: Yifan Mai
>Priority: Minor
>
> Passing deterministic=true to proto's 
> [SerializeToString|[https://github.com/protocolbuffers/protobuf/blob/60b66a119d17f0a2a595a231bea87cd4f4cf2689/python/google/protobuf/message.py#L189-L204]]
>  will result in deterministic encoding of maps in protos. This can be used to 
> provide a deterministic version of ProtoCoder.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (BEAM-7121) Provide deterministic version of Python's ProtoCoder

2019-04-19 Thread Yifan Mai (JIRA)
Yifan Mai created BEAM-7121:
---

 Summary: Provide deterministic version of Python's ProtoCoder
 Key: BEAM-7121
 URL: https://issues.apache.org/jira/browse/BEAM-7121
 Project: Beam
  Issue Type: Improvement
  Components: runner-core
Reporter: Yifan Mai


[Passing deterministic=true to proto's 
SerializeToString|[https://github.com/protocolbuffers/protobuf/blob/60b66a119d17f0a2a595a231bea87cd4f4cf2689/python/google/protobuf/message.py#L189-L204]]
 will result in deterministic encoding of maps in protos. This can be used to 
provide a deterministic version of ProtoCoder.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (BEAM-6746) Support DoFn.setup and DoFn.teardown in Python

2019-03-04 Thread Yifan Mai (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-6746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yifan Mai closed BEAM-6746.
---
   Resolution: Duplicate
Fix Version/s: Not applicable

Looks like this was already filed as BEAM-562

> Support DoFn.setup and DoFn.teardown in Python
> --
>
> Key: BEAM-6746
> URL: https://issues.apache.org/jira/browse/BEAM-6746
> Project: Beam
>  Issue Type: Improvement
>  Components: beam-model, sdk-py-core
>Reporter: Yifan Mai
>Priority: Major
> Fix For: Not applicable
>
>
> DoFn.setup and DoFn.teardown is currently supported in Java but not Python. 
> These would be useful for performing expensive per-thread initialization.
> While lazy initialization is a usable workaround, first class support for 
> setup and teardown would encourage consistent conventions and make the API 
> more uniform with the Java version.
> This is related to BEAM-3736 (same issue, but for CombineFn).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (BEAM-6746) Support DoFn.setup and DoFn.teardown in Python

2019-02-26 Thread Yifan Mai (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-6746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yifan Mai updated BEAM-6746:

Issue Type: Improvement  (was: Bug)

> Support DoFn.setup and DoFn.teardown in Python
> --
>
> Key: BEAM-6746
> URL: https://issues.apache.org/jira/browse/BEAM-6746
> Project: Beam
>  Issue Type: Improvement
>  Components: beam-model, sdk-py-core
>Reporter: Yifan Mai
>Priority: Major
>
> DoFn.setup and DoFn.teardown is currently supported in Java but not Python. 
> These would be useful for performing expensive per-thread initialization.
> While lazy initialization is a usable workaround, first class support for 
> setup and teardown would encourage consistent conventions and make the API 
> more uniform with the Java version.
> This is related to BEAM-3736 (same issue, but for CombineFn).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (BEAM-6746) Support DoFn.setup and DoFn.teardown in Python

2019-02-25 Thread Yifan Mai (JIRA)
Yifan Mai created BEAM-6746:
---

 Summary: Support DoFn.setup and DoFn.teardown in Python
 Key: BEAM-6746
 URL: https://issues.apache.org/jira/browse/BEAM-6746
 Project: Beam
  Issue Type: Bug
  Components: beam-model, sdk-py-core
Reporter: Yifan Mai


DoFn.setup and DoFn.teardown is currently supported in Java but not Python. 
These would be useful for performing expensive per-thread initialization.

While lazy initialization is a usable workaround, first class support for setup 
and teardown would encourage consistent conventions and make the API more 
uniform with the Java version.

This is related to BEAM-3736 (same issue, but for CombineFn).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)