[jira] [Work logged] (BEAM-7060) Design Py3-compatible typehints annotation support in Beam 3.
[ https://issues.apache.org/jira/browse/BEAM-7060?focusedWorklogId=288365&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288365 ] ASF GitHub Bot logged work on BEAM-7060: Author: ASF GitHub Bot Created on: 03/Aug/19 02:21 Start Date: 03/Aug/19 02:21 Worklog Time Spent: 10m Work Description: udim commented on issue #9239: [BEAM-7060] Fix :sdks:python:apache_beam:testing:load_tests:run breakage URL: https://github.com/apache/beam/pull/9239#issuecomment-517886877 Run Python Load Tests Smoke This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 288365) Time Spent: 9h (was: 8h 50m)
> Design Py3-compatible typehints annotation support in Beam 3.
> -
> Key: BEAM-7060
> URL: https://issues.apache.org/jira/browse/BEAM-7060
> Project: Beam
> Issue Type: Sub-task
> Components: sdk-py-core
> Reporter: Valentyn Tymofieiev
> Assignee: Udi Meiri
> Priority: Major
> Time Spent: 9h
> Remaining Estimate: 0h
>
> The existing [Typehints implementation in Beam|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/typehints/] relies heavily on internal details of the CPython implementation, and some of its assumptions broke as of Python 3.6 (see, for example, https://issues.apache.org/jira/browse/BEAM-6877), which makes typehints support unusable on Python 3.6 as of now. The [Python 3 Kanban Board|https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=245&view=detail] lists several specific typehints-related breakages, prefixed with "TypeHints Py3 Error".
> We need to decide whether to:
> - Deprecate the in-house typehints implementation.
> - Continue to support the in-house implementation, which at this point is stale code and has other known issues.
> - Attempt to use off-the-shelf libraries for supporting type annotations, such as Pytype, Mypy, or PyAnnotate.
> With regard to this decision, we also need to plan immediate next steps to unblock adoption of Beam for Python 3.6+ users. One potential option may be to have the Beam SDK ignore any typehint annotations on Py 3.6+.
> cc: [~udim], [~altay], [~robertwb].
-- This message was sent by Atlassian JIRA (v7.6.14#76016)
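For context on the annotation machinery at stake: since Python 3.0, type hints are ordinary function annotations, and `typing.get_type_hints` reads them through a stable, documented API rather than through CPython internals. A generic sketch (not Beam code):

```python
import typing

def parse(element: str) -> typing.List[int]:
    """An ordinary function carrying PEP 484 annotations."""
    return [int(x) for x in element.split(',')]

# get_type_hints resolves the annotations without relying on the
# CPython implementation details that changed in 3.6.
hints = typing.get_type_hints(parse)
print(hints)  # {'element': <class 'str'>, 'return': typing.List[int]}
```

An off-the-shelf approach (Mypy, Pytype, PyAnnotate) would consume exactly this annotation data instead of Beam's decorator-based hints.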
[jira] [Work logged] (BEAM-7060) Design Py3-compatible typehints annotation support in Beam 3.
[ https://issues.apache.org/jira/browse/BEAM-7060?focusedWorklogId=288366&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288366 ] ASF GitHub Bot logged work on BEAM-7060: Author: ASF GitHub Bot Created on: 03/Aug/19 02:21 Start Date: 03/Aug/19 02:21 Worklog Time Spent: 10m Work Description: udim commented on issue #9239: [BEAM-7060] Fix :sdks:python:apache_beam:testing:load_tests:run breakage URL: https://github.com/apache/beam/pull/9239#issuecomment-517886896 Run Java Load Tests Smoke Issue Time Tracking --- Worklog Id: (was: 288366) Time Spent: 9h 10m (was: 9h)
[jira] [Work logged] (BEAM-7060) Design Py3-compatible typehints annotation support in Beam 3.
[ https://issues.apache.org/jira/browse/BEAM-7060?focusedWorklogId=288363&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288363 ] ASF GitHub Bot logged work on BEAM-7060: Author: ASF GitHub Bot Created on: 03/Aug/19 02:20 Start Date: 03/Aug/19 02:20 Worklog Time Spent: 10m Work Description: udim commented on issue #9239: [BEAM-7060] Fix :sdks:python:apache_beam:testing:load_tests:run breakage URL: https://github.com/apache/beam/pull/9239#issuecomment-517886840 R: @aaltay Issue Time Tracking --- Worklog Id: (was: 288363) Time Spent: 8h 40m (was: 8.5h)
[jira] [Work logged] (BEAM-7060) Design Py3-compatible typehints annotation support in Beam 3.
[ https://issues.apache.org/jira/browse/BEAM-7060?focusedWorklogId=288364&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288364 ] ASF GitHub Bot logged work on BEAM-7060: Author: ASF GitHub Bot Created on: 03/Aug/19 02:20 Start Date: 03/Aug/19 02:20 Worklog Time Spent: 10m Work Description: udim commented on issue #9239: [BEAM-7060] Fix :sdks:python:apache_beam:testing:load_tests:run breakage URL: https://github.com/apache/beam/pull/9239#issuecomment-517886867 Run Java Load Tests Smoke Issue Time Tracking --- Worklog Id: (was: 288364) Time Spent: 8h 50m (was: 8h 40m)
[jira] [Work logged] (BEAM-7060) Design Py3-compatible typehints annotation support in Beam 3.
[ https://issues.apache.org/jira/browse/BEAM-7060?focusedWorklogId=288362&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288362 ] ASF GitHub Bot logged work on BEAM-7060: Author: ASF GitHub Bot Created on: 03/Aug/19 02:18 Start Date: 03/Aug/19 02:18 Worklog Time Spent: 10m Work Description: udim commented on issue #9239: [BEAM-7060] Fix :sdks:python:apache_beam:testing:load_tests:run breakage URL: https://github.com/apache/beam/pull/9239#issuecomment-517886706 run java precommit Issue Time Tracking --- Worklog Id: (was: 288362) Time Spent: 8.5h (was: 8h 20m)
[jira] [Work logged] (BEAM-7860) v1new ReadFromDatastore returns duplicates if keys are of mixed types
[ https://issues.apache.org/jira/browse/BEAM-7860?focusedWorklogId=288360&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288360 ] ASF GitHub Bot logged work on BEAM-7860: Author: ASF GitHub Bot Created on: 03/Aug/19 02:15 Start Date: 03/Aug/19 02:15 Worklog Time Spent: 10m Work Description: udim commented on pull request #9240: [BEAM-7860] Python Datastore: fix key sort order URL: https://github.com/apache/beam/pull/9240 This is a regression from the v1 client. Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily: - [ ] [**Choose reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and mention them in a comment (`R: @username`). - [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf). 
Post-Commit Tests Status (on master branch): [truncated table of Jenkins build-status badges for the Go, Java, and Python SDKs across runners]
[jira] [Work logged] (BEAM-7860) v1new ReadFromDatastore returns duplicates if keys are of mixed types
[ https://issues.apache.org/jira/browse/BEAM-7860?focusedWorklogId=288361&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288361 ] ASF GitHub Bot logged work on BEAM-7860: Author: ASF GitHub Bot Created on: 03/Aug/19 02:15 Start Date: 03/Aug/19 02:15 Worklog Time Spent: 10m Work Description: udim commented on issue #9240: [BEAM-7860] Python Datastore: fix key sort order URL: https://github.com/apache/beam/pull/9240#issuecomment-517886543 Run Java Load Tests Smoke This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 288361) Time Spent: 20m (was: 10m) > v1new ReadFromDatastore returns duplicates if keys are of mixed types > - > > Key: BEAM-7860 > URL: https://issues.apache.org/jira/browse/BEAM-7860 > Project: Beam > Issue Type: Bug > Components: io-python-gcp >Affects Versions: 2.13.0 > Environment: Python 2.7 > Python 3.7 >Reporter: Niels Stender >Assignee: Udi Meiri >Priority: Major > Time Spent: 20m > Remaining Estimate: 0h > > In the presence of mixed type keys, v1new ReadFromDatastore may return > duplicate items. The attached example returns 4 records, not the expected 3. 
>
> {code}
> from __future__ import unicode_literals
>
> import apache_beam as beam
> from apache_beam.io.gcp.datastore.v1new.types import Key, Entity, Query
> from apache_beam.io.gcp.datastore.v1new import datastoreio
>
> config = dict(project='your-google-project', namespace='test')
>
> def test_mixed():
>     keys = [
>         Key(['mixed', '10038260-iperm_eservice'], **config),
>         Key(['mixed', 4812224868188160], **config),
>         Key(['mixed', '99152975-pointshop'], **config)
>     ]
>     entities = map(lambda key: Entity(key=key), keys)
>
>     with beam.Pipeline() as p:
>         (p
>          | beam.Create(entities)
>          | datastoreio.WriteToDatastore(project=config['project']))
>
>     query = Query(kind='mixed', **config)
>     with beam.Pipeline() as p:
>         (p
>          | datastoreio.ReadFromDatastore(query=query, num_splits=4)
>          | beam.io.WriteToText('tmp.txt', num_shards=1, shard_name_template=''))
>
>     items = open('tmp.txt').read().strip().split('\n')
>     assert len(items) == 3, 'incorrect number of items'
> {code}
-- This message was sent by Atlassian JIRA (v7.6.14#76016)
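A hedged sketch of why mixed key types cause duplicate reads: Cloud Datastore orders keys with numeric IDs before keys with string names, so a query splitter whose client-side comparator does not mirror that ordering can produce overlapping key ranges, and some entities are read by two splits. `path_element_sort_key` below is an illustrative name invented for this sketch, not the code from the actual fix in PR #9240:

```python
def path_element_sort_key(kind, id_or_name):
    # Rank numeric IDs before string names (assumed to mirror Datastore's
    # server-side key ordering), then compare within each rank.
    if isinstance(id_or_name, str):
        return (kind, 1, id_or_name)        # key names sort second
    return (kind, 0, '%020d' % id_or_name)  # numeric IDs sort first

paths = [('mixed', '10038260-iperm_eservice'),
         ('mixed', 4812224868188160),
         ('mixed', '99152975-pointshop')]
ordered = sorted(paths, key=lambda p: path_element_sort_key(*p))
print([p[1] for p in ordered])
# [4812224868188160, '10038260-iperm_eservice', '99152975-pointshop']
```

With a comparator like this, split boundaries partition the key space without overlap; a naive lexicographic sort of mixed IDs and names does not.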
[jira] [Work logged] (BEAM-7819) PubsubMessage message parsing is lacking non-attribute fields
[ https://issues.apache.org/jira/browse/BEAM-7819?focusedWorklogId=288357&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288357 ] ASF GitHub Bot logged work on BEAM-7819: Author: ASF GitHub Bot Created on: 03/Aug/19 01:49 Start Date: 03/Aug/19 01:49 Worklog Time Spent: 10m Work Description: aaltay commented on pull request #9232: [BEAM-7819] Python - parse PubSub message_id into attributes property URL: https://github.com/apache/beam/pull/9232#discussion_r310334401
## File path: sdks/python/apache_beam/io/gcp/pubsub.py ##
@@ -127,6 +127,10 @@ def _from_message(msg):
     """
     # Convert ScalarMapContainer to dict.
     attributes = dict((key, msg.attributes[key]) for key in msg.attributes)
+    # Parse the PubSub message_id and add to attributes
+    if msg.message_id != None:
Review comment: Could message_id be None? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 288357) Time Spent: 0.5h (was: 20m)
> PubsubMessage message parsing is lacking non-attribute fields
> -
> Key: BEAM-7819
> URL: https://issues.apache.org/jira/browse/BEAM-7819
> Project: Beam
> Issue Type: Bug
> Components: io-python-gcp
> Reporter: Ahmet Altay
> Assignee: Udi Meiri
> Priority: Major
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> User reported issue:
> https://lists.apache.org/thread.html/139b0c15abc6471a2e2202d76d915c645a529a23ecc32cd9cfecd315@%3Cuser.beam.apache.org%3E
> """
> Looking at the source code, with my untrained python eyes, I think if the intention is to include the message id and the publish time in the attributes attribute of the PubSubMessage type, then the protobuf mapping is missing something:
>
> @staticmethod
> def _from_proto_str(proto_msg):
>     """Construct from serialized form of ``PubsubMessage``.
>
>     Args:
>       proto_msg: String containing a serialized protobuf of type
>       https://cloud.google.com/pubsub/docs/reference/rpc/google.pubsub.v1#google.pubsub.v1.PubsubMessage
>
>     Returns:
>       A new PubsubMessage object.
>     """
>     msg = pubsub.types.pubsub_pb2.PubsubMessage()
>     msg.ParseFromString(proto_msg)
>     # Convert ScalarMapContainer to dict.
>     attributes = dict((key, msg.attributes[key]) for key in msg.attributes)
>     return PubsubMessage(msg.data, attributes)
>
> The protobuf definition is here:
> https://cloud.google.com/pubsub/docs/reference/rpc/google.pubsub.v1#google.pubsub.v1.PubsubMessage
> and so it looks as if the message_id and publish_time are not being parsed, as they are separate from the attributes. Perhaps the PubsubMessage class needs expanding to include these as attributes, or they would need adding to the dictionary for attributes. This would only need doing for _from_proto_str, as obviously they would not need to be populated when transmitting a message to PubSub.
> My python is not great; I'm assuming the latter option would need to look something like this?
>
> attributes = dict((key, msg.attributes[key]) for key in msg.attributes)
> attributes.update({'message_id': msg.message_id, 'publish_time': msg.publish_time})
> return PubsubMessage(msg.data, attributes)
> """
-- This message was sent by Atlassian JIRA (v7.6.14#76016)
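The "fold the extra fields into attributes" option the reporter sketches can be written out as a self-contained illustration. This is a hedged sketch, not the committed fix: `FakeProtoMessage` and `parse_message` are names invented here, with a plain object standing in for the real `PubsubMessage` protobuf.

```python
class FakeProtoMessage:
    """Stand-in for google.pubsub.v1.PubsubMessage in this sketch."""
    def __init__(self, data, attributes, message_id, publish_time):
        self.data = data
        self.attributes = attributes
        self.message_id = message_id
        self.publish_time = publish_time

def parse_message(msg):
    # Convert the attribute map to a plain dict (the real code converts
    # a ScalarMapContainer), then fold in the non-attribute fields.
    attributes = dict(msg.attributes)
    attributes.update({'message_id': msg.message_id,
                       'publish_time': msg.publish_time})
    return (msg.data, attributes)

data, attrs = parse_message(
    FakeProtoMessage(b'payload', {'k': 'v'}, 'id-123', '2019-08-03T00:00:00Z'))
print(attrs['message_id'])  # id-123
```

The alternative discussed on the thread is to widen the `PubsubMessage` class with dedicated `message_id` and `publish_time` fields instead of overloading the attributes dict.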
[jira] [Created] (BEAM-7888) Test Multi Process Direct Runner With Largish Data
Ahmet Altay created BEAM-7888: - Summary: Test Multi Process Direct Runner With Largish Data Key: BEAM-7888 URL: https://issues.apache.org/jira/browse/BEAM-7888 Project: Beam Issue Type: Bug Components: sdk-py-core Reporter: Ahmet Altay Assignee: Hannah Jiang Filing this as a tracker. We can test the multiprocess runner with a largish amount of data, to the extent that we can do this on Jenkins. This will serve two purposes: - Find issues related to multiprocessing; it is easier to surface rare issues when running over non-trivial data. - Serve as a baseline (if not a benchmark) to understand the limits of the multiprocess runner. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (BEAM-7874) FnApi only supports up to 10 workers
[ https://issues.apache.org/jira/browse/BEAM-7874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yifan zou updated BEAM-7874: Priority: Blocker (was: Major)
> FnApi only supports up to 10 workers
> -
> Key: BEAM-7874
> URL: https://issues.apache.org/jira/browse/BEAM-7874
> Project: Beam
> Issue Type: Bug
> Components: sdk-py-core
> Reporter: Hannah Jiang
> Assignee: Hannah Jiang
> Priority: Blocker
> Fix For: 2.15.0
> Time Spent: 1h
> Remaining Estimate: 0h
>
> Because max_workers of the gRPC servers is hardcoded to 10, the runner only supports up to 10 workers; if direct_num_workers is greater than 10, the pipeline hangs, because not all workers get connected to the runner.
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/fn_api_runner.py#L1141]
-- This message was sent by Atlassian JIRA (v7.6.14#76016)
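The failure mode is generic to fixed-size server executors: a pool of N threads can service at most N long-lived connections, so the (N+1)th worker waits until a slot frees, which for persistent worker channels is never. A minimal stand-in sketch using concurrent.futures (not gRPC, and not Beam's actual runner code):

```python
import threading
import time
from concurrent import futures

def simulate(pool_size, num_workers):
    """Count how many 'workers' get a live slot from a fixed-size pool."""
    release = threading.Event()
    started = []
    lock = threading.Lock()

    def connect(worker_id):
        with lock:
            started.append(worker_id)
        release.wait()  # hold the slot open, like a worker's control channel

    pool = futures.ThreadPoolExecutor(max_workers=pool_size)
    fs = [pool.submit(connect, i) for i in range(num_workers)]
    time.sleep(0.2)            # let the pool spin up its threads
    connected = len(started)   # everyone else is queued, i.e. hung
    release.set()
    futures.wait(fs)
    pool.shutdown()
    return connected

print(simulate(pool_size=10, num_workers=12))  # 10
```

The fix direction implied by the issue is to size the server's executor from direct_num_workers instead of a hardcoded 10.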
[jira] [Updated] (BEAM-7873) FnApi with Subprocess runner hangs frequently when running with multi workers with py2
[ https://issues.apache.org/jira/browse/BEAM-7873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yifan zou updated BEAM-7873: Priority: Blocker (was: Major)
> FnApi with Subprocess runner hangs frequently when running with multi workers with py2
> --
> Key: BEAM-7873
> URL: https://issues.apache.org/jira/browse/BEAM-7873
> Project: Beam
> Issue Type: Bug
> Components: sdk-py-core
> Reporter: Hannah Jiang
> Assignee: Hannah Jiang
> Priority: Blocker
> Fix For: 2.15.0
>
> The pipeline hangs at [subprocess.Popen()|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/local_job_service.py#L203] when shutting it down. Looking into the source code of the subprocess library, [py27|https://github.com/enthought/Python-2.7.3/blob/master/Lib/subprocess.py#L1286] does not take any lock, while [py3|https://github.com/python/cpython/blob/3.7/Lib/subprocess.py#L1592] locks when waiting. Py3 added locks at other places in Popen() as well; any of the code paths left unlocked in py2 may contribute to the problem. We can add a lock around the Popen() call to prevent the deadlock.
-- This message was sent by Atlassian JIRA (v7.6.14#76016)
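The mitigation described above amounts to serializing Popen() calls behind a single lock so that Python 2's unlocked Popen internals are never entered concurrently. A minimal sketch, with `_popen_lock` and `locked_popen` as names invented for this illustration (not Beam's actual code):

```python
import subprocess
import sys
import threading

# One process-wide lock serializes all Popen() calls, preventing two
# threads from racing through the unlocked py2 fork/exec code paths.
_popen_lock = threading.Lock()

def locked_popen(*args, **kwargs):
    with _popen_lock:
        return subprocess.Popen(*args, **kwargs)

proc = locked_popen([sys.executable, '-c', 'print("ok")'],
                    stdout=subprocess.PIPE)
out, _ = proc.communicate()
print(out.strip())  # b'ok'
```

Only the Popen() construction needs the lock; communicate() and wait() on distinct process objects can still proceed concurrently.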
[jira] [Work logged] (BEAM-5878) Support DoFns with Keyword-only arguments in Python 3.
[ https://issues.apache.org/jira/browse/BEAM-5878?focusedWorklogId=288340&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288340 ] ASF GitHub Bot logged work on BEAM-5878: Author: ASF GitHub Bot Created on: 03/Aug/19 00:28 Start Date: 03/Aug/19 00:28 Worklog Time Spent: 10m Work Description: aaltay commented on issue #9237: [BEAM-5878] support DoFns with Keyword-only arguments URL: https://github.com/apache/beam/pull/9237#issuecomment-517878579 R: @tvalentyn Issue Time Tracking --- Worklog Id: (was: 288340) Time Spent: 5h 20m (was: 5h 10m)
> Support DoFns with Keyword-only arguments in Python 3.
> --
> Key: BEAM-5878
> URL: https://issues.apache.org/jira/browse/BEAM-5878
> Project: Beam
> Issue Type: Sub-task
> Components: sdk-py-core
> Reporter: Valentyn Tymofieiev
> Assignee: yoshiki obata
> Priority: Minor
> Fix For: 2.16.0
> Time Spent: 5h 20m
> Remaining Estimate: 0h
>
> Python 3.0 [adds the possibility|https://www.python.org/dev/peps/pep-3102/] to define functions with keyword-only arguments. Currently Beam does not handle them correctly. [~ruoyu] pointed out [one place|https://github.com/apache/beam/blob/a56ce43109c97c739fa08adca45528c41e3c925c/sdks/python/apache_beam/typehints/decorators.py#L118] in our codebase that we should fix: in Python 3.0, inspect.getargspec() will fail on functions with keyword-only arguments, but a newer method, [inspect.getfullargspec()|https://docs.python.org/3/library/inspect.html#inspect.getfullargspec], supports them.
> There may be implications for our (best-effort) type-hints machinery. We should also add Py3-only unit tests that cover DoFns with keyword-only arguments once Beam Python 3 tests are in good shape.
-- This message was sent by Atlassian JIRA (v7.6.14#76016)
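The getargspec/getfullargspec distinction described above is easy to demonstrate with a plain function carrying keyword-only arguments (the `process` function below is a generic stand-in, not a Beam DoFn):

```python
import inspect

# PEP 3102 keyword-only arguments: everything after the bare * must be
# passed by keyword. getfullargspec reports them; the legacy getargspec
# cannot represent them and fails on such signatures in Python 3.
def process(element, *, window=None, timestamp=None):
    return element

spec = inspect.getfullargspec(process)
print(spec.args)        # ['element']
print(spec.kwonlyargs)  # ['window', 'timestamp']
```

Introspection code that migrates to getfullargspec() can then treat kwonlyargs and kwonlydefaults alongside the positional args when building type hints.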
[jira] [Work logged] (BEAM-7678) typehints with_output_types annotation doesn't work for stateful DoFn
[ https://issues.apache.org/jira/browse/BEAM-7678?focusedWorklogId=288339&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288339 ] ASF GitHub Bot logged work on BEAM-7678: Author: ASF GitHub Bot Created on: 03/Aug/19 00:21 Start Date: 03/Aug/19 00:21 Worklog Time Spent: 10m Work Description: ecanzonieri commented on pull request #9238: [BEAM-7678] Fixes bug in output element_type generation in Kv PipelineVisitor URL: https://github.com/apache/beam/pull/9238 This PR tries to address BEAM-7678. When output typehints are used, the `element_type` is already defined, so in theory there is no need to infer it. Some types cannot be inferred by `infer_output_type`; in those cases the typehint should be used to bypass the inferred type. I'm not sure how to test this code; I can reproduce the issue in a manual test, using as the type a class that `infer_output_type` is not able to infer. Follow this checklist to help us incorporate your contribution quickly and easily: - [x] [**Choose reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and mention them in a comment (`R: @username`). - [x] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
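The policy in the PR description reduces to "prefer a declared typehint over inference". `resolve_output_type` below is a name invented for this sketch; it is not Beam's actual PipelineVisitor code:

```python
# If the user declared an output type (e.g. via with_output_types),
# use it verbatim; only fall back to inference when nothing was declared.
def resolve_output_type(declared_hint, infer):
    return declared_hint if declared_hint is not None else infer()

# A declared hint bypasses inference entirely.
assert resolve_output_type(int, lambda: object) is int
# Without a hint, the (possibly imprecise) inference result is used.
assert resolve_output_type(None, lambda: object) is object
```

This matters precisely for the classes infer_output_type cannot analyze: with the declared hint taking precedence, the weak inference result never overwrites an explicit annotation.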
Post-Commit Tests Status (on master branch): table of per-SDK (Go, Java, Python) build-status badges for the Apex, Dataflow, Flink, Gearpump, Samza, and Spark runners on builds.apache.org (badge table truncated in the archive).
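The PR above prefers a declared output typehint over inference. That preference can be illustrated with a plain-Python sketch; `infer_output_type` and `element_type_for` here are toy stand-ins for this demonstration, not Beam's actual implementation:

```python
import typing

class Unknown:
    """A user-defined class that a naive inference pass cannot handle."""

def infer_output_type(fn):
    # Toy stand-in for type inference: read the return annotation if
    # present, otherwise fall back to Any (the "cannot infer" case).
    return typing.get_type_hints(fn).get('return', typing.Any)

def element_type_for(fn, declared=None):
    # Mirror the fix described above: when a typehint is declared,
    # use it directly instead of running inference.
    return declared if declared is not None else infer_output_type(fn)

def produce(x) -> Unknown:
    return Unknown()

assert element_type_for(produce) is Unknown          # annotation wins
assert element_type_for(lambda x: x) is typing.Any   # inference falls back
assert element_type_for(lambda x: x, int) is int     # declared hint bypasses inference
```

The last assertion is the behavior the PR aims for: an explicitly supplied element type short-circuits inference entirely.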
[jira] [Work logged] (BEAM-5878) Support DoFns with Keyword-only arguments in Python 3.
[ https://issues.apache.org/jira/browse/BEAM-5878?focusedWorklogId=288338&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288338 ] ASF GitHub Bot logged work on BEAM-5878: Author: ASF GitHub Bot Created on: 03/Aug/19 00:19 Start Date: 03/Aug/19 00:19 Worklog Time Spent: 10m Work Description: lazylynx commented on pull request #9237: [BEAM-5878] support DoFns with Keyword-only arguments URL: https://github.com/apache/beam/pull/9237 Support DoFns with keyword-only arguments in Python 3, and re-add the test reverted in #8750 in a form that raises no syntax error in Python 2. Test conditions were fixed as follows due to errors: # test_side_input_keyword_only_args ``` result2 = pcol | 'compute2' >> beam.FlatMap( sort_with_side_inputs, beam.pvalue.AsIter(side)) ``` to ``` result2 = pcol | 'compute2' >> beam.FlatMap( sort_with_side_inputs, beam.pvalue.AsList(side)) ``` # test_combine_keyword_only_args ``` assert_that(result2, equal_to([23]), label='assert2') ``` to ``` assert_that(result2, equal_to([49]), label='assert2') ``` Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily: - [ ] [**Choose reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and mention them in a comment (`R: @username`). - [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf). 
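The keyword-only-argument support discussed above hinges on Python 3 syntax that the SDK must be able to introspect. A minimal stdlib-only sketch (the function name is illustrative, not Beam's API):

```python
import inspect

# A DoFn-style callable using Python 3 keyword-only parameters:
# everything after the bare * must be passed by keyword.
def sort_with_side_inputs(element, *, reverse=False):
    return sorted(element, reverse=reverse)

# The SDK has to use getfullargspec here; getargspec, the Python 2-era
# API, has no notion of keyword-only arguments.
spec = inspect.getfullargspec(sort_with_side_inputs)
print(spec.args)        # positional parameters: ['element']
print(spec.kwonlyargs)  # keyword-only parameters: ['reverse']

print(sort_with_side_inputs([3, 1, 2], reverse=True))  # [3, 2, 1]
```

This is also why the test had to be reverted in #8750: the bare `*` in the signature is a syntax error under Python 2, so the test module itself could not even be imported there.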
[jira] [Work logged] (BEAM-7667) report GCS throttling time to Dataflow autoscaler
[ https://issues.apache.org/jira/browse/BEAM-7667?focusedWorklogId=288334&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288334 ] ASF GitHub Bot logged work on BEAM-7667: Author: ASF GitHub Bot Created on: 02/Aug/19 23:57 Start Date: 02/Aug/19 23:57 Worklog Time Spent: 10m Work Description: chamikaramj commented on pull request #8973: [BEAM-7667] report GCS throttling time to Dataflow autoscaler URL: https://github.com/apache/beam/pull/8973#discussion_r310327778 ## File path: sdks/python/apache_beam/io/gcp/internal/clients/storage/storage_v1_client.py ## @@ -20,10 +20,17 @@ from __future__ import absolute_import +import logging +import time Review comment: I believe this class is auto-generated, so this change might get reverted the next time we update the client. We should at least add a test that prevents this, and move most of the changes here to another file (probably everything other than `retry_func=xxx`). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 288334) Time Spent: 1.5h (was: 1h 20m) > report GCS throttling time to Dataflow autoscaler > - > > Key: BEAM-7667 > URL: https://issues.apache.org/jira/browse/BEAM-7667 > Project: Beam > Issue Type: Improvement > Components: runner-dataflow >Reporter: Heejong Lee >Assignee: Heejong Lee >Priority: Major > Time Spent: 1.5h > Remaining Estimate: 0h > > report GCS throttling time to Dataflow autoscaler. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
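Conceptually, "reporting GCS throttling time" means accumulating the time a client spends backing off on throttled requests so that total can be surfaced as a counter. A toy sketch under that assumption (the class and counter names are hypothetical, not the Dataflow or GCS client API):

```python
import time

class ThrottlingRetry:
    """Toy sketch: wrap a callable with retries and accumulate the seconds
    spent backing off, so the total could be reported to an autoscaler."""
    def __init__(self):
        self.throttled_secs = 0.0

    def call(self, func, attempts=3, base_delay=0.01):
        for attempt in range(attempts - 1):
            try:
                return func()
            except IOError:
                start = time.monotonic()
                time.sleep(base_delay * (2 ** attempt))  # exponential backoff
                self.throttled_secs += time.monotonic() - start
        return func()  # final attempt propagates any error

# Simulate a service that throttles twice, then succeeds.
state = {'calls': 0}
def flaky():
    state['calls'] += 1
    if state['calls'] < 3:
        raise IOError('429 slow down')
    return 'ok'

retry = ThrottlingRetry()
result = retry.call(flaky)
print(result, retry.throttled_secs)
```

Keeping this logic in a separate module, as the review suggests, would let the auto-generated client file carry only the `retry_func=...` hook.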
[jira] [Comment Edited] (BEAM-7860) v1new ReadFromDatastore returns duplicates if keys are of mixed types
[ https://issues.apache.org/jira/browse/BEAM-7860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16899271#comment-16899271 ] Udi Meiri edited comment on BEAM-7860 at 8/2/19 11:51 PM: -- Verified that this happens in both v1 and v1new. Edit: not verified in v1 after all (I thought I could reproduce in a unit test, but I had a bug). I did recreate the issue in v1new using the sample code. was (Author: udim): Verified that this happens in both v1 and v1new. > v1new ReadFromDatastore returns duplicates if keys are of mixed types > - > > Key: BEAM-7860 > URL: https://issues.apache.org/jira/browse/BEAM-7860 > Project: Beam > Issue Type: Bug > Components: io-python-gcp >Affects Versions: 2.13.0 > Environment: Python 2.7 > Python 3.7 >Reporter: Niels Stender >Assignee: Udi Meiri >Priority: Major > > In the presence of mixed type keys, v1new ReadFromDatastore may return > duplicate items. The attached example returns 4 records, not the expected 3. > > {code:java} > // code placeholder > from __future__ import unicode_literals > import apache_beam as beam > from apache_beam.io.gcp.datastore.v1new.types import Key, Entity, Query > from apache_beam.io.gcp.datastore.v1new import datastoreio > config = dict(project='your-google-project', namespace='test') > def test_mixed(): > keys = [ > Key(['mixed', '10038260-iperm_eservice'], **config), > Key(['mixed', 4812224868188160], **config), > Key(['mixed', '99152975-pointshop'], **config) > ] > entities = map(lambda key: Entity(key=key), keys) > with beam.Pipeline() as p: > (p > | beam.Create(entities) > | datastoreio.WriteToDatastore(project=config['project']) > ) > query = Query(kind='mixed', **config) > with beam.Pipeline() as p: > (p > | datastoreio.ReadFromDatastore(query=query, num_splits=4) > | beam.io.WriteToText('tmp.txt', num_shards=1, > shard_name_template='') > ) > items = open('tmp.txt').read().strip().split('\n') > assert len(items) == 3, 'incorrect number of items' > {code} -- 
This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7877) Change the log level when deleting unknown temporary files in FileBasedSink
[ https://issues.apache.org/jira/browse/BEAM-7877?focusedWorklogId=288333&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288333 ] ASF GitHub Bot logged work on BEAM-7877: Author: ASF GitHub Bot Created on: 02/Aug/19 23:51 Start Date: 02/Aug/19 23:51 Worklog Time Spent: 10m Work Description: aaltay commented on pull request #9227: [BEAM-7877] Change the log level when deleting unknown temporary files URL: https://github.com/apache/beam/pull/9227 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 288333) Time Spent: 1h (was: 50m) > Change the log level when deleting unknown temporary files in FileBasedSink > --- > > Key: BEAM-7877 > URL: https://issues.apache.org/jira/browse/BEAM-7877 > Project: Beam > Issue Type: Improvement > Components: io-java-files >Reporter: Heejong Lee >Assignee: Heejong Lee >Priority: Major > Time Spent: 1h > Remaining Estimate: 0h > > Currently the log level is info. The newly proposed log level is warning, since > deleting unknown temporary files is a bad sign and sometimes leads to data > loss. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7744) LTS backport: Temporary directory for WriteOperation may not be unique in FileBaseSink
[ https://issues.apache.org/jira/browse/BEAM-7744?focusedWorklogId=288330&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288330 ] ASF GitHub Bot logged work on BEAM-7744: Author: ASF GitHub Bot Created on: 02/Aug/19 23:50 Start Date: 02/Aug/19 23:50 Worklog Time Spent: 10m Work Description: aaltay commented on issue #9071: [BEAM-7744] LTS backport: Temporary directory for WriteOperation may not be unique in FileBaseSink URL: https://github.com/apache/beam/pull/9071#issuecomment-517874112 I am guessing the other error for JavaPortabilityApi test is fine because the error is "14:13:05 Task 'javaPreCommitPortabilityApi' not found in root project 'beam'." It does not look like test exists in this branch. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 288330) Time Spent: 3h (was: 2h 50m) > LTS backport: Temporary directory for WriteOperation may not be unique in > FileBaseSink > -- > > Key: BEAM-7744 > URL: https://issues.apache.org/jira/browse/BEAM-7744 > Project: Beam > Issue Type: Bug > Components: io-java-files >Reporter: Heejong Lee >Assignee: Heejong Lee >Priority: Major > Fix For: 2.7.1 > > Time Spent: 3h > Remaining Estimate: 0h > > Tracking BEAM-7689 LTS backport for 2.7.1 release -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7744) LTS backport: Temporary directory for WriteOperation may not be unique in FileBaseSink
[ https://issues.apache.org/jira/browse/BEAM-7744?focusedWorklogId=288332&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288332 ] ASF GitHub Bot logged work on BEAM-7744: Author: ASF GitHub Bot Created on: 02/Aug/19 23:50 Start Date: 02/Aug/19 23:50 Worklog Time Spent: 10m Work Description: aaltay commented on pull request #9071: [BEAM-7744] LTS backport: Temporary directory for WriteOperation may not be unique in FileBaseSink URL: https://github.com/apache/beam/pull/9071 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 288332) Time Spent: 3h 10m (was: 3h) > LTS backport: Temporary directory for WriteOperation may not be unique in > FileBaseSink > -- > > Key: BEAM-7744 > URL: https://issues.apache.org/jira/browse/BEAM-7744 > Project: Beam > Issue Type: Bug > Components: io-java-files >Reporter: Heejong Lee >Assignee: Heejong Lee >Priority: Major > Fix For: 2.7.1 > > Time Spent: 3h 10m > Remaining Estimate: 0h > > Tracking BEAM-7689 LTS backport for 2.7.1 release -- This message was sent by Atlassian JIRA (v7.6.14#76016)
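The underlying BEAM-7689 bug is that a WriteOperation's temporary directory name may not be unique across concurrent operations. The idea behind the fix can be sketched with a hypothetical helper (not FileBasedSink's actual code):

```python
import uuid

def unique_temp_dir(base_path):
    # Derive the temp dir from a random UUID rather than something
    # shared such as a timestamp, so two concurrent WriteOperations
    # targeting the same output can never collide on the same directory.
    return '%s/.temp-beam-%s' % (base_path, uuid.uuid4().hex)

a = unique_temp_dir('gs://bucket/output')
b = unique_temp_dir('gs://bucket/output')
print(a != b)  # distinct directories for distinct operations
```

A collision matters because finalization deletes "unknown" files in the temp dir, so two operations sharing one directory can destroy each other's output.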
[jira] [Commented] (BEAM-7860) v1new ReadFromDatastore returns duplicates if keys are of mixed types
[ https://issues.apache.org/jira/browse/BEAM-7860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16899273#comment-16899273 ] Udi Meiri commented on BEAM-7860: - It should be possible to work around this by setting num_splits=1 in ReadFromDatastore. You'll get a SplitNotPossibleError and performance will suffer (depending on the size of the result), but it'll be correct. > v1new ReadFromDatastore returns duplicates if keys are of mixed types > - > > Key: BEAM-7860 > URL: https://issues.apache.org/jira/browse/BEAM-7860 > Project: Beam > Issue Type: Bug > Components: io-python-gcp >Affects Versions: 2.13.0 > Environment: Python 2.7 > Python 3.7 >Reporter: Niels Stender >Assignee: Udi Meiri >Priority: Major > > In the presence of mixed type keys, v1new ReadFromDatastore may return > duplicate items. The attached example returns 4 records, not the expected 3. > > {code:java} > // code placeholder > from __future__ import unicode_literals > import apache_beam as beam > from apache_beam.io.gcp.datastore.v1new.types import Key, Entity, Query > from apache_beam.io.gcp.datastore.v1new import datastoreio > config = dict(project='your-google-project', namespace='test') > def test_mixed(): > keys = [ > Key(['mixed', '10038260-iperm_eservice'], **config), > Key(['mixed', 4812224868188160], **config), > Key(['mixed', '99152975-pointshop'], **config) > ] > entities = map(lambda key: Entity(key=key), keys) > with beam.Pipeline() as p: > (p > | beam.Create(entities) > | datastoreio.WriteToDatastore(project=config['project']) > ) > query = Query(kind='mixed', **config) > with beam.Pipeline() as p: > (p > | datastoreio.ReadFromDatastore(query=query, num_splits=4) > | beam.io.WriteToText('tmp.txt', num_shards=1, > shard_name_template='') > ) > items = open('tmp.txt').read().strip().split('\n') > assert len(items) == 3, 'incorrect number of items' > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016)
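Besides the num_splits=1 workaround above, duplicates produced by overlapping splits could in principle also be dropped after the read. A plain-Python sketch of key-based deduplication (a stand-in for a Beam Distinct/GroupByKey-style transform, not the actual fix):

```python
def dedup_by_key(entities):
    """Keep the first entity seen per key; a toy stand-in for a
    deduplication transform applied downstream of ReadFromDatastore."""
    seen = set()
    for entity in entities:
        if entity['key'] not in seen:
            seen.add(entity['key'])
            yield entity

# Mixed str/int keys, with one duplicate as returned by overlapping splits.
rows = [
    {'key': '10038260-iperm_eservice'},
    {'key': 4812224868188160},
    {'key': '10038260-iperm_eservice'},  # duplicate
    {'key': '99152975-pointshop'},
]
unique = list(dedup_by_key(rows))
print(len(unique))  # 3
```

Note this keeps the pipeline parallel (unlike num_splits=1) but adds a shuffle-like step, so it trades read performance for a downstream grouping cost.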
[jira] [Comment Edited] (BEAM-7860) v1new ReadFromDatastore returns duplicates if keys are of mixed types
[ https://issues.apache.org/jira/browse/BEAM-7860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16899273#comment-16899273 ] Udi Meiri edited comment on BEAM-7860 at 8/2/19 11:18 PM: -- It should be possible to work around this by setting num_splits=1 in ReadFromDatastore. You'll get a SplitNotPossibleError and performance will suffer (depending on the size of the result), but it'll be correct. was (Author: udim): I should be possible to work around this by setting num_splits=1 in ReadFromDatastore. You'll get a SplitNotPossibleError and performance will suffer (depending on the size of the result), but it'll be correct. > v1new ReadFromDatastore returns duplicates if keys are of mixed types > - > > Key: BEAM-7860 > URL: https://issues.apache.org/jira/browse/BEAM-7860 > Project: Beam > Issue Type: Bug > Components: io-python-gcp >Affects Versions: 2.13.0 > Environment: Python 2.7 > Python 3.7 >Reporter: Niels Stender >Assignee: Udi Meiri >Priority: Major > > In the presence of mixed type keys, v1new ReadFromDatastore may return > duplicate items. The attached example returns 4 records, not the expected 3. 
> > {code:java} > // code placeholder > from __future__ import unicode_literals > import apache_beam as beam > from apache_beam.io.gcp.datastore.v1new.types import Key, Entity, Query > from apache_beam.io.gcp.datastore.v1new import datastoreio > config = dict(project='your-google-project', namespace='test') > def test_mixed(): > keys = [ > Key(['mixed', '10038260-iperm_eservice'], **config), > Key(['mixed', 4812224868188160], **config), > Key(['mixed', '99152975-pointshop'], **config) > ] > entities = map(lambda key: Entity(key=key), keys) > with beam.Pipeline() as p: > (p > | beam.Create(entities) > | datastoreio.WriteToDatastore(project=config['project']) > ) > query = Query(kind='mixed', **config) > with beam.Pipeline() as p: > (p > | datastoreio.ReadFromDatastore(query=query, num_splits=4) > | beam.io.WriteToText('tmp.txt', num_shards=1, > shard_name_template='') > ) > items = open('tmp.txt').read().strip().split('\n') > assert len(items) == 3, 'incorrect number of items' > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (BEAM-7860) v1new ReadFromDatastore returns duplicates if keys are of mixed types
[ https://issues.apache.org/jira/browse/BEAM-7860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16899271#comment-16899271 ] Udi Meiri commented on BEAM-7860: - Verified that this happens in both v1 and v1new. > v1new ReadFromDatastore returns duplicates if keys are of mixed types > - > > Key: BEAM-7860 > URL: https://issues.apache.org/jira/browse/BEAM-7860 > Project: Beam > Issue Type: Bug > Components: io-python-gcp >Affects Versions: 2.13.0 > Environment: Python 2.7 > Python 3.7 >Reporter: Niels Stender >Assignee: Udi Meiri >Priority: Major > > In the presence of mixed type keys, v1new ReadFromDatastore may return > duplicate items. The attached example returns 4 records, not the expected 3. > > {code:java} > // code placeholder > from __future__ import unicode_literals > import apache_beam as beam > from apache_beam.io.gcp.datastore.v1new.types import Key, Entity, Query > from apache_beam.io.gcp.datastore.v1new import datastoreio > config = dict(project='your-google-project', namespace='test') > def test_mixed(): > keys = [ > Key(['mixed', '10038260-iperm_eservice'], **config), > Key(['mixed', 4812224868188160], **config), > Key(['mixed', '99152975-pointshop'], **config) > ] > entities = map(lambda key: Entity(key=key), keys) > with beam.Pipeline() as p: > (p > | beam.Create(entities) > | datastoreio.WriteToDatastore(project=config['project']) > ) > query = Query(kind='mixed', **config) > with beam.Pipeline() as p: > (p > | datastoreio.ReadFromDatastore(query=query, num_splits=4) > | beam.io.WriteToText('tmp.txt', num_shards=1, > shard_name_template='') > ) > items = open('tmp.txt').read().strip().split('\n') > assert len(items) == 3, 'incorrect number of items' > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (BEAM-7887) Release automation improvement
Mark Liu created BEAM-7887: -- Summary: Release automation improvement Key: BEAM-7887 URL: https://issues.apache.org/jira/browse/BEAM-7887 Project: Beam Issue Type: Improvement Components: build-system, testing, website Reporter: Mark Liu This is an epic JIRA to track improvement work for Beam release automation. The main goals are: - Address pain points across the full release cycle in an automated way. - Make existing automation tools more user-friendly. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-5820) Vendor Calcite
[ https://issues.apache.org/jira/browse/BEAM-5820?focusedWorklogId=288319&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288319 ] ASF GitHub Bot logged work on BEAM-5820: Author: ASF GitHub Bot Created on: 02/Aug/19 23:03 Start Date: 02/Aug/19 23:03 Worklog Time Spent: 10m Work Description: apilloud commented on pull request #9189: [BEAM-5820] [PoC] vendor calcite URL: https://github.com/apache/beam/pull/9189#discussion_r310321188 ## File path: vendor/calcite-1_19_0/build.gradle ## @@ -0,0 +1,63 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * License); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +plugins { id 'org.apache.beam.vendor-java' } + +description = "Apache Beam :: Vendored Dependencies :: Calcite 1.19.0" + +group = "org.apache.beam" +version = "0.1" + +def calcite_version = "1.19.0" +def avatica_version = "1.15.0" +def prefix = "org.apache.beam.vendor.calcite.v1_19_0" + +vendorJava( + dependencies: [ +"org.apache.calcite:calcite-core:$calcite_version", +"org.apache.calcite:calcite-linq4j:$calcite_version", +"org.apache.calcite.avatica:avatica-core:$avatica_version", + ], + relocations: [ +"org.apache.calcite": "${prefix}.org.apache.calcite", + +// Calcite has Guava on its API surface +"com.google.common": "org.apache.beam.vendor.guava.v26_0_jre.com.google.thirdparty", Review comment: I don't understand this. Why are we moving "com.google.common" to "com.google.thirdparty"? This does not match guava: https://github.com/apache/beam/blob/f7cbf88f550c8918b99a13af4182d6efa07cd2b5/vendor/guava-26_0-jre/build.gradle#L29 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 288319) Time Spent: 20m (was: 10m) > Vendor Calcite > -- > > Key: BEAM-5820 > URL: https://issues.apache.org/jira/browse/BEAM-5820 > Project: Beam > Issue Type: Sub-task > Components: dsl-sql >Reporter: Kenneth Knowles >Assignee: Kai Jiang >Priority: Major > Time Spent: 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7060) Design Py3-compatible typehints annotation support in Beam 3.
[ https://issues.apache.org/jira/browse/BEAM-7060?focusedWorklogId=288318&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288318 ] ASF GitHub Bot logged work on BEAM-7060: Author: ASF GitHub Bot Created on: 02/Aug/19 23:01 Start Date: 02/Aug/19 23:01 Worklog Time Spent: 10m Work Description: udim commented on pull request #9223: [BEAM-7060] Introduce Python3-only test modules URL: https://github.com/apache/beam/pull/9223 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 288318) Time Spent: 8h 20m (was: 8h 10m) > Design Py3-compatible typehints annotation support in Beam 3. > - > > Key: BEAM-7060 > URL: https://issues.apache.org/jira/browse/BEAM-7060 > Project: Beam > Issue Type: Sub-task > Components: sdk-py-core >Reporter: Valentyn Tymofieiev >Assignee: Udi Meiri >Priority: Major > Time Spent: 8h 20m > Remaining Estimate: 0h > > The existing [Typehints implementation in > Beam|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/typehints/] heavily relies on internal details of the CPython implementation, and some of > the assumptions of this implementation broke as of Python 3.6; see for > example https://issues.apache.org/jira/browse/BEAM-6877, which makes > typehints support unusable on Python 3.6 as of now. The [Python 3 Kanban > Board|https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=245&view=detail] > lists several specific typehints-related breakages, prefixed with "TypeHints > Py3 Error". > We need to decide whether to: > - Deprecate the in-house typehints implementation. > - Continue to support the in-house implementation, which at this point is stale > code and has other known issues. 
> - Attempt to use some off-the-shelf libraries for supporting > type-annotations, like Pytype, Mypy, PyAnnotate. > WRT to this decision we also need to plan on immediate next steps to unblock > adoption of Beam for Python 3.6+ users. One potential option may be to have > Beam SDK ignore any typehint annotations on Py 3.6+. > cc: [~udim], [~altay], [~robertwb]. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
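The "off-the-shelf" option listed above builds on standard annotation introspection rather than CPython internals. A minimal sketch of that mechanism using only the stdlib `typing` module (this illustrates the general approach being weighed, not Beam's typehints API):

```python
import typing

def count_words(lines: typing.Iterable[str]) -> typing.Dict[str, int]:
    counts: typing.Dict[str, int] = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

# typing.get_type_hints is the documented, version-stable way to read
# annotations; it keeps working across 3.6+ where code that pokes at
# typing's internal attributes broke.
hints = typing.get_type_hints(count_words)
print(hints['lines'])
print(hints['return'])
```

An SDK following this route would derive input and output element types from these hints instead of from decorator-specific machinery.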
[jira] [Work logged] (BEAM-7832) ZetaSQL Dialect
[ https://issues.apache.org/jira/browse/BEAM-7832?focusedWorklogId=288314&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288314 ] ASF GitHub Bot logged work on BEAM-7832: Author: ASF GitHub Bot Created on: 02/Aug/19 22:57 Start Date: 02/Aug/19 22:57 Worklog Time Spent: 10m Work Description: apilloud commented on issue #9210: [BEAM-7832] Add ZetaSQL as a dialect in BeamSQL URL: https://github.com/apache/beam/pull/9210#issuecomment-517866543 The gRPC issue might be a ZetaSQL bug; I'll work with you to get an isolated repro on Monday. As for moving this to a separate package, I would consider that a blocker to merging this. ZetaSQL currently only works on a limited subset of Linux machines, and we don't want to break other users and developers. I see #9189 is waiting for my review, so I'm going to do that now to help move this forward. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 288314) Time Spent: 3h 10m (was: 3h) > ZetaSQL Dialect > --- > > Key: BEAM-7832 > URL: https://issues.apache.org/jira/browse/BEAM-7832 > Project: Beam > Issue Type: New Feature > Components: dsl-sql >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > Time Spent: 3h 10m > Remaining Estimate: 0h > > We can support the ZetaSQL (https://github.com/google/zetasql) dialect in BeamSQL. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7060) Design Py3-compatible typehints annotation support in Beam 3.
[ https://issues.apache.org/jira/browse/BEAM-7060?focusedWorklogId=288313&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288313 ] ASF GitHub Bot logged work on BEAM-7060: Author: ASF GitHub Bot Created on: 02/Aug/19 22:55 Start Date: 02/Aug/19 22:55 Worklog Time Spent: 10m Work Description: chamikaramj commented on issue #9223: [BEAM-7060] Introduce Python3-only test modules URL: https://github.com/apache/beam/pull/9223#issuecomment-517866303 LGTM. Thanks. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 288313) Time Spent: 8h 10m (was: 8h) > Design Py3-compatible typehints annotation support in Beam 3. > - > > Key: BEAM-7060 > URL: https://issues.apache.org/jira/browse/BEAM-7060 > Project: Beam > Issue Type: Sub-task > Components: sdk-py-core >Reporter: Valentyn Tymofieiev >Assignee: Udi Meiri >Priority: Major > Time Spent: 8h 10m > Remaining Estimate: 0h > > Existing [Typehints implementaiton in > Beam|[https://github.com/apache/beam/blob/master/sdks/python/apache_beam/typehints/ > ] heavily relies on internal details of CPython implementation, and some of > the assumptions of this implementation broke as of Python 3.6, see for > example: https://issues.apache.org/jira/browse/BEAM-6877, which makes > typehints support unusable on Python 3.6 as of now. [Python 3 Kanban > Board|https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=245&view=detail] > lists several specific typehints-related breakages, prefixed with "TypeHints > Py3 Error". > We need to decide whether to: > - Deprecate in-house typehints implementation. > - Continue to support in-house implementation, which at this point is a stale > code and has other known issues. 
> - Attempt to use some off-the-shelf libraries for supporting > type-annotations, like Pytype, Mypy, PyAnnotate. > WRT to this decision we also need to plan on immediate next steps to unblock > adoption of Beam for Python 3.6+ users. One potential option may be to have > Beam SDK ignore any typehint annotations on Py 3.6+. > cc: [~udim], [~altay], [~robertwb]. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7060) Design Py3-compatible typehints annotation support in Beam 3.
[ https://issues.apache.org/jira/browse/BEAM-7060?focusedWorklogId=288311&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288311 ] ASF GitHub Bot logged work on BEAM-7060: Author: ASF GitHub Bot Created on: 02/Aug/19 22:51 Start Date: 02/Aug/19 22:51 Worklog Time Spent: 10m Work Description: udim commented on issue #9223: [BEAM-7060] Introduce Python3-only test modules URL: https://github.com/apache/beam/pull/9223#issuecomment-517865810 Error in postcommit is a transient pip issue This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 288311) Time Spent: 8h (was: 7h 50m) > Design Py3-compatible typehints annotation support in Beam 3. > - > > Key: BEAM-7060 > URL: https://issues.apache.org/jira/browse/BEAM-7060 > Project: Beam > Issue Type: Sub-task > Components: sdk-py-core >Reporter: Valentyn Tymofieiev >Assignee: Udi Meiri >Priority: Major > Time Spent: 8h > Remaining Estimate: 0h > > Existing [Typehints implementaiton in > Beam|[https://github.com/apache/beam/blob/master/sdks/python/apache_beam/typehints/ > ] heavily relies on internal details of CPython implementation, and some of > the assumptions of this implementation broke as of Python 3.6, see for > example: https://issues.apache.org/jira/browse/BEAM-6877, which makes > typehints support unusable on Python 3.6 as of now. [Python 3 Kanban > Board|https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=245&view=detail] > lists several specific typehints-related breakages, prefixed with "TypeHints > Py3 Error". > We need to decide whether to: > - Deprecate in-house typehints implementation. > - Continue to support in-house implementation, which at this point is a stale > code and has other known issues. 
> - Attempt to use some off-the-shelf libraries for supporting > type-annotations, like Pytype, Mypy, PyAnnotate. > WRT to this decision we also need to plan on immediate next steps to unblock > adoption of Beam for Python 3.6+ users. One potential option may be to have > Beam SDK ignore any typehint annotations on Py 3.6+. > cc: [~udim], [~altay], [~robertwb]. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
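The "off-the-shelf" option discussed above amounts to reading standard PEP 484 annotations through the stable `typing` API instead of CPython internals. The sketch below is illustrative only (the helper `extract_hints` and the sample DoFn-style callable are hypothetical, not Beam code); it shows how argument and return hints can be recovered portably across Python 3 versions:

```python
# Hypothetical sketch, not Beam's implementation: recover PEP 484 hints via
# the public `typing` API, which works across CPython 3.x versions, instead
# of poking at interpreter internals that changed in Python 3.6.
import typing


def extract_hints(fn):
    """Return (argument hints, return hint) for a callable, portably."""
    hints = typing.get_type_hints(fn)  # also resolves string annotations
    return_hint = hints.pop('return', None)
    return hints, return_hint


def to_kv(element: str) -> typing.Tuple[str, int]:
    # Stand-in for a user transform with native Py3 annotations.
    return element, len(element)


args, ret = extract_hints(to_kv)
assert args == {'element': str}
assert ret == typing.Tuple[str, int]
```

Such a helper would let annotation extraction degrade gracefully (or be ignored, per the "ignore typehint annotations on Py 3.6+" stopgap mentioned above) without depending on CPython internals.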
[jira] [Work logged] (BEAM-7744) LTS backport: Temporary directory for WriteOperation may not be unique in FileBaseSink
[ https://issues.apache.org/jira/browse/BEAM-7744?focusedWorklogId=288305&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288305 ] ASF GitHub Bot logged work on BEAM-7744: Author: ASF GitHub Bot Created on: 02/Aug/19 22:40 Start Date: 02/Aug/19 22:40 Worklog Time Spent: 10m Work Description: ihji commented on issue #9071: [BEAM-7744] LTS backport: Temporary directory for WriteOperation may not be unique in FileBaseSink URL: https://github.com/apache/beam/pull/9071#issuecomment-517863870 @kennknowles @aaltay Java precommit passed. PTAL. Thanks! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 288305) Time Spent: 2h 50m (was: 2h 40m) > LTS backport: Temporary directory for WriteOperation may not be unique in > FileBaseSink > -- > > Key: BEAM-7744 > URL: https://issues.apache.org/jira/browse/BEAM-7744 > Project: Beam > Issue Type: Bug > Components: io-java-files >Reporter: Heejong Lee >Assignee: Heejong Lee >Priority: Major > Fix For: 2.7.1 > > Time Spent: 2h 50m > Remaining Estimate: 0h > > Tracking BEAM-7689 LTS backport for 2.7.1 release -- This message was sent by Atlassian JIRA (v7.6.14#76016)
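The bug being backported above is that a WriteOperation's temporary directory may collide with another operation's. A minimal sketch of the failure mode and the fix direction, with hypothetical function names (this is not FileBasedSink's actual code): deriving the temp path from a coarse timestamp alone lets two concurrent writes share a directory, so one job's cleanup can delete the other's in-flight files; adding a random component makes each operation's directory unique.

```python
# Illustrative only; function names and path scheme are assumptions, not
# FileBasedSink's real implementation.
import uuid


def colliding_temp_dir(base, timestamp_sec):
    # Timestamp-only naming: two operations started in the same second clash.
    return '%s/.temp-beam-%d' % (base, timestamp_sec)


def unique_temp_dir(base, timestamp_sec):
    # Timestamp plus a random UUID: unique per write operation.
    return '%s/.temp-beam-%d-%s' % (base, timestamp_sec, uuid.uuid4().hex)


a = colliding_temp_dir('/out', 1564790400)
b = colliding_temp_dir('/out', 1564790400)
assert a == b  # the collision the bug report describes

c = unique_temp_dir('/out', 1564790400)
d = unique_temp_dir('/out', 1564790400)
assert c != d  # distinct operations get distinct directories
```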
[jira] [Work logged] (BEAM-7877) Change the log level when deleting unknown temporary files in FileBasedSink
[ https://issues.apache.org/jira/browse/BEAM-7877?focusedWorklogId=288304&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288304 ] ASF GitHub Bot logged work on BEAM-7877: Author: ASF GitHub Bot Created on: 02/Aug/19 22:38 Start Date: 02/Aug/19 22:38 Worklog Time Spent: 10m Work Description: ihji commented on issue #9227: [BEAM-7877] Change the log level when deleting unknown temoprary files URL: https://github.com/apache/beam/pull/9227#issuecomment-517863575 run java precommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 288304) Time Spent: 50m (was: 40m) > Change the log level when deleting unknown temporary files in FileBasedSink > --- > > Key: BEAM-7877 > URL: https://issues.apache.org/jira/browse/BEAM-7877 > Project: Beam > Issue Type: Improvement > Components: io-java-files >Reporter: Heejong Lee >Assignee: Heejong Lee >Priority: Major > Time Spent: 50m > Remaining Estimate: 0h > > Currently the log level is info. A new proposed log level is warning since > deleting unknown temporary files is a bad sign and sometimes leads to data > loss. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7832) ZetaSQL Dialect
[ https://issues.apache.org/jira/browse/BEAM-7832?focusedWorklogId=288303&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288303 ] ASF GitHub Bot logged work on BEAM-7832: Author: ASF GitHub Bot Created on: 02/Aug/19 22:30 Start Date: 02/Aug/19 22:30 Worklog Time Spent: 10m Work Description: amaliujia commented on issue #9210: [BEAM-7832] Add ZetaSQL as a dialect in BeamSQL URL: https://github.com/apache/beam/pull/9210#issuecomment-517862226 During the process of moving zetasql planner to another package, I also experienced missing grpc dependency. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 288303) Time Spent: 3h (was: 2h 50m) > ZetaSQL Dialect > --- > > Key: BEAM-7832 > URL: https://issues.apache.org/jira/browse/BEAM-7832 > Project: Beam > Issue Type: New Feature > Components: dsl-sql >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > Time Spent: 3h > Remaining Estimate: 0h > > We can support ZetaSQL(https://github.com/google/zetasql) dialect in BeamSQL. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (BEAM-7886) Make row coder a standard coder and implement in python
Brian Hulette created BEAM-7886: --- Summary: Make row coder a standard coder and implement in python Key: BEAM-7886 URL: https://issues.apache.org/jira/browse/BEAM-7886 Project: Beam Issue Type: Improvement Components: sdk-java-core, sdk-py-core Reporter: Brian Hulette Assignee: Brian Hulette -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (BEAM-7886) Make row coder a standard coder and implement in python
[ https://issues.apache.org/jira/browse/BEAM-7886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brian Hulette updated BEAM-7886: Component/s: beam-model > Make row coder a standard coder and implement in python > --- > > Key: BEAM-7886 > URL: https://issues.apache.org/jira/browse/BEAM-7886 > Project: Beam > Issue Type: Improvement > Components: beam-model, sdk-java-core, sdk-py-core >Reporter: Brian Hulette >Assignee: Brian Hulette >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-6674) The JdbcIO source should produce schemas
[ https://issues.apache.org/jira/browse/BEAM-6674?focusedWorklogId=288298&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288298 ] ASF GitHub Bot logged work on BEAM-6674: Author: ASF GitHub Bot Created on: 02/Aug/19 21:58 Start Date: 02/Aug/19 21:58 Worklog Time Spent: 10m Work Description: terekete commented on issue #8725: [BEAM-6674] Add schema support to JdbcIO read URL: https://github.com/apache/beam/pull/8725#issuecomment-517855951 Hey all - I was wondering if there is an example or docs on how to use the new implementation to read rows using the Row Coder and Row Mapper and auto-infer the Beam schema? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 288298) Time Spent: 4h 10m (was: 4h) > The JdbcIO source should produce schemas > > > Key: BEAM-6674 > URL: https://issues.apache.org/jira/browse/BEAM-6674 > Project: Beam > Issue Type: Sub-task > Components: io-java-jdbc >Reporter: Reuven Lax >Assignee: Charith Ellawala >Priority: Major > Fix For: 2.14.0 > > Time Spent: 4h 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Resolved] (BEAM-7788) Go modules breaking precommits?
[ https://issues.apache.org/jira/browse/BEAM-7788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Udi Meiri resolved BEAM-7788. - Resolution: Fixed Fix Version/s: Not applicable > Go modules breaking precommits? > --- > > Key: BEAM-7788 > URL: https://issues.apache.org/jira/browse/BEAM-7788 > Project: Beam > Issue Type: Bug > Components: sdk-go, testing >Reporter: Udi Meiri >Priority: Major > Fix For: Not applicable > > Time Spent: 1h > Remaining Estimate: 0h > > Lots of errors like: > {code} > 01:32:36 > git clean -fdx # timeout=10 > 01:32:38 FATAL: Command "git clean -fdx" returned status code 1: > 01:32:38 stdout: > 01:32:38 stderr: warning: failed to remove > sdks/go/.gogradle/project_gopath/pkg/mod/go.opencensus.io@v0.22.0/zpages/example_test.go > ... [lots of files listed] > {code} > Examples: > https://builds.apache.org/job/beam_PreCommit_Python_PVR_Flink_Commit/3886/consoleFull > https://builds.apache.org/job/beam_PreCommit_Python_PVR_Flink_Commit/3884/consoleFull > https://builds.apache.org/job/beam_PreCommit_JavaPortabilityApi_Commit/3790/consoleFull > https://builds.apache.org/job/beam_PreCommit_JavaPortabilityApi_Commit/3788/consoleFull > cc: [~lostluck] -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7777) Stream Cost Model and RelNode Cost Estimation
[ https://issues.apache.org/jira/browse/BEAM-7777?focusedWorklogId=288274&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288274 ] ASF GitHub Bot logged work on BEAM-7777: Author: ASF GitHub Bot Created on: 02/Aug/19 21:29 Start Date: 02/Aug/19 21:29 Worklog Time Spent: 10m Work Description: akedin commented on pull request #9217: [BEAM-7777] Beam cost model URL: https://github.com/apache/beam/pull/9217#discussion_r310303783 ## File path: sdks/java/extensions/sql/src/test/java/org/apache/beam/sdk/extensions/sql/impl/planner/BeamCostModelTest.java ## @@ -0,0 +1,104 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.beam.sdk.extensions.sql.impl.planner; + +import org.junit.Assert; +import org.junit.Test; + +/** Tests the behavior of BeamCostModel. */ +public class BeamCostModelTest { + + // We should activate it when all the nodes are implementing the cost model. + // @Test Review comment: What's left to implement? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 288274) Time Spent: 50m (was: 40m) > Stream Cost Model and RelNode Cost Estimation > - > > Key: BEAM-7777 > URL: https://issues.apache.org/jira/browse/BEAM-7777 > Project: Beam > Issue Type: New Feature > Components: dsl-sql >Reporter: Alireza Samadianzakaria >Assignee: Alireza Samadianzakaria >Priority: Major > Time Spent: 50m > Remaining Estimate: 0h > > Currently the cost model is not suitable for streaming jobs. (it uses row > count and for estimating the output row count of each node it is using > calcite estimations that are only applicable to bounded data.) > We need to implement a new cost model and implement cost estimation for all > of our physical nodes. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
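The issue description above explains why a row-count cost model fails for streams: an unbounded relation has no final row count, so a streaming-aware model needs a different quantity, such as an estimated row rate. The following toy sketch (not BeamCostModel's actual formula; the class and comparison rule are assumptions for illustration) shows the idea of carrying both quantities and comparing plans on them:

```python
# Toy illustration of a streaming-aware cost: compare on estimated row rate
# (rows/second) first, falling back to row count for bounded relations.
# This is NOT the real BeamCostModel; it only sketches the concept.
from dataclasses import dataclass


@dataclass
class RelCost:
    rows: float        # estimated row count (total, for bounded inputs)
    rate: float = 0.0  # estimated rows/second; 0 for bounded inputs

    def is_cheaper_than(self, other):
        # Lexicographic comparison: rate dominates for streaming plans.
        return (self.rate, self.rows) < (other.rate, other.rows)


bounded_scan = RelCost(rows=1_000_000)
filtered_stream = RelCost(rows=100, rate=50.0)
raw_stream = RelCost(rows=1_000, rate=500.0)

assert filtered_stream.is_cheaper_than(raw_stream)
assert bounded_scan.is_cheaper_than(filtered_stream)  # rate 0 sorts first
```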
[jira] [Work logged] (BEAM-7777) Stream Cost Model and RelNode Cost Estimation
[ https://issues.apache.org/jira/browse/BEAM-7777?focusedWorklogId=288273&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288273 ] ASF GitHub Bot logged work on BEAM-7777: Author: ASF GitHub Bot Created on: 02/Aug/19 21:26 Start Date: 02/Aug/19 21:26 Worklog Time Spent: 10m Work Description: akedin commented on issue #9217: [BEAM-7777] Beam cost model URL: https://github.com/apache/beam/pull/9217#issuecomment-517849128 r: @kennknowles @kanterov This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 288273) Time Spent: 40m (was: 0.5h) > Stream Cost Model and RelNode Cost Estimation > - > > Key: BEAM-7777 > URL: https://issues.apache.org/jira/browse/BEAM-7777 > Project: Beam > Issue Type: New Feature > Components: dsl-sql >Reporter: Alireza Samadianzakaria >Assignee: Alireza Samadianzakaria >Priority: Major > Time Spent: 40m > Remaining Estimate: 0h > > Currently the cost model is not suitable for streaming jobs. (it uses row > count and for estimating the output row count of each node it is using > calcite estimations that are only applicable to bounded data.) > We need to implement a new cost model and implement cost estimation for all > of our physical nodes. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7777) Stream Cost Model and RelNode Cost Estimation
[ https://issues.apache.org/jira/browse/BEAM-7777?focusedWorklogId=288272&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288272 ] ASF GitHub Bot logged work on BEAM-7777: Author: ASF GitHub Bot Created on: 02/Aug/19 21:26 Start Date: 02/Aug/19 21:26 Worklog Time Spent: 10m Work Description: akedin commented on issue #9217: [BEAM-7777] Beam cost model URL: https://github.com/apache/beam/pull/9217#issuecomment-517849012 run sql postcommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 288272) Time Spent: 0.5h (was: 20m) > Stream Cost Model and RelNode Cost Estimation > - > > Key: BEAM-7777 > URL: https://issues.apache.org/jira/browse/BEAM-7777 > Project: Beam > Issue Type: New Feature > Components: dsl-sql >Reporter: Alireza Samadianzakaria >Assignee: Alireza Samadianzakaria >Priority: Major > Time Spent: 0.5h > Remaining Estimate: 0h > > Currently the cost model is not suitable for streaming jobs. (it uses row > count and for estimating the output row count of each node it is using > calcite estimations that are only applicable to bounded data.) > We need to implement a new cost model and implement cost estimation for all > of our physical nodes. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7744) LTS backport: Temporary directory for WriteOperation may not be unique in FileBaseSink
[ https://issues.apache.org/jira/browse/BEAM-7744?focusedWorklogId=288271&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288271 ] ASF GitHub Bot logged work on BEAM-7744: Author: ASF GitHub Bot Created on: 02/Aug/19 21:25 Start Date: 02/Aug/19 21:25 Worklog Time Spent: 10m Work Description: ihji commented on issue #9071: [BEAM-7744] LTS backport: Temporary directory for WriteOperation may not be unique in FileBaseSink URL: https://github.com/apache/beam/pull/9071#issuecomment-517845800 Run JavaPortabilityApi PreCommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 288271) Time Spent: 2h 40m (was: 2.5h) > LTS backport: Temporary directory for WriteOperation may not be unique in > FileBaseSink > -- > > Key: BEAM-7744 > URL: https://issues.apache.org/jira/browse/BEAM-7744 > Project: Beam > Issue Type: Bug > Components: io-java-files >Reporter: Heejong Lee >Assignee: Heejong Lee >Priority: Major > Fix For: 2.7.1 > > Time Spent: 2h 40m > Remaining Estimate: 0h > > Tracking BEAM-7689 LTS backport for 2.7.1 release -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7060) Design Py3-compatible typehints annotation support in Beam 3.
[ https://issues.apache.org/jira/browse/BEAM-7060?focusedWorklogId=288270&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288270 ] ASF GitHub Bot logged work on BEAM-7060: Author: ASF GitHub Bot Created on: 02/Aug/19 21:25 Start Date: 02/Aug/19 21:25 Worklog Time Spent: 10m Work Description: udim commented on issue #9223: [BEAM-7060] Introduce Python3-only test modules URL: https://github.com/apache/beam/pull/9223#issuecomment-517848786 Run Python Dataflow ValidatesContainer This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 288270) Time Spent: 7h 50m (was: 7h 40m) > Design Py3-compatible typehints annotation support in Beam 3. > - > > Key: BEAM-7060 > URL: https://issues.apache.org/jira/browse/BEAM-7060 > Project: Beam > Issue Type: Sub-task > Components: sdk-py-core >Reporter: Valentyn Tymofieiev >Assignee: Udi Meiri >Priority: Major > Time Spent: 7h 50m > Remaining Estimate: 0h > > The existing [Typehints implementation in > Beam|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/typehints/] relies heavily on internal details of the CPython implementation, and some of > the assumptions of this implementation broke as of Python 3.6, see for > example: https://issues.apache.org/jira/browse/BEAM-6877, which makes > typehints support unusable on Python 3.6 as of now. The [Python 3 Kanban > Board|https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=245&view=detail] > lists several specific typehints-related breakages, prefixed with "TypeHints > Py3 Error". > We need to decide whether to: > - Deprecate the in-house typehints implementation. > - Continue to support the in-house implementation, which at this point is stale > code and has other known issues. 
> - Attempt to use some off-the-shelf libraries for supporting > type-annotations, like Pytype, Mypy, PyAnnotate. > WRT to this decision we also need to plan on immediate next steps to unblock > adoption of Beam for Python 3.6+ users. One potential option may be to have > Beam SDK ignore any typehint annotations on Py 3.6+. > cc: [~udim], [~altay], [~robertwb]. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7060) Design Py3-compatible typehints annotation support in Beam 3.
[ https://issues.apache.org/jira/browse/BEAM-7060?focusedWorklogId=288268&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288268 ] ASF GitHub Bot logged work on BEAM-7060: Author: ASF GitHub Bot Created on: 02/Aug/19 21:24 Start Date: 02/Aug/19 21:24 Worklog Time Spent: 10m Work Description: udim commented on issue #9223: [BEAM-7060] Introduce Python3-only test modules URL: https://github.com/apache/beam/pull/9223#issuecomment-517848618 run python 2 postcommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 288268) Time Spent: 7h 40m (was: 7.5h) > Design Py3-compatible typehints annotation support in Beam 3. > - > > Key: BEAM-7060 > URL: https://issues.apache.org/jira/browse/BEAM-7060 > Project: Beam > Issue Type: Sub-task > Components: sdk-py-core >Reporter: Valentyn Tymofieiev >Assignee: Udi Meiri >Priority: Major > Time Spent: 7h 40m > Remaining Estimate: 0h > > The existing [Typehints implementation in > Beam|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/typehints/] relies heavily on internal details of the CPython implementation, and some of > the assumptions of this implementation broke as of Python 3.6, see for > example: https://issues.apache.org/jira/browse/BEAM-6877, which makes > typehints support unusable on Python 3.6 as of now. The [Python 3 Kanban > Board|https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=245&view=detail] > lists several specific typehints-related breakages, prefixed with "TypeHints > Py3 Error". > We need to decide whether to: > - Deprecate the in-house typehints implementation. > - Continue to support the in-house implementation, which at this point is stale > code and has other known issues. 
> - Attempt to use some off-the-shelf libraries for supporting > type-annotations, like Pytype, Mypy, PyAnnotate. > WRT to this decision we also need to plan on immediate next steps to unblock > adoption of Beam for Python 3.6+ users. One potential option may be to have > Beam SDK ignore any typehint annotations on Py 3.6+. > cc: [~udim], [~altay], [~robertwb]. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7744) LTS backport: Temporary directory for WriteOperation may not be unique in FileBaseSink
[ https://issues.apache.org/jira/browse/BEAM-7744?focusedWorklogId=288262&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288262 ] ASF GitHub Bot logged work on BEAM-7744: Author: ASF GitHub Bot Created on: 02/Aug/19 21:12 Start Date: 02/Aug/19 21:12 Worklog Time Spent: 10m Work Description: ihji commented on issue #9071: [BEAM-7744] LTS backport: Temporary directory for WriteOperation may not be unique in FileBaseSink URL: https://github.com/apache/beam/pull/9071#issuecomment-517845800 Run JavaPortabilityApi PreCommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 288262) Time Spent: 2.5h (was: 2h 20m) > LTS backport: Temporary directory for WriteOperation may not be unique in > FileBaseSink > -- > > Key: BEAM-7744 > URL: https://issues.apache.org/jira/browse/BEAM-7744 > Project: Beam > Issue Type: Bug > Components: io-java-files >Reporter: Heejong Lee >Assignee: Heejong Lee >Priority: Major > Fix For: 2.7.1 > > Time Spent: 2.5h > Remaining Estimate: 0h > > Tracking BEAM-7689 LTS backport for 2.7.1 release -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7744) LTS backport: Temporary directory for WriteOperation may not be unique in FileBaseSink
[ https://issues.apache.org/jira/browse/BEAM-7744?focusedWorklogId=288261&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288261 ] ASF GitHub Bot logged work on BEAM-7744: Author: ASF GitHub Bot Created on: 02/Aug/19 21:12 Start Date: 02/Aug/19 21:12 Worklog Time Spent: 10m Work Description: ihji commented on issue #9071: [BEAM-7744] LTS backport: Temporary directory for WriteOperation may not be unique in FileBaseSink URL: https://github.com/apache/beam/pull/9071#issuecomment-517845765 Run Java PreCommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 288261) Time Spent: 2h 20m (was: 2h 10m) > LTS backport: Temporary directory for WriteOperation may not be unique in > FileBaseSink > -- > > Key: BEAM-7744 > URL: https://issues.apache.org/jira/browse/BEAM-7744 > Project: Beam > Issue Type: Bug > Components: io-java-files >Reporter: Heejong Lee >Assignee: Heejong Lee >Priority: Major > Fix For: 2.7.1 > > Time Spent: 2h 20m > Remaining Estimate: 0h > > Tracking BEAM-7689 LTS backport for 2.7.1 release -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-562) DoFn Reuse: Add new DoFn setup and teardown to python SDK
[ https://issues.apache.org/jira/browse/BEAM-562?focusedWorklogId=288259&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288259 ] ASF GitHub Bot logged work on BEAM-562: --- Author: ASF GitHub Bot Created on: 02/Aug/19 21:11 Start Date: 02/Aug/19 21:11 Worklog Time Spent: 10m Work Description: NikeNano commented on issue #7994: [BEAM-562] Add DoFn.setup and DoFn.teardown to Python SDK URL: https://github.com/apache/beam/pull/7994#issuecomment-517845557 Created issues for DoFn.setup, https://issues.apache.org/jira/browse/BEAM-7885 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 288259) Time Spent: 11.5h (was: 11h 20m) > DoFn Reuse: Add new DoFn setup and teardown to python SDK > - > > Key: BEAM-562 > URL: https://issues.apache.org/jira/browse/BEAM-562 > Project: Beam > Issue Type: New Feature > Components: sdk-py-core >Reporter: Ahmet Altay >Assignee: Yifan Mai >Priority: Major > Labels: sdk-consistency > Fix For: 2.14.0 > > Time Spent: 11.5h > Remaining Estimate: 0h > > Java SDK added setup and teardown methods to the DoFns. This makes DoFns > reusable and provide performance improvements. Python SDK should add support > for these new DoFn methods: > Proposal doc: > https://docs.google.com/document/d/1LLQqggSePURt3XavKBGV7SZJYQ4NW8yCu63lBchzMRk/edit?ts=5771458f# -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7877) Change the log level when deleting unknown temporary files in FileBasedSink
[ https://issues.apache.org/jira/browse/BEAM-7877?focusedWorklogId=288258&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288258 ] ASF GitHub Bot logged work on BEAM-7877: Author: ASF GitHub Bot Created on: 02/Aug/19 21:10 Start Date: 02/Aug/19 21:10 Worklog Time Spent: 10m Work Description: ihji commented on pull request #9227: [BEAM-7877] Change the log level when deleting unknown temoprary files URL: https://github.com/apache/beam/pull/9227#discussion_r310299085 ## File path: sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileBasedSink.java ## @@ -791,7 +791,7 @@ final void removeTemporaryFiles( FileSystems.match(Collections.singletonList(tempDir.toString() + "*"))); for (Metadata matchResult : singleMatch.metadata()) { if (allMatches.add(matchResult.resourceId())) { - LOG.info("Will also remove unknown temporary file {}", matchResult.resourceId()); + LOG.warn("Will also remove unknown temporary file {}", matchResult.resourceId()); Review comment: Thanks for the comment. Updated the log message. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 288258) Time Spent: 40m (was: 0.5h) > Change the log level when deleting unknown temporary files in FileBasedSink > --- > > Key: BEAM-7877 > URL: https://issues.apache.org/jira/browse/BEAM-7877 > Project: Beam > Issue Type: Improvement > Components: io-java-files >Reporter: Heejong Lee >Assignee: Heejong Lee >Priority: Major > Time Spent: 40m > Remaining Estimate: 0h > > Currently the log level is info. A new proposed log level is warning since > deleting unknown temporary files is a bad sign and sometimes leads to data > loss. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (BEAM-7885) DoFn.setup() don't run for streaming jobs.
niklas Hansson created BEAM-7885: Summary: DoFn.setup() don't run for streaming jobs. Key: BEAM-7885 URL: https://issues.apache.org/jira/browse/BEAM-7885 Project: Beam Issue Type: Bug Components: sdk-py-core Affects Versions: 2.14.0 Environment: Python Reporter: niklas Hansson >From version 2.14.0 the Python SDK has introduced setup and teardown for DoFn, documented as: "Called to prepare an instance for processing bundles of >elements. This is a good place to initialize transient in-memory resources, >such as network connections." However, when trying to use it for an unbounded job (Pub/Sub source) it seems like DoFn.setup() is never called and the resources are never initialized. Instead I get: AttributeError: 'NoneType' object has no attribute 'predict' [while running 'transform the data'] My source code: [https://github.com/NikeNano/DataflowSklearnStreaming] I am happy to contribute example code for how to use setup as soon as I get it running :) -- This message was sent by Atlassian JIRA (v7.6.14#76016)
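The AttributeError in the report above is the signature of an attribute that setup() was supposed to initialize still being None when process() runs. A hedged sketch of a defensive workaround (the class and names below are illustrative, not the reporter's pipeline, and a plain class stands in for beam.DoFn to keep it dependency-free): lazily initialize in process() if the runner never called setup().

```python
# Illustrative only: if a runner skips DoFn.setup(), attributes assigned
# there stay None, yielding "'NoneType' object has no attribute ...".
# Guarding in process() keeps the transform working either way.
class ModelDoFn:  # stands in for beam.DoFn in this dependency-free sketch
    def __init__(self):
        self._model = None  # real init deferred to setup()

    def setup(self):
        # Heavy per-instance initialization (e.g. loading an sklearn model).
        self._model = lambda x: x * 2  # placeholder for a real model

    def process(self, element):
        if self._model is None:  # guard: runner may not have called setup()
            self.setup()
        yield self._model(element)


fn = ModelDoFn()
# Simulate a runner that never called setup(): process() still works.
assert list(fn.process(21)) == [42]
```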
[jira] [Work logged] (BEAM-7866) Python MongoDB IO performance and correctness issues
[ https://issues.apache.org/jira/browse/BEAM-7866?focusedWorklogId=288246&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288246 ] ASF GitHub Bot logged work on BEAM-7866: Author: ASF GitHub Bot Created on: 02/Aug/19 20:56 Start Date: 02/Aug/19 20:56 Worklog Time Spent: 10m Work Description: y1chi commented on pull request #9233: [BEAM-7866] Fix python ReadFromMongoDB potential data loss issue URL: https://github.com/apache/beam/pull/9233#discussion_r310295208 ## File path: sdks/python/apache_beam/io/mongodbio.py ## @@ -139,50 +143,77 @@ def __init__(self, self.filter = filter self.projection = projection self.spec = extra_client_params -self.doc_count = self._get_document_count() -self.avg_doc_size = self._get_avg_document_size() -self.client = None def estimate_size(self): -return self.avg_doc_size * self.doc_count +with MongoClient(self.uri, **self.spec) as client: + size = client[self.db].command('collstats', self.coll).get('size') + if size is None or size <= 0: +raise ValueError('Collection %s not found or total doc size is ' + 'incorrect' % self.coll) + return size def split(self, desired_bundle_size, start_position=None, stop_position=None): # use document cursor index as the start and stop positions if start_position is None: - start_position = 0 + epoch = datetime.datetime(1970, 1, 1) + start_position = objectid.ObjectId.from_datetime(epoch) if stop_position is None: - stop_position = self.doc_count + last_doc_id = self._get_last_document_id() + # add one sec to make sure the last document is not excluded + last_timestamp_plus_one_sec = (last_doc_id.generation_time + + datetime.timedelta(seconds=1)) + stop_position = objectid.ObjectId.from_datetime( + last_timestamp_plus_one_sec) -# get an estimate on how many documents should be included in a split batch -desired_bundle_count = desired_bundle_size // self.avg_doc_size +desired_bundle_size_in_mb = desired_bundle_size // 1024 // 1024 +split_keys = 
self._get_split_keys(desired_bundle_size_in_mb, start_position, + stop_position) bundle_start = start_position -while bundle_start < stop_position: - bundle_end = min(stop_position, bundle_start + desired_bundle_count) - yield iobase.SourceBundle(weight=bundle_end - bundle_start, +for split_key_id in split_keys: + bundle_end = min(stop_position, split_key_id) + if bundle_start is None and bundle_start < stop_position: +return + yield iobase.SourceBundle(weight=desired_bundle_size_in_mb, source=self, start_position=bundle_start, stop_position=bundle_end) bundle_start = bundle_end +# add range of last split_key to stop_position +if bundle_start < stop_position: + yield iobase.SourceBundle(weight=desired_bundle_size_in_mb, +source=self, +start_position=bundle_start, +stop_position=stop_position) def get_range_tracker(self, start_position, stop_position): if start_position is None: - start_position = 0 + epoch = datetime.datetime(1970, 1, 1) Review comment: My understanding is when the first split happens. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 288246) Time Spent: 2h 20m (was: 2h 10m) > Python MongoDB IO performance and correctness issues > > > Key: BEAM-7866 > URL: https://issues.apache.org/jira/browse/BEAM-7866 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Eugene Kirpichov >Assignee: Yichi Zhang >Priority: Blocker > Fix For: 2.15.0 > > Time Spent: 2h 20m > Remaining Estimate: 0h > > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py > splits the query result by computing number of results in constructor, and > then in each reader re-executing the whole query and getting an index > sub-range of those results. 
> This is broken in several critical ways: > - The order of query results returned by find() is not necessarily > deterministic, so the i
[jira] [Work logged] (BEAM-7866) Python MongoDB IO performance and correctness issues
[ https://issues.apache.org/jira/browse/BEAM-7866?focusedWorklogId=288245&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288245 ] ASF GitHub Bot logged work on BEAM-7866: Author: ASF GitHub Bot Created on: 02/Aug/19 20:54 Start Date: 02/Aug/19 20:54 Worklog Time Spent: 10m Work Description: y1chi commented on pull request #9233: [BEAM-7866] Fix python ReadFromMongoDB potential data loss issue URL: https://github.com/apache/beam/pull/9233#discussion_r310294780 ## File path: sdks/python/apache_beam/io/mongodbio.py ## @@ -139,50 +143,77 @@ def __init__(self, self.filter = filter self.projection = projection self.spec = extra_client_params -self.doc_count = self._get_document_count() -self.avg_doc_size = self._get_avg_document_size() -self.client = None def estimate_size(self): -return self.avg_doc_size * self.doc_count +with MongoClient(self.uri, **self.spec) as client: + size = client[self.db].command('collstats', self.coll).get('size') + if size is None or size <= 0: +raise ValueError('Collection %s not found or total doc size is ' + 'incorrect' % self.coll) + return size def split(self, desired_bundle_size, start_position=None, stop_position=None): # use document cursor index as the start and stop positions if start_position is None: - start_position = 0 + epoch = datetime.datetime(1970, 1, 1) Review comment: makes sense, thanks for the suggest. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 288245) Time Spent: 2h 10m (was: 2h) > Python MongoDB IO performance and correctness issues > > > Key: BEAM-7866 > URL: https://issues.apache.org/jira/browse/BEAM-7866 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Eugene Kirpichov >Assignee: Yichi Zhang >Priority: Blocker > Fix For: 2.15.0 > > Time Spent: 2h 10m > Remaining Estimate: 0h > > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py > splits the query result by computing number of results in constructor, and > then in each reader re-executing the whole query and getting an index > sub-range of those results. > This is broken in several critical ways: > - The order of query results returned by find() is not necessarily > deterministic, so the idea of index ranges on it is meaningless: each shard > may basically get random, possibly overlapping subsets of the total results > - Even if you add order by `_id`, the database may be changing concurrently > to reading and splitting. E.g. if the database contained documents with ids > 10 20 30 40 50, and this was split into shards 0..2 and 3..5 (under the > assumption that these shards would contain respectively 10 20 30, and 40 50), > and then suppose shard 10 20 30 is read and then document 25 is inserted - > then the 3..5 shard will read 30 40 50, i.e. document 30 is duplicated and > document 25 is lost. > - Every shard re-executes the query and skips the first start_offset items, > which in total is quadratic complexity > - The query is first executed in the constructor in order to count results, > which 1) means the constructor can be super slow and 2) it won't work at all > if the database is unavailable at the time the pipeline is constructed (e.g. > if this is a template). 
> Unfortunately, none of these issues are caught by SourceTestUtils: this class > has extensive coverage with it, and the tests pass. This is because the tests > return the same results in the same order. I don't know how to catch this > automatically, and I don't know how to catch the performance issue > automatically, but these would all be important follow-up items after the > actual fix. > CC: [~chamikara] as reviewer. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
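The duplication/loss scenario described in the BEAM-7866 report above can be reproduced in a few lines of plain Python. This is an illustrative sketch only, not Beam or PyMongo code: integers stand in for `_id` values, and the two functions contrast the old index-slicing reader with the range-filtering reader introduced by the fix.

```python
def read_by_index(docs, start, stop):
    # Old mongodbio approach: each shard re-runs the whole query and
    # slices by result position (skip `start` items, take `stop - start`).
    return sorted(docs)[start:stop]

def read_by_id_range(docs, lo, hi):
    # Fixed approach: each shard filters on an immutable _id range,
    # analogous to {'_id': {'$gte': lo, '$lt': hi}}.
    return sorted(d for d in docs if lo <= d < hi)

docs = {10, 20, 30, 40, 50}                        # documents keyed by _id
shard1_by_index = read_by_index(docs, 0, 3)        # reads 10, 20, 30
shard1_by_range = read_by_id_range(docs, 0, 40)    # reads 10, 20, 30
docs.add(25)                                       # concurrent insert between shard reads
shard2_by_index = read_by_index(docs, 3, 5)        # reads 30, 40: 30 duplicated, 25 and 50 lost
shard2_by_range = read_by_id_range(docs, 40, 100)  # reads 40, 50: ranges never overlap
```

Because `_id` ranges partition the keyspace rather than result positions, a concurrent insert can at worst be missed by a shard that already finished its range; it can never cause duplication or shift other documents between shards.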
[jira] [Work logged] (BEAM-562) DoFn Reuse: Add new DoFn setup and teardown to python SDK
[ https://issues.apache.org/jira/browse/BEAM-562?focusedWorklogId=288243&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288243 ] ASF GitHub Bot logged work on BEAM-562: --- Author: ASF GitHub Bot Created on: 02/Aug/19 20:45 Start Date: 02/Aug/19 20:45 Worklog Time Spent: 10m Work Description: NikeNano commented on issue #7994: [BEAM-562] Add DoFn.setup and DoFn.teardown to Python SDK URL: https://github.com/apache/beam/pull/7994#issuecomment-517839074 OK, thanks @aaltay. Will investigate further and file an issue if i don't get it to work. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 288243) Time Spent: 11h 20m (was: 11h 10m) > DoFn Reuse: Add new DoFn setup and teardown to python SDK > - > > Key: BEAM-562 > URL: https://issues.apache.org/jira/browse/BEAM-562 > Project: Beam > Issue Type: New Feature > Components: sdk-py-core >Reporter: Ahmet Altay >Assignee: Yifan Mai >Priority: Major > Labels: sdk-consistency > Fix For: 2.14.0 > > Time Spent: 11h 20m > Remaining Estimate: 0h > > Java SDK added setup and teardown methods to the DoFns. This makes DoFns > reusable and provide performance improvements. Python SDK should add support > for these new DoFn methods: > Proposal doc: > https://docs.google.com/document/d/1LLQqggSePURt3XavKBGV7SZJYQ4NW8yCu63lBchzMRk/edit?ts=5771458f# -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-562) DoFn Reuse: Add new DoFn setup and teardown to python SDK
[ https://issues.apache.org/jira/browse/BEAM-562?focusedWorklogId=288242&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288242 ] ASF GitHub Bot logged work on BEAM-562: --- Author: ASF GitHub Bot Created on: 02/Aug/19 20:44 Start Date: 02/Aug/19 20:44 Worklog Time Spent: 10m Work Description: aaltay commented on issue #7994: [BEAM-562] Add DoFn.setup and DoFn.teardown to Python SDK URL: https://github.com/apache/beam/pull/7994#issuecomment-517838714 @NikeNano it should work with all dofns regardless of the types of sources. If that is not working, please file an issue. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 288242) Time Spent: 11h 10m (was: 11h) > DoFn Reuse: Add new DoFn setup and teardown to python SDK > - > > Key: BEAM-562 > URL: https://issues.apache.org/jira/browse/BEAM-562 > Project: Beam > Issue Type: New Feature > Components: sdk-py-core >Reporter: Ahmet Altay >Assignee: Yifan Mai >Priority: Major > Labels: sdk-consistency > Fix For: 2.14.0 > > Time Spent: 11h 10m > Remaining Estimate: 0h > > Java SDK added setup and teardown methods to the DoFns. This makes DoFns > reusable and provide performance improvements. Python SDK should add support > for these new DoFn methods: > Proposal doc: > https://docs.google.com/document/d/1LLQqggSePURt3XavKBGV7SZJYQ4NW8yCu63lBchzMRk/edit?ts=5771458f# -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-562) DoFn Reuse: Add new DoFn setup and teardown to python SDK
[ https://issues.apache.org/jira/browse/BEAM-562?focusedWorklogId=288241&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288241 ] ASF GitHub Bot logged work on BEAM-562: --- Author: ASF GitHub Bot Created on: 02/Aug/19 20:40 Start Date: 02/Aug/19 20:40 Worklog Time Spent: 10m Work Description: NikeNano commented on issue #7994: [BEAM-562] Add DoFn.setup and DoFn.teardown to Python SDK URL: https://github.com/apache/beam/pull/7994#issuecomment-517837763 @yifanmai is this also expected to work for unbounded sources? I don't get it to work when reading from pubsub This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 288241) Time Spent: 11h (was: 10h 50m) > DoFn Reuse: Add new DoFn setup and teardown to python SDK > - > > Key: BEAM-562 > URL: https://issues.apache.org/jira/browse/BEAM-562 > Project: Beam > Issue Type: New Feature > Components: sdk-py-core >Reporter: Ahmet Altay >Assignee: Yifan Mai >Priority: Major > Labels: sdk-consistency > Fix For: 2.14.0 > > Time Spent: 11h > Remaining Estimate: 0h > > Java SDK added setup and teardown methods to the DoFns. This makes DoFns > reusable and provide performance improvements. Python SDK should add support > for these new DoFn methods: > Proposal doc: > https://docs.google.com/document/d/1LLQqggSePURt3XavKBGV7SZJYQ4NW8yCu63lBchzMRk/edit?ts=5771458f# -- This message was sent by Atlassian JIRA (v7.6.14#76016)
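The DoFn lifecycle discussed in the BEAM-562 thread above can be sketched with a plain-Python stand-in. The method names mirror `apache_beam.DoFn`, but the toy "runner" below is purely illustrative of when each hook fires; it is not the SDK harness, and real runners give no ordering guarantees beyond setup-before-bundles and teardown-at-discard.

```python
class MyDoFn:
    def setup(self):
        # Called once per DoFn instance, e.g. to open a connection.
        self.resource = []          # stand-in for an expensive resource

    def start_bundle(self):
        self.bundle = []

    def process(self, element):
        self.bundle.append(element)
        yield element * 2

    def finish_bundle(self):
        self.resource.extend(self.bundle)

    def teardown(self):
        # Called once when the instance is discarded, e.g. to close the
        # connection opened in setup().
        self.resource.clear()

def run_bundles(fn, bundles):
    # Toy runner: setup once, bundle hooks per bundle, teardown once.
    fn.setup()
    out = []
    for bundle in bundles:
        fn.start_bundle()
        for element in bundle:
            out.extend(fn.process(element))
        fn.finish_bundle()
    fn.teardown()
    return out
```

For example, `run_bundles(MyDoFn(), [[1, 2], [3]])` processes two bundles through one instance, calling `setup`/`teardown` once but the bundle hooks twice.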
[jira] [Resolved] (BEAM-7878) Refine Spark runner dependencies
[ https://issues.apache.org/jira/browse/BEAM-7878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ismaël Mejía resolved BEAM-7878. Resolution: Fixed > Refine Spark runner dependencies > > > Key: BEAM-7878 > URL: https://issues.apache.org/jira/browse/BEAM-7878 > Project: Beam > Issue Type: Improvement > Components: runner-spark >Reporter: Ismaël Mejía >Assignee: Ismaël Mejía >Priority: Minor > Fix For: 2.16.0 > > Time Spent: 2h 10m > Remaining Estimate: 0h > > The Spark runner has more dependencies than it needs: > * The jackson_module_scala dependency is marked as compile but it is only > needed at runtime so it ends up leaking this module. > * The dropwizard, scala-library and commons-compress dependencies are > declared but unused. > * The Kafka dependency version is not aligned with the rest of Beam. > * The Kafka and zookeeper dependencies are used only for the tests but > currently are provided -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=288236&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288236 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 02/Aug/19 20:35 Start Date: 02/Aug/19 20:35 Worklog Time Spent: 10m Work Description: robinyqiu commented on pull request #9144: [BEAM-7013] Integrating ZetaSketch's HLL++ algorithm with Beam URL: https://github.com/apache/beam/pull/9144#discussion_r310289052 ## File path: sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountInitFn.java ## @@ -0,0 +1,152 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.beam.sdk.extensions.zetasketch; + +import static org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.Preconditions.checkArgument; + +import com.google.zetasketch.HyperLogLogPlusPlus; +import com.google.zetasketch.shaded.com.google.protobuf.ByteString; +import org.apache.beam.sdk.coders.Coder; +import org.apache.beam.sdk.coders.CoderRegistry; +import org.apache.beam.sdk.transforms.Combine; + +/** + * {@link Combine.CombineFn} for the {@link HllCount.Init} combiner. 
+ * + * @param type of input values to the function + * @param type of the HLL++ sketch to compute + */ +abstract class HllCountInitFn +extends Combine.CombineFn, byte[]> { + + private int precision; + + private HllCountInitFn() { +setPrecision(HllCount.DEFAULT_PRECISION); + } + + int getPrecision() { +return precision; + } + + void setPrecision(int precision) { Review comment: Done! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 288236) Time Spent: 10h 10m (was: 10h) > A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 10h 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=288235&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288235 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 02/Aug/19 20:34 Start Date: 02/Aug/19 20:34 Worklog Time Spent: 10m Work Description: robinyqiu commented on pull request #9144: [BEAM-7013] Integrating ZetaSketch's HLL++ algorithm with Beam URL: https://github.com/apache/beam/pull/9144#discussion_r310288860 ## File path: sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCount.java ## @@ -0,0 +1,350 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.beam.sdk.extensions.zetasketch; + +import com.google.zetasketch.HyperLogLogPlusPlus; +import org.apache.beam.sdk.annotations.Experimental; +import org.apache.beam.sdk.transforms.Combine; +import org.apache.beam.sdk.transforms.DoFn; +import org.apache.beam.sdk.transforms.PTransform; +import org.apache.beam.sdk.transforms.ParDo; +import org.apache.beam.sdk.values.KV; +import org.apache.beam.sdk.values.PCollection; + +/** + * {@code PTransform}s to compute HyperLogLogPlusPlus (HLL++) sketches on data streams based on the + * https://github.com/google/zetasketch";>ZetaSketch implementation. + * + * HLL++ is an algorithm implemented by Google that estimates the count of distinct elements in a + * data stream. HLL++ requires significantly less memory than the linear memory needed for exact + * computation, at the cost of a small error. Cardinalities of arbitrary breakdowns can be computed + * using the HLL++ sketch. See this http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/40671.pdf";>published + * paper for details about the algorithm. + * + * HLL++ functions are also supported in https://cloud.google.com/bigquery/docs/reference/standard-sql/hll_functions";>Google Cloud + * BigQuery. Using the {@code HllCount PTransform}s makes the interoperation with BigQuery + * easier. + * + * For detailed design of this class, see https://s.apache.org/hll-in-beam. 
+ * + * Examples + * + * Example 1: Create long-type sketch for a {@code PCollection} and specify precision + * + * {@code + * PCollection input = ...; + * int p = ...; + * PCollection sketch = input.apply(HllCount.Init.longSketch().withPrecision(p).globally()); + * } + * + * Example 2: Create bytes-type sketch for a {@code PCollection>} + * + * {@code + * PCollection> input = ...; + * PCollection> sketch = input.apply(HllCount.Init.bytesSketch().perKey()); + * } + * + * Example 3: Merge existing sketches in a {@code PCollection} into a new one + * + * {@code + * PCollection sketches = ...; + * PCollection mergedSketch = sketches.apply(HllCount.MergePartial.globally()); + * } + * + * Example 4: Estimates the count of distinct elements in a {@code PCollection} + * + * {@code + * PCollection input = ...; + * PCollection countDistinct = + * input.apply(HllCount.Init.stringSketch().globally()).apply(HllCount.Extract.globally()); + * } + * + * Note: Currently HllCount does not work on FnAPI workers. See https://issues.apache.org/jira/browse/BEAM-7879";>Jira ticket [BEAM-7879]. + */ +@Experimental +public final class HllCount { + + public static final int MINIMUM_PRECISION = HyperLogLogPlusPlus.MINIMUM_PRECISION; Review comment: Ah, thanks for catching this. Now I have set up javadoc correctly such that the actual numbers can be seen, and I think it is clear now that those are for users. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 288235) Time Spent: 10h (was: 9h 50m) > A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation >
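For intuition about what the HLL++ sketches in the review above compute, here is a minimal raw HyperLogLog estimator in Python. It is a teaching toy only: it omits the "++" bias corrections and sparse representation, and its sketches are in no way compatible with ZetaSketch or BigQuery. The precision parameter `p` plays the same role as in `HllCount.Init.withPrecision(p)`: `2^p` registers, with relative error shrinking roughly as `1.04 / sqrt(2^p)`.

```python
import hashlib

def hll_estimate(items, p=12):
    # Raw HyperLogLog: 2^p registers, each holding the maximum
    # leading-zero rank observed for the hashes routed to it.
    m = 1 << p
    registers = [0] * m
    for item in items:
        # 64-bit hash of the item (sha1 truncated, for determinism).
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], 'big')
        idx = h >> (64 - p)                       # first p bits pick a register
        rest = h & ((1 << (64 - p)) - 1)          # remaining 64 - p bits
        rank = (64 - p) - rest.bit_length() + 1   # leading zeros + 1
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)              # standard bias constant for large m
    return alpha * m * m / sum(2.0 ** -r for r in registers)

estimate = hll_estimate(range(50000), p=12)       # ~50000, within a few percent
```

The memory cost is the fixed register array (`2^p` small integers) regardless of stream size, which is the "significantly less memory than exact computation, at the cost of a small error" tradeoff the javadoc describes.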
[jira] [Commented] (BEAM-7049) BeamUnionRel should work on multiple input
[ https://issues.apache.org/jira/browse/BEAM-7049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16899194#comment-16899194 ] sridhar Reddy commented on BEAM-7049: - [~amaliujia] Thanks for the update. I will start working on this. > BeamUnionRel should work on mutiple input > -- > > Key: BEAM-7049 > URL: https://issues.apache.org/jira/browse/BEAM-7049 > Project: Beam > Issue Type: Improvement > Components: dsl-sql >Reporter: Rui Wang >Assignee: sridhar Reddy >Priority: Major > > BeamUnionRel assumes inputs are two and rejects more. So `a UNION b UNION c` > will have to be created as UNION(a, UNION(b, c)) and have two shuffles. If > BeamUnionRel can handle multiple shuffles, we will have only one shuffle -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Reopened] (BEAM-7878) Refine Spark runner dependencies
[ https://issues.apache.org/jira/browse/BEAM-7878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ismaël Mejía reopened BEAM-7878: > Refine Spark runner dependencies > > > Key: BEAM-7878 > URL: https://issues.apache.org/jira/browse/BEAM-7878 > Project: Beam > Issue Type: Improvement > Components: runner-spark >Reporter: Ismaël Mejía >Assignee: Ismaël Mejía >Priority: Minor > Fix For: 2.16.0 > > Time Spent: 2h 10m > Remaining Estimate: 0h > > The Spark runner has more dependencies than it needs: > * The jackson_module_scala dependency is marked as compile but it is only > needed at runtime so it ends up leaking this module. > * The dropwizard, scala-library and commons-compress dependencies are > declared but unused. > * The Kafka dependency version is not aligned with the rest of Beam. > * The Kafka and zookeeper dependencies are used only for the tests but > currently are provided -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (BEAM-7878) Refine Spark runner dependencies
[ https://issues.apache.org/jira/browse/BEAM-7878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ismaël Mejía updated BEAM-7878: --- Status: Open (was: Triage Needed) > Refine Spark runner dependencies > > > Key: BEAM-7878 > URL: https://issues.apache.org/jira/browse/BEAM-7878 > Project: Beam > Issue Type: Improvement > Components: runner-spark >Reporter: Ismaël Mejía >Assignee: Ismaël Mejía >Priority: Minor > Fix For: 2.16.0 > > Time Spent: 2h 10m > Remaining Estimate: 0h > > The Spark runner has more dependencies than it needs: > * The jackson_module_scala dependency is marked as compile but it is only > needed at runtime so it ends up leaking this module. > * The dropwizard, scala-library and commons-compress dependencies are > declared but unused. > * The Kafka dependency version is not aligned with the rest of Beam. > * The Kafka and zookeeper dependencies are used only for the tests but > currently are provided -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Comment Edited] (BEAM-7049) BeamUnionRel should work on multiple input
[ https://issues.apache.org/jira/browse/BEAM-7049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16899191#comment-16899191 ] Rui Wang edited comment on BEAM-7049 at 8/2/19 8:29 PM: [~sridharG] A good start query is " SELECT 1 UNION 2 UNION 3". good entry pointers are: https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rule/BeamUnionRule.java#L43 https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamUnionRel.java#L66 https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamSetOperatorRelBase.java#L61 The expected resolution is you can use multiple tags to GoGBK multiple PCollections through the same shuffle: https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamSetOperatorRelBase.java#L85 And then you can improve binary implementation to make it handle more than two PCollection here: https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/transform/BeamSetOperatorsTransforms.java#L60 was (Author: amaliujia): A good start query is " SELECT 1 UNION 2 UNION 3". 
good entry pointers are: https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rule/BeamUnionRule.java#L43 https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamUnionRel.java#L66 https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamSetOperatorRelBase.java#L61 The expected resolution is you can use multiple tags to GoGBK multiple PCollections through the same shuffle: https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamSetOperatorRelBase.java#L85 And then you can improve binary implementation to make it handle more than two PCollection here: https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/transform/BeamSetOperatorsTransforms.java#L60 > BeamUnionRel should work on mutiple input > -- > > Key: BEAM-7049 > URL: https://issues.apache.org/jira/browse/BEAM-7049 > Project: Beam > Issue Type: Improvement > Components: dsl-sql >Reporter: Rui Wang >Assignee: sridhar Reddy >Priority: Major > > BeamUnionRel assumes inputs are two and rejects more. So `a UNION b UNION c` > will have to be created as UNION(a, UNION(b, c)) and have two shuffles. If > BeamUnionRel can handle multiple shuffles, we will have only one shuffle -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (BEAM-7049) BeamUnionRel should work on multiple input
[ https://issues.apache.org/jira/browse/BEAM-7049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16899191#comment-16899191 ] Rui Wang commented on BEAM-7049: A good start query is " SELECT 1 UNION 2 UNION 3". good entry pointers are: https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rule/BeamUnionRule.java#L43 https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamUnionRel.java#L66 https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamSetOperatorRelBase.java#L61 The expected resolution is you can use multiple tags to GoGBK multiple PCollections through the same shuffle: https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamSetOperatorRelBase.java#L85 And then you can improve binary implementation to make it handle more than two PCollection here: https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/transform/BeamSetOperatorsTransforms.java#L60 > BeamUnionRel should work on mutiple input > -- > > Key: BEAM-7049 > URL: https://issues.apache.org/jira/browse/BEAM-7049 > Project: Beam > Issue Type: Improvement > Components: dsl-sql >Reporter: Rui Wang >Assignee: sridhar Reddy >Priority: Major > > BeamUnionRel assumes inputs are two and rejects more. So `a UNION b UNION c` > will have to be created as UNION(a, UNION(b, c)) and have two shuffles. If > BeamUnionRel can handle multiple shuffles, we will have only one shuffle -- This message was sent by Atlassian JIRA (v7.6.14#76016)
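The n-ary union semantics requested in BEAM-7049 can be sketched in plain Python. This is a stand-in for the suggested CoGroupByKey-with-multiple-tags approach, not Beam SQL code: every input gets a tag, one grouping pass collects the tags per element, and each element is emitted once, replacing the chain of binary UNIONs (and their repeated shuffles) with a single "shuffle".

```python
from collections import defaultdict

def nary_union(*pcollections):
    # Tag each input and group by element in one pass, mimicking a single
    # CoGroupByKey over multiple tagged PCollections.
    grouped = defaultdict(set)              # element -> set of input tags
    for tag, pcoll in enumerate(pcollections):
        for element in pcoll:
            grouped[element].add(tag)
    # UNION (DISTINCT): keep each element once, regardless of how many
    # inputs contributed it.
    return sorted(grouped)
```

With this shape, `a UNION b UNION c` becomes `nary_union(a, b, c)` directly, instead of `nary_union(a, nary_union(b, c))` with two grouping passes.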
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=288231&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288231 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 02/Aug/19 20:27 Start Date: 02/Aug/19 20:27 Worklog Time Spent: 10m Work Description: robinyqiu commented on pull request #9144: [BEAM-7013] Integrating ZetaSketch's HLL++ algorithm with Beam URL: https://github.com/apache/beam/pull/9144#discussion_r310286718 ## File path: sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java ## @@ -200,6 +200,11 @@ * * } * + * Consider using the {@code HllCount.Init} transform in the {@code zetasketch} extension module if + * you need to create sketches with format compatible with Google Cloud BigQuery. For more details Review comment: Done! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 288231) Time Spent: 9h 50m (was: 9h 40m) > A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 9h 50m > Remaining Estimate: 0h > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Assigned] (BEAM-7049) BeamUnionRel should work on multiple input
[ https://issues.apache.org/jira/browse/BEAM-7049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rui Wang reassigned BEAM-7049: -- Assignee: sridhar Reddy > BeamUnionRel should work on mutiple input > -- > > Key: BEAM-7049 > URL: https://issues.apache.org/jira/browse/BEAM-7049 > Project: Beam > Issue Type: Improvement > Components: dsl-sql >Reporter: Rui Wang >Assignee: sridhar Reddy >Priority: Major > > BeamUnionRel assumes inputs are two and rejects more. So `a UNION b UNION c` > will have to be created as UNION(a, UNION(b, c)) and have two shuffles. If > BeamUnionRel can handle multiple shuffles, we will have only one shuffle -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7846) add test for BEAM-7689
[ https://issues.apache.org/jira/browse/BEAM-7846?focusedWorklogId=288226&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288226 ] ASF GitHub Bot logged work on BEAM-7846: Author: ASF GitHub Bot Created on: 02/Aug/19 20:18 Start Date: 02/Aug/19 20:18 Worklog Time Spent: 10m Work Description: ihji commented on issue #9228: [BEAM-7846] add test for BEAM-7689 URL: https://github.com/apache/beam/pull/9228#issuecomment-517831685 run java precommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 288226) Time Spent: 0.5h (was: 20m) > add test for BEAM-7689 > -- > > Key: BEAM-7846 > URL: https://issues.apache.org/jira/browse/BEAM-7846 > Project: Beam > Issue Type: Bug > Components: io-java-files, io-python-files >Reporter: Heejong Lee >Assignee: Heejong Lee >Priority: Major > Time Spent: 0.5h > Remaining Estimate: 0h > > add test for BEAM-7689 and also Python counterpart so make sure that it won't > come back :) -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7866) Python MongoDB IO performance and correctness issues
[ https://issues.apache.org/jira/browse/BEAM-7866?focusedWorklogId=288222&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288222 ] ASF GitHub Bot logged work on BEAM-7866: Author: ASF GitHub Bot Created on: 02/Aug/19 20:12 Start Date: 02/Aug/19 20:12 Worklog Time Spent: 10m Work Description: jkff commented on pull request #9233: [BEAM-7866] Fix python ReadFromMongoDB potential data loss issue URL: https://github.com/apache/beam/pull/9233#discussion_r310281527 ## File path: sdks/python/apache_beam/io/mongodbio_test.py ## @@ -30,38 +32,102 @@ from apache_beam.io.mongodbio import _BoundedMongoSource from apache_beam.io.mongodbio import _GenerateObjectIdFn from apache_beam.io.mongodbio import _MongoSink +from apache_beam.io.mongodbio import _ObjectIdRangeTracker from apache_beam.io.mongodbio import _WriteMongoFn from apache_beam.testing.test_pipeline import TestPipeline from apache_beam.testing.util import assert_that from apache_beam.testing.util import equal_to +class _MockMongoColl(object): + def __init__(self, docs): +self.docs = docs + + def _filter(self, filter): +match = [] +if not filter: + return self +start = filter['_id'].get('$gte') +end = filter['_id'].get('$lt') +for doc in self.docs: + if start and doc['_id'] < start: +continue + if end and doc['_id'] >= end: +continue + match.append(doc) +return match + + def find(self, filter=None, **kwargs): +return _MockMongoColl(self._filter(filter)) + + def sort(self, sort_items): +key, order = sort_items[0] +self.docs = sorted(self.docs, + key=lambda x: x[key], + reverse=(order != ASCENDING)) +return self + + def limit(self, num): +return _MockMongoColl(self.docs[0:num]) + + def count_documents(self, filter): +return len(self._filter(filter)) + + def __getitem__(self, item): +return self.docs[item] + + class MongoSourceTest(unittest.TestCase): - @mock.patch('apache_beam.io.mongodbio._BoundedMongoSource' - '._get_document_count') - 
@mock.patch('apache_beam.io.mongodbio._BoundedMongoSource' - '._get_avg_document_size') - def setUp(self, mock_size, mock_count): -mock_size.return_value = 10 -mock_count.return_value = 5 + @mock.patch('apache_beam.io.mongodbio.MongoClient') + def setUp(self, mock_client): +mock_client.return_value.__enter__.return_value.__getitem__ \ + .return_value.command.return_value = {'size': 5, 'avgSize': 1} +self._ids = [ +objectid.ObjectId.from_datetime( +datetime.datetime(year=2020, month=i + 1, day=i + 1)) +for i in range(5) +] +self._docs = [{'_id': self._ids[i], 'x': i} for i in range(len(self._ids))] + self.mongo_source = _BoundedMongoSource('mongodb://test', 'testdb', 'testcoll') - def test_estimate_size(self): -self.assertEqual(self.mongo_source.estimate_size(), 50) + def get_split(self, command, ns, min, max, maxChunkSize, **kwargs): Review comment: * What does this do? It's not quite clear from the method body * maxChunkSize should probably be called max_chunk_size This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 288222) Time Spent: 1h 50m (was: 1h 40m) > Python MongoDB IO performance and correctness issues > > > Key: BEAM-7866 > URL: https://issues.apache.org/jira/browse/BEAM-7866 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Eugene Kirpichov >Assignee: Yichi Zhang >Priority: Blocker > Fix For: 2.15.0 > > Time Spent: 1h 50m > Remaining Estimate: 0h > > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py > splits the query result by computing number of results in constructor, and > then in each reader re-executing the whole query and getting an index > sub-range of those results. 
> This is broken in several critical ways: > - The order of query results returned by find() is not necessarily > deterministic, so the idea of index ranges on it is meaningless: each shard > may basically get random, possibly overlapping subsets of the total results > - Even if you add order by `_id`, the database may be changing concurrently > to reading and splitting. E.g. if the database contained documents with ids > 10 20 30 40 50, and this was split into shards 0..2 and 3..5 (under the > assumption that these shards would contain respectively 10 20 30, and 40 50), > and then suppose shard 10 20 30 is read and then document 25 is inserted - > then the 3..5 shard will read 30 40 50, i.e. document 30 is duplicated and > document 25 is lost.
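The data-loss scenario described in the issue can be reproduced in a few lines. This is a purely illustrative simulation (plain integers stand in for documents), not Beam or pymongo code:

```python
# Documents identified by integer ids; the pre-fix mongodbio sorts the
# result set and splits it by *index*, deciding the ranges up front.
docs = [10, 20, 30, 40, 50]

# Split decided in advance: shard A gets indices 0..2, shard B gets 3..5.
shard_a = sorted(docs)[0:3]   # reads 10, 20, 30

# A concurrent insert happens after shard A reads but before shard B does.
docs.append(25)

shard_b = sorted(docs)[3:6]   # reads 30, 40, 50

# Document 30 is read twice and document 25 is never read by any shard.
print(shard_a, shard_b)
```

Splitting on `_id` values instead of result indices, as the PR does, makes each shard's range independent of concurrent inserts outside it.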
[jira] [Work logged] (BEAM-7866) Python MongoDB IO performance and correctness issues
[ https://issues.apache.org/jira/browse/BEAM-7866?focusedWorklogId=288221&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288221 ] ASF GitHub Bot logged work on BEAM-7866: Author: ASF GitHub Bot Created on: 02/Aug/19 20:12 Start Date: 02/Aug/19 20:12 Worklog Time Spent: 10m Work Description: jkff commented on pull request #9233: [BEAM-7866] Fix python ReadFromMongoDB potential data loss issue URL: https://github.com/apache/beam/pull/9233#discussion_r310280167 ## File path: sdks/python/apache_beam/io/mongodbio.py ## @@ -194,18 +225,72 @@ def display_data(self): res['mongo_client_spec'] = self.spec return res - def _get_avg_document_size(self): + def _get_split_keys(self, desired_chunk_size, start_pos, end_pos): +# if desired chunk size smaller than 1mb, use mongodb default split size of +# 1mb +if desired_chunk_size < 1: + desired_chunk_size = 1 +if start_pos >= end_pos: + # single document not splittable + return [] with MongoClient(self.uri, **self.spec) as client: - size = client[self.db].command('collstats', self.coll).get('avgObjSize') - if size is None or size <= 0: -raise ValueError( -'Collection %s not found or average doc size is ' -'incorrect', self.coll) - return size - - def _get_document_count(self): + name_space = '%s.%s' % (self.db, self.coll) + return (client[self.db].command( + 'splitVector', + name_space, + keyPattern={'_id': 1}, + min={'_id': start_pos}, + max={'_id': end_pos}, + maxChunkSize=desired_chunk_size)['splitKeys']) + + def _get_last_document_id(self): with MongoClient(self.uri, **self.spec) as client: - return max(client[self.db][self.coll].count_documents(self.filter), 0) + cursor = client[self.db][self.coll].find(filter={}, projection=[]).sort([ + ('_id', DESCENDING) + ]).limit(1) + try: +return cursor[0]['_id'] + except IndexError: +raise ValueError('Empty Mongodb collection') + + +class _ObjectIdRangeTracker(OrderedPositionRangeTracker): + """RangeTracker for tracking mongodb _id of bson ObjectId 
type.""" + + def _id_to_int(self, id): +# id object is bytes type with size of 12 +ints = struct.unpack('>III', id.binary) +return 2**64 * ints[0] + 2**32 * ints[1] + ints[2] + + def _int_to_id(self, numbers): Review comment: This kind of bignum arithmetic is extremely hard to get right, please make sure the unit tests cover enough corner cases. I would recommend to add some randomized property testing. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 288221) Time Spent: 1h 40m (was: 1.5h) > Python MongoDB IO performance and correctness issues > > > Key: BEAM-7866 > URL: https://issues.apache.org/jira/browse/BEAM-7866 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Eugene Kirpichov >Assignee: Yichi Zhang >Priority: Blocker > Fix For: 2.15.0 > > Time Spent: 1h 40m > Remaining Estimate: 0h > > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py > splits the query result by computing number of results in constructor, and > then in each reader re-executing the whole query and getting an index > sub-range of those results. > This is broken in several critical ways: > - The order of query results returned by find() is not necessarily > deterministic, so the idea of index ranges on it is meaningless: each shard > may basically get random, possibly overlapping subsets of the total results > - Even if you add order by `_id`, the database may be changing concurrently > to reading and splitting. E.g. 
if the database contained documents with ids > 10 20 30 40 50, and this was split into shards 0..2 and 3..5 (under the > assumption that these shards would contain respectively 10 20 30, and 40 50), > and then suppose shard 10 20 30 is read and then document 25 is inserted - > then the 3..5 shard will read 30 40 50, i.e. document 30 is duplicated and > document 25 is lost. > - Every shard re-executes the query and skips the first start_offset items, > which in total is quadratic complexity > - The query is first executed in the constructor in order to count results, > which 1) means the constructor can be super slow and 2) it won't work at all > if the database is unavailable at the time the pipeline is constructed (e.g. > if this is a template).
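The 96-bit packing in `_id_to_int` above, together with the randomized property testing the reviewer asks for, can be sketched as follows. This operates on raw 12-byte values rather than `bson.objectid.ObjectId` (whose `.binary` attribute holds those bytes), and the inverse helper's name is illustrative:

```python
import os
import struct

def id_to_int(id_bytes):
    # A 12-byte ObjectId unpacks into three big-endian 32-bit words,
    # combined into a single 96-bit integer (mirrors _id_to_int above).
    hi, mid, lo = struct.unpack('>III', id_bytes)
    return (hi << 64) | (mid << 32) | lo

def int_to_id(number):
    # Inverse mapping; values outside [0, 2**96) have no 12-byte encoding.
    if not 0 <= number < 2 ** 96:
        raise ValueError('not representable as a 12-byte ObjectId')
    return struct.pack('>III', (number >> 64) & 0xFFFFFFFF,
                       (number >> 32) & 0xFFFFFFFF, number & 0xFFFFFFFF)

# Randomized property test: round-tripping must be the identity, and the
# integer order must agree with the byte-wise order of the raw ids.
for _ in range(1000):
    a, b = os.urandom(12), os.urandom(12)
    assert int_to_id(id_to_int(a)) == a
    assert (id_to_int(a) < id_to_int(b)) == (a < b)
```

The order-agreement check matters because the range tracker interpolates split points in integer space and converts them back to ids.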
[jira] [Work logged] (BEAM-7866) Python MongoDB IO performance and correctness issues
[ https://issues.apache.org/jira/browse/BEAM-7866?focusedWorklogId=288218&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288218 ] ASF GitHub Bot logged work on BEAM-7866: Author: ASF GitHub Bot Created on: 02/Aug/19 20:12 Start Date: 02/Aug/19 20:12 Worklog Time Spent: 10m Work Description: jkff commented on pull request #9233: [BEAM-7866] Fix python ReadFromMongoDB potential data loss issue URL: https://github.com/apache/beam/pull/9233#discussion_r310278529 ## File path: sdks/python/apache_beam/io/mongodbio.py ## @@ -139,50 +143,77 @@ def __init__(self, self.filter = filter self.projection = projection self.spec = extra_client_params -self.doc_count = self._get_document_count() -self.avg_doc_size = self._get_avg_document_size() -self.client = None def estimate_size(self): -return self.avg_doc_size * self.doc_count +with MongoClient(self.uri, **self.spec) as client: + size = client[self.db].command('collstats', self.coll).get('size') + if size is None or size <= 0: +raise ValueError('Collection %s not found or total doc size is ' + 'incorrect' % self.coll) + return size def split(self, desired_bundle_size, start_position=None, stop_position=None): # use document cursor index as the start and stop positions if start_position is None: - start_position = 0 + epoch = datetime.datetime(1970, 1, 1) + start_position = objectid.ObjectId.from_datetime(epoch) if stop_position is None: - stop_position = self.doc_count + last_doc_id = self._get_last_document_id() + # add one sec to make sure the last document is not excluded + last_timestamp_plus_one_sec = (last_doc_id.generation_time + + datetime.timedelta(seconds=1)) + stop_position = objectid.ObjectId.from_datetime( + last_timestamp_plus_one_sec) -# get an estimate on how many documents should be included in a split batch -desired_bundle_count = desired_bundle_size // self.avg_doc_size +desired_bundle_size_in_mb = desired_bundle_size // 1024 // 1024 +split_keys = 
self._get_split_keys(desired_bundle_size_in_mb, start_position, + stop_position) bundle_start = start_position -while bundle_start < stop_position: - bundle_end = min(stop_position, bundle_start + desired_bundle_count) - yield iobase.SourceBundle(weight=bundle_end - bundle_start, +for split_key_id in split_keys: + bundle_end = min(stop_position, split_key_id) + if bundle_start is None and bundle_start < stop_position: +return + yield iobase.SourceBundle(weight=desired_bundle_size_in_mb, source=self, start_position=bundle_start, stop_position=bundle_end) bundle_start = bundle_end +# add range of last split_key to stop_position +if bundle_start < stop_position: + yield iobase.SourceBundle(weight=desired_bundle_size_in_mb, +source=self, +start_position=bundle_start, +stop_position=stop_position) def get_range_tracker(self, start_position, stop_position): if start_position is None: - start_position = 0 + epoch = datetime.datetime(1970, 1, 1) Review comment: When can start_position and stop_position be None in get_range_tracker? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 288218) Time Spent: 1.5h (was: 1h 20m) > Python MongoDB IO performance and correctness issues > > > Key: BEAM-7866 > URL: https://issues.apache.org/jira/browse/BEAM-7866 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Eugene Kirpichov >Assignee: Yichi Zhang >Priority: Blocker > Fix For: 2.15.0 > > Time Spent: 1.5h > Remaining Estimate: 0h > > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py > splits the query result by computing number of results in constructor, and > then in each reader re-executing the whole query and getting an index > sub-range of those results. 
> This is broken in several critical ways: > - The order of query results returned by find() is not necessarily > deterministic, so the idea of index ranges on it is meaningless: each shard > may basically get random, possibly overlapping subsets of the total results
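The epoch default in the diff above works because an ObjectId's first four bytes are its big-endian generation timestamp, so an id built from the Unix epoch with zero padding byte-compares below any id generated later. A stdlib-only sketch of that encoding (mirroring what `ObjectId.from_datetime` produces; the helper name is illustrative):

```python
import datetime
import struct

EPOCH = datetime.datetime(1970, 1, 1)

def objectid_lower_bound(dt):
    # ObjectId layout: a 4-byte big-endian Unix timestamp followed by 8
    # more bytes; zero padding yields the smallest id for that timestamp.
    seconds = int((dt - EPOCH).total_seconds())
    return struct.pack('>I', seconds) + b'\x00' * 8

start = objectid_lower_bound(EPOCH)   # the all-zero id
# The stop bound adds one second past a document's generation time, as in
# the diff, so the last document's id stays inside the closed-open range.
stop = objectid_lower_bound(
    datetime.datetime(2019, 8, 2) + datetime.timedelta(seconds=1))

assert start < stop   # byte-wise order matches generation-time order
```
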
[jira] [Work logged] (BEAM-7866) Python MongoDB IO performance and correctness issues
[ https://issues.apache.org/jira/browse/BEAM-7866?focusedWorklogId=288223&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288223 ] ASF GitHub Bot logged work on BEAM-7866: Author: ASF GitHub Bot Created on: 02/Aug/19 20:12 Start Date: 02/Aug/19 20:12 Worklog Time Spent: 10m Work Description: jkff commented on pull request #9233: [BEAM-7866] Fix python ReadFromMongoDB potential data loss issue URL: https://github.com/apache/beam/pull/9233#discussion_r310281025 ## File path: sdks/python/apache_beam/io/mongodbio_test.py ## @@ -30,38 +32,102 @@ from apache_beam.io.mongodbio import _BoundedMongoSource from apache_beam.io.mongodbio import _GenerateObjectIdFn from apache_beam.io.mongodbio import _MongoSink +from apache_beam.io.mongodbio import _ObjectIdRangeTracker from apache_beam.io.mongodbio import _WriteMongoFn from apache_beam.testing.test_pipeline import TestPipeline from apache_beam.testing.util import assert_that from apache_beam.testing.util import equal_to +class _MockMongoColl(object): + def __init__(self, docs): +self.docs = docs + + def _filter(self, filter): +match = [] +if not filter: + return self +start = filter['_id'].get('$gte') +end = filter['_id'].get('$lt') +for doc in self.docs: + if start and doc['_id'] < start: +continue + if end and doc['_id'] >= end: +continue + match.append(doc) +return match + + def find(self, filter=None, **kwargs): +return _MockMongoColl(self._filter(filter)) + + def sort(self, sort_items): +key, order = sort_items[0] +self.docs = sorted(self.docs, + key=lambda x: x[key], + reverse=(order != ASCENDING)) +return self + + def limit(self, num): +return _MockMongoColl(self.docs[0:num]) + + def count_documents(self, filter): +return len(self._filter(filter)) + + def __getitem__(self, item): +return self.docs[item] + + class MongoSourceTest(unittest.TestCase): - @mock.patch('apache_beam.io.mongodbio._BoundedMongoSource' - '._get_document_count') - 
@mock.patch('apache_beam.io.mongodbio._BoundedMongoSource' - '._get_avg_document_size') - def setUp(self, mock_size, mock_count): -mock_size.return_value = 10 -mock_count.return_value = 5 + @mock.patch('apache_beam.io.mongodbio.MongoClient') + def setUp(self, mock_client): +mock_client.return_value.__enter__.return_value.__getitem__ \ Review comment: It's hard to follow what this line is doing, could you add a comment? Ditto for other similar statements. I'm wondering if mocking is even the right approach here. Maybe create a fake instead? Seems like that could be much more readable and less fragile w.r.t. precise order of method calls. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 288223) Time Spent: 2h (was: 1h 50m) > Python MongoDB IO performance and correctness issues > > > Key: BEAM-7866 > URL: https://issues.apache.org/jira/browse/BEAM-7866 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Eugene Kirpichov >Assignee: Yichi Zhang >Priority: Blocker > Fix For: 2.15.0 > > Time Spent: 2h > Remaining Estimate: 0h > > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py > splits the query result by computing number of results in constructor, and > then in each reader re-executing the whole query and getting an index > sub-range of those results. > This is broken in several critical ways: > - The order of query results returned by find() is not necessarily > deterministic, so the idea of index ranges on it is meaningless: each shard > may basically get random, possibly overlapping subsets of the total results > - Even if you add order by `_id`, the database may be changing concurrently > to reading and splitting. E.g. 
if the database contained documents with ids > 10 20 30 40 50, and this was split into shards 0..2 and 3..5 (under the > assumption that these shards would contain respectively 10 20 30, and 40 50), > and then suppose shard 10 20 30 is read and then document 25 is inserted - > then the 3..5 shard will read 30 40 50, i.e. document 30 is duplicated and > document 25 is lost. > - Every shard re-executes the query and skips the first start_offset items, > which in total is quadratic
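For reference, the mock chain the reviewer finds hard to read decomposes step by step: each attribute corresponds to one operation the code under test performs. A commented, stdlib-only sketch (the `collstats` payload mirrors the `{'size': 5, 'avgSize': 1}` value in the quoted test):

```python
from unittest import mock

client_cls = mock.MagicMock()

# `with MongoClient(uri) as client:` first calls the class -> .return_value,
# then enters the context manager -> .__enter__.return_value (the client);
# `client[db]` is modeled by .__getitem__.return_value (the database).
db = client_cls.return_value.__enter__.return_value.__getitem__.return_value
db.command.return_value = {'size': 5, 'avgSize': 1}

# The code under test would then do something like:
with client_cls('mongodb://test') as client:
    stats = client['testdb'].command('collstats', 'testcoll')

print(stats)   # {'size': 5, 'avgSize': 1}
```

A hand-written fake, as the reviewer suggests, trades this chain for a small class whose behavior is visible at the call site and is not coupled to the exact sequence of attribute accesses.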
[jira] [Work logged] (BEAM-7866) Python MongoDB IO performance and correctness issues
[ https://issues.apache.org/jira/browse/BEAM-7866?focusedWorklogId=288217&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288217 ] ASF GitHub Bot logged work on BEAM-7866: Author: ASF GitHub Bot Created on: 02/Aug/19 20:12 Start Date: 02/Aug/19 20:12 Worklog Time Spent: 10m Work Description: jkff commented on pull request #9233: [BEAM-7866] Fix python ReadFromMongoDB potential data loss issue URL: https://github.com/apache/beam/pull/9233#discussion_r310278966 ## File path: sdks/python/apache_beam/io/mongodbio.py ## @@ -139,50 +143,77 @@ def __init__(self, self.filter = filter self.projection = projection self.spec = extra_client_params -self.doc_count = self._get_document_count() -self.avg_doc_size = self._get_avg_document_size() -self.client = None def estimate_size(self): -return self.avg_doc_size * self.doc_count +with MongoClient(self.uri, **self.spec) as client: + size = client[self.db].command('collstats', self.coll).get('size') + if size is None or size <= 0: +raise ValueError('Collection %s not found or total doc size is ' + 'incorrect' % self.coll) + return size def split(self, desired_bundle_size, start_position=None, stop_position=None): # use document cursor index as the start and stop positions if start_position is None: - start_position = 0 + epoch = datetime.datetime(1970, 1, 1) + start_position = objectid.ObjectId.from_datetime(epoch) if stop_position is None: - stop_position = self.doc_count + last_doc_id = self._get_last_document_id() + # add one sec to make sure the last document is not excluded + last_timestamp_plus_one_sec = (last_doc_id.generation_time + + datetime.timedelta(seconds=1)) + stop_position = objectid.ObjectId.from_datetime( + last_timestamp_plus_one_sec) -# get an estimate on how many documents should be included in a split batch -desired_bundle_count = desired_bundle_size // self.avg_doc_size +desired_bundle_size_in_mb = desired_bundle_size // 1024 // 1024 +split_keys = 
self._get_split_keys(desired_bundle_size_in_mb, start_position, + stop_position) bundle_start = start_position -while bundle_start < stop_position: - bundle_end = min(stop_position, bundle_start + desired_bundle_count) - yield iobase.SourceBundle(weight=bundle_end - bundle_start, +for split_key_id in split_keys: + bundle_end = min(stop_position, split_key_id) + if bundle_start is None and bundle_start < stop_position: +return + yield iobase.SourceBundle(weight=desired_bundle_size_in_mb, source=self, start_position=bundle_start, stop_position=bundle_end) bundle_start = bundle_end +# add range of last split_key to stop_position +if bundle_start < stop_position: + yield iobase.SourceBundle(weight=desired_bundle_size_in_mb, +source=self, +start_position=bundle_start, +stop_position=stop_position) def get_range_tracker(self, start_position, stop_position): if start_position is None: - start_position = 0 + epoch = datetime.datetime(1970, 1, 1) + start_position = objectid.ObjectId.from_datetime(epoch) if stop_position is None: - stop_position = self.doc_count -return OffsetRangeTracker(start_position, stop_position) + last_doc_id = self._get_last_document_id() + # add one sec to make sure the last document is not excluded + last_timestamp_plus_one_sec = (last_doc_id.generation_time + + datetime.timedelta(seconds=1)) + stop_position = objectid.ObjectId.from_datetime( + last_timestamp_plus_one_sec) +return _ObjectIdRangeTracker(start_position, stop_position) def read(self, range_tracker): with MongoClient(self.uri, **self.spec) as client: - # docs is a MongoDB Cursor - docs = client[self.db][self.coll].find( - filter=self.filter, projection=self.projection - )[range_tracker.start_position():range_tracker.stop_position()] - for index in range(range_tracker.start_position(), - range_tracker.stop_position()): -if not range_tracker.try_claim(index): + all_filters = self.filter + all_filters.update({ + '_id': { + '$gte': range_tracker.start_position(), Review comment: What if 
the original filter already has gte/lt conditions on _id? I think it's fine to prohibit it (at construction time); but you can also take it into account and infer the start
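One way to take an existing user filter into account, as the review suggests, is to intersect the range tracker's bounds with any `$gte`/`$lt` the filter already places on `_id`. A hypothetical helper (not part of the PR), shown with plain integers standing in for ObjectIds:

```python
def merge_id_bounds(user_filter, start_id, stop_id):
    # Tighten, rather than overwrite, existing _id bounds: the effective
    # range is the intersection of the user's range and the tracker's.
    merged = dict(user_filter)
    id_cond = dict(merged.get('_id', {}))
    id_cond['$gte'] = max(id_cond.get('$gte', start_id), start_id)
    id_cond['$lt'] = min(id_cond.get('$lt', stop_id), stop_id)
    merged['_id'] = id_cond
    return merged

# The user already restricts _id; the split range only narrows it further.
f = merge_id_bounds({'_id': {'$gte': 15}, 'x': 1}, 10, 40)
print(f)   # {'_id': {'$gte': 15, '$lt': 40}, 'x': 1}
```
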
[jira] [Work logged] (BEAM-7866) Python MongoDB IO performance and correctness issues
[ https://issues.apache.org/jira/browse/BEAM-7866?focusedWorklogId=288224&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288224 ] ASF GitHub Bot logged work on BEAM-7866: Author: ASF GitHub Bot Created on: 02/Aug/19 20:12 Start Date: 02/Aug/19 20:12 Worklog Time Spent: 10m Work Description: jkff commented on pull request #9233: [BEAM-7866] Fix python ReadFromMongoDB potential data loss issue URL: https://github.com/apache/beam/pull/9233#discussion_r310279232 ## File path: sdks/python/apache_beam/io/mongodbio.py ## @@ -194,18 +225,72 @@ def display_data(self): res['mongo_client_spec'] = self.spec return res - def _get_avg_document_size(self): + def _get_split_keys(self, desired_chunk_size, start_pos, end_pos): +# if desired chunk size smaller than 1mb, use mongodb default split size of +# 1mb +if desired_chunk_size < 1: Review comment: What are the units of desired_chunk_size? The comment implies mb? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 288224) Time Spent: 2h (was: 1h 50m) > Python MongoDB IO performance and correctness issues > > > Key: BEAM-7866 > URL: https://issues.apache.org/jira/browse/BEAM-7866 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Eugene Kirpichov >Assignee: Yichi Zhang >Priority: Blocker > Fix For: 2.15.0 > > Time Spent: 2h > Remaining Estimate: 0h > > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py > splits the query result by computing number of results in constructor, and > then in each reader re-executing the whole query and getting an index > sub-range of those results. 
> This is broken in several critical ways: > - The order of query results returned by find() is not necessarily > deterministic, so the idea of index ranges on it is meaningless: each shard > may basically get random, possibly overlapping subsets of the total results > - Even if you add order by `_id`, the database may be changing concurrently > to reading and splitting. E.g. if the database contained documents with ids > 10 20 30 40 50, and this was split into shards 0..2 and 3..5 (under the > assumption that these shards would contain respectively 10 20 30, and 40 50), > and then suppose shard 10 20 30 is read and then document 25 is inserted - > then the 3..5 shard will read 30 40 50, i.e. document 30 is duplicated and > document 25 is lost. > - Every shard re-executes the query and skips the first start_offset items, > which in total is quadratic complexity > - The query is first executed in the constructor in order to count results, > which 1) means the constructor can be super slow and 2) it won't work at all > if the database is unavailable at the time the pipeline is constructed (e.g. > if this is a template). > Unfortunately, none of these issues are caught by SourceTestUtils: this class > has extensive coverage with it, and the tests pass. This is because the tests > return the same results in the same order. I don't know how to catch this > automatically, and I don't know how to catch the performance issue > automatically, but these would all be important follow-up items after the > actual fix. > CC: [~chamikara] as reviewer. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
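To make the units in the snippet above explicit: `desired_bundle_size` arrives in bytes (hence the `// 1024 // 1024` in `split`), while `splitVector`'s `maxChunkSize` is expressed in MB, with 1 MB as MongoDB's smallest split size per the PR's comment. A sketch of that conversion (the helper name is illustrative):

```python
def desired_chunk_size_mb(desired_bundle_size_bytes):
    # splitVector takes maxChunkSize in megabytes; clamp anything below
    # MongoDB's 1 MB minimum split size up to 1.
    return max(desired_bundle_size_bytes // (1024 * 1024), 1)

print(desired_chunk_size_mb(64 * 1024 * 1024))   # 64
print(desired_chunk_size_mb(100))                # 1
```
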
[jira] [Work logged] (BEAM-7866) Python MongoDB IO performance and correctness issues
[ https://issues.apache.org/jira/browse/BEAM-7866?focusedWorklogId=288219&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288219 ] ASF GitHub Bot logged work on BEAM-7866: Author: ASF GitHub Bot Created on: 02/Aug/19 20:12 Start Date: 02/Aug/19 20:12 Worklog Time Spent: 10m Work Description: jkff commented on pull request #9233: [BEAM-7866] Fix python ReadFromMongoDB potential data loss issue URL: https://github.com/apache/beam/pull/9233#discussion_r310280815 ## File path: sdks/python/apache_beam/io/mongodbio_test.py ## @@ -30,38 +32,102 @@ from apache_beam.io.mongodbio import _BoundedMongoSource from apache_beam.io.mongodbio import _GenerateObjectIdFn from apache_beam.io.mongodbio import _MongoSink +from apache_beam.io.mongodbio import _ObjectIdRangeTracker from apache_beam.io.mongodbio import _WriteMongoFn from apache_beam.testing.test_pipeline import TestPipeline from apache_beam.testing.util import assert_that from apache_beam.testing.util import equal_to +class _MockMongoColl(object): + def __init__(self, docs): +self.docs = docs + + def _filter(self, filter): +match = [] +if not filter: + return self +start = filter['_id'].get('$gte') Review comment: Please assert that start, end are both not None This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 288219) > Python MongoDB IO performance and correctness issues > > > Key: BEAM-7866 > URL: https://issues.apache.org/jira/browse/BEAM-7866 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Eugene Kirpichov >Assignee: Yichi Zhang >Priority: Blocker > Fix For: 2.15.0 > > Time Spent: 1.5h > Remaining Estimate: 0h > > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py > splits the query result by computing number of results in constructor, and > then in each reader re-executing the whole query and getting an index > sub-range of those results. > This is broken in several critical ways: > - The order of query results returned by find() is not necessarily > deterministic, so the idea of index ranges on it is meaningless: each shard > may basically get random, possibly overlapping subsets of the total results > - Even if you add order by `_id`, the database may be changing concurrently > to reading and splitting. E.g. if the database contained documents with ids > 10 20 30 40 50, and this was split into shards 0..2 and 3..5 (under the > assumption that these shards would contain respectively 10 20 30, and 40 50), > and then suppose shard 10 20 30 is read and then document 25 is inserted - > then the 3..5 shard will read 30 40 50, i.e. document 30 is duplicated and > document 25 is lost. > - Every shard re-executes the query and skips the first start_offset items, > which in total is quadratic complexity > - The query is first executed in the constructor in order to count results, > which 1) means the constructor can be super slow and 2) it won't work at all > if the database is unavailable at the time the pipeline is constructed (e.g. > if this is a template). > Unfortunately, none of these issues are caught by SourceTestUtils: this class > has extensive coverage with it, and the tests pass. 
This is because the tests > return the same results in the same order. I don't know how to catch this > automatically, and I don't know how to catch the performance issue > automatically, but these would all be important follow-up items after the > actual fix. > CC: [~chamikara] as reviewer. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
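With the asserts the reviewer requests, the fake's filter reduces to a guarded comprehension. A minimal stand-alone version (integers in place of ObjectIds; the function name is illustrative):

```python
def filter_id_range(docs, mongo_filter):
    # Both bounds must be present: the source always issues a closed-open
    # [$gte, $lt) range on _id, so a missing bound indicates a test bug.
    start = mongo_filter['_id'].get('$gte')
    end = mongo_filter['_id'].get('$lt')
    assert start is not None and end is not None
    return [doc for doc in docs if start <= doc['_id'] < end]

docs = [{'_id': i} for i in range(5)]
print(filter_id_range(docs, {'_id': {'$gte': 1, '$lt': 3}}))
```
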
[jira] [Work logged] (BEAM-7866) Python MongoDB IO performance and correctness issues
[ https://issues.apache.org/jira/browse/BEAM-7866?focusedWorklogId=288220&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288220 ] ASF GitHub Bot logged work on BEAM-7866: Author: ASF GitHub Bot Created on: 02/Aug/19 20:12 Start Date: 02/Aug/19 20:12 Worklog Time Spent: 10m Work Description: jkff commented on pull request #9233: [BEAM-7866] Fix python ReadFromMongoDB potential data loss issue URL: https://github.com/apache/beam/pull/9233#discussion_r310264499 ## File path: sdks/python/apache_beam/io/mongodbio.py ## @@ -139,50 +143,77 @@ def __init__(self, self.filter = filter self.projection = projection self.spec = extra_client_params -self.doc_count = self._get_document_count() -self.avg_doc_size = self._get_avg_document_size() -self.client = None def estimate_size(self): -return self.avg_doc_size * self.doc_count +with MongoClient(self.uri, **self.spec) as client: + size = client[self.db].command('collstats', self.coll).get('size') + if size is None or size <= 0: +raise ValueError('Collection %s not found or total doc size is ' + 'incorrect' % self.coll) + return size def split(self, desired_bundle_size, start_position=None, stop_position=None): # use document cursor index as the start and stop positions if start_position is None: - start_position = 0 + epoch = datetime.datetime(1970, 1, 1) Review comment: * Please add a comment that this is an object id smaller than any possible actual object id * Does Mongo actually guarantee this property? Maybe it makes sense to explicitly query for the smallest and largest object id in the database, if it can be done quickly? * Using start and end position that are far removed from the actual ids of present objects risks having most of the splits be empty, or at least having a couple of splits at the edges that are difficult to liquid-shard. This is a known issue with e.g. bigtable and shuffle sources in Dataflow. In this sense too, querying for smallest and largest id would be better. 
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 288220) > Python MongoDB IO performance and correctness issues > > > Key: BEAM-7866 > URL: https://issues.apache.org/jira/browse/BEAM-7866 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Eugene Kirpichov >Assignee: Yichi Zhang >Priority: Blocker > Fix For: 2.15.0 > > Time Spent: 1.5h > Remaining Estimate: 0h > > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py > splits the query result by computing number of results in constructor, and > then in each reader re-executing the whole query and getting an index > sub-range of those results. > This is broken in several critical ways: > - The order of query results returned by find() is not necessarily > deterministic, so the idea of index ranges on it is meaningless: each shard > may basically get random, possibly overlapping subsets of the total results > - Even if you add order by `_id`, the database may be changing concurrently > to reading and splitting. E.g. if the database contained documents with ids > 10 20 30 40 50, and this was split into shards 0..2 and 3..5 (under the > assumption that these shards would contain respectively 10 20 30, and 40 50), > and then suppose shard 10 20 30 is read and then document 25 is inserted - > then the 3..5 shard will read 30 40 50, i.e. document 30 is duplicated and > document 25 is lost. 
> - Every shard re-executes the query and skips the first start_offset items, > which in total is quadratic complexity > - The query is first executed in the constructor in order to count results, > which 1) means the constructor can be super slow and 2) it won't work at all > if the database is unavailable at the time the pipeline is constructed (e.g. > if this is a template). > Unfortunately, none of these issues are caught by SourceTestUtils: this class > has extensive coverage with it, and the tests pass. This is because the tests > return the same results in the same order. I don't know how to catch this > automatically, and I don't know how to catch the performance issue > automatically, but these would all be important follow-up items after the > actual fix. > CC: [~chamikara] as reviewer.
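Querying the actual smallest and largest `_id`, as the review suggests, is two indexed sort-and-limit lookups mirroring the `_get_last_document_id` pattern already in the PR. A sketch against a tiny in-memory cursor (the fake and the helper are illustrative; a real implementation would use pymongo's `find().sort(...).limit(1)` with `ASCENDING`/`DESCENDING`):

```python
class FakeCursor(object):
    # Minimal stand-in for a pymongo cursor: sort/limit chain plus indexing.
    def __init__(self, docs):
        self.docs = list(docs)

    def sort(self, spec):
        key, order = spec[0]           # pymongo: 1 ascending, -1 descending
        self.docs.sort(key=lambda d: d[key], reverse=(order == -1))
        return self

    def limit(self, n):
        self.docs = self.docs[:n]
        return self

    def __getitem__(self, i):
        return self.docs[i]

def id_bounds(find):
    # Two cheap lookups on the _id index instead of assuming the epoch is
    # a valid lower bound; keeps splits near the range edges from being
    # mostly empty.
    smallest = find().sort([('_id', 1)]).limit(1)[0]['_id']
    largest = find().sort([('_id', -1)]).limit(1)[0]['_id']
    return smallest, largest

docs = [{'_id': i} for i in (7, 3, 9, 5)]
print(id_bounds(lambda: FakeCursor(docs)))   # (3, 9)
```
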
[jira] [Work logged] (BEAM-7060) Design Py3-compatible typehints annotation support in Beam 3.
[ https://issues.apache.org/jira/browse/BEAM-7060?focusedWorklogId=288205&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288205 ] ASF GitHub Bot logged work on BEAM-7060: Author: ASF GitHub Bot Created on: 02/Aug/19 19:57 Start Date: 02/Aug/19 19:57 Worklog Time Spent: 10m Work Description: udim commented on issue #9223: [BEAM-7060] Introduce Python3-only test modules URL: https://github.com/apache/beam/pull/9223#issuecomment-517826104 run python precommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 288205) Time Spent: 7.5h (was: 7h 20m) > Design Py3-compatible typehints annotation support in Beam 3. > - > > Key: BEAM-7060 > URL: https://issues.apache.org/jira/browse/BEAM-7060 > Project: Beam > Issue Type: Sub-task > Components: sdk-py-core >Reporter: Valentyn Tymofieiev >Assignee: Udi Meiri >Priority: Major > Time Spent: 7.5h > Remaining Estimate: 0h > > The existing [typehints implementation in > Beam|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/typehints/] heavily relies on internal details of the CPython implementation, and some of > the assumptions of this implementation broke as of Python 3.6, see for > example: https://issues.apache.org/jira/browse/BEAM-6877, which makes > typehints support unusable on Python 3.6 as of now. The [Python 3 Kanban > Board|https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=245&view=detail] > lists several specific typehints-related breakages, prefixed with "TypeHints > Py3 Error". > We need to decide whether to: > - Deprecate the in-house typehints implementation. > - Continue to support the in-house implementation, which at this point is stale > code and has other known issues. 
> - Attempt to use some off-the-shelf libraries for supporting > type-annotations, like Pytype, Mypy, PyAnnotate. > WRT to this decision we also need to plan on immediate next steps to unblock > adoption of Beam for Python 3.6+ users. One potential option may be to have > Beam SDK ignore any typehint annotations on Py 3.6+. > cc: [~udim], [~altay], [~robertwb]. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
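The off-the-shelf route mentioned in the last option is worth illustrating: standard-library annotation introspection goes through the documented `typing` API rather than CPython internals. A minimal sketch, assuming Python 3.7+; the function is a toy, not Beam code:

```python
from typing import Dict, get_type_hints

def count_words(lines: str) -> Dict[str, int]:
    """Toy DoFn-style callable annotated with standard Python 3 typing."""
    counts: Dict[str, int] = {}
    for word in lines.split():
        counts[word] = counts.get(word, 0) + 1
    return counts

# get_type_hints resolves annotations via the documented typing API, which is
# the kind of mechanism tools like Mypy and Pytype build on, rather than the
# CPython internals whose behavior changed in 3.6+.
hints = get_type_hints(count_words)
assert hints["lines"] is str
assert hints["return"].__origin__ is dict
```

Tools built on this API keep working across CPython minor versions, which is the main argument for the off-the-shelf option.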
[jira] [Resolved] (BEAM-7724) Codegen should cast(null) to a type to match exact function signature
[ https://issues.apache.org/jira/browse/BEAM-7724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sridhar Reddy resolved BEAM-7724. - Resolution: Won't Fix Fix Version/s: Not applicable Ideally, this issue should be addressed within the Calcite repo. It will not be fixed in the Beam repo at this time. > Codegen should cast(null) to a type to match exact function signature > > > Key: BEAM-7724 > URL: https://issues.apache.org/jira/browse/BEAM-7724 > Project: Beam > Issue Type: Improvement > Components: dsl-sql > Reporter: Rui Wang > Assignee: sridhar Reddy > Priority: Major > Fix For: Not applicable > > > If there are two function signatures for the same function name and the input parameter is null, Janino throws an exception due to the ambiguity: > A(String) > A(Integer) > Janino does not know how to match A(null). -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (BEAM-7724) Codegen should cast(null) to a type to match exact function signature
[ https://issues.apache.org/jira/browse/BEAM-7724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16899162#comment-16899162 ] sridhar Reddy commented on BEAM-7724: - As per the above discussion, I will be closing this issue. > Codegen should cast(null) to a type to match exact function signature > > > Key: BEAM-7724 > URL: https://issues.apache.org/jira/browse/BEAM-7724 > Project: Beam > Issue Type: Improvement > Components: dsl-sql > Reporter: Rui Wang > Assignee: sridhar Reddy > Priority: Major > > > If there are two function signatures for the same function name and the input parameter is null, Janino throws an exception due to the ambiguity: > A(String) > A(Integer) > Janino does not know how to match A(null). -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7528) Save correctly Python Load Tests metrics according to its namespace
[ https://issues.apache.org/jira/browse/BEAM-7528?focusedWorklogId=288186&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288186 ] ASF GitHub Bot logged work on BEAM-7528: Author: ASF GitHub Bot Created on: 02/Aug/19 19:11 Start Date: 02/Aug/19 19:11 Worklog Time Spent: 10m Work Description: pabloem commented on pull request #8941: [BEAM-7528] Save load test metrics according to distribution name URL: https://github.com/apache/beam/pull/8941#discussion_r310259723 ## File path: sdks/python/apache_beam/testing/load_tests/load_test_metrics_utils.py ## @@ -138,8 +143,25 @@ def as_dict(self): class CounterMetric(Metric): def __init__(self, counter_dict, submit_timestamp, metric_id): super(CounterMetric, self).__init__(submit_timestamp, metric_id) -self.value = counter_dict.committed self.label = str(counter_dict.key.metric.name) +self.value = counter_dict.committed + + +class DistributionMetrics(Metric): + def __init__(self, dists, submit_timestamp, metric_id, key): Review comment: Distributions are unique by name, namespace and step. That means that multiple steps can collect separate metrics with the same name. We could use the namespace to separate them (e.g. "BeamPerfNamespace" or whatever). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 288186) Time Spent: 6.5h (was: 6h 20m) > Save correctly Python Load Tests metrics according to its namespace > > > Key: BEAM-7528 > URL: https://issues.apache.org/jira/browse/BEAM-7528 > Project: Beam > Issue Type: Bug > Components: testing > Reporter: Kasia Kucharczyk > Assignee: Kasia Kucharczyk > Priority: Major > Time Spent: 6.5h > Remaining Estimate: 0h > > The load test framework considers every distribution metric defined in a pipeline to be a `runtime` metric (which is defined by the load test framework), while only the `runtime` distribution metric itself should be treated as runtime. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
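The uniqueness rule from the review comment above (name, namespace, and step together identify a distribution) can be sketched with a plain dictionary; the registry and all names below are hypothetical, not the load-test framework's actual code:

```python
from collections import defaultdict

# Hypothetical metric registry keyed the way the review comment describes:
# a distribution is identified by (namespace, step, name), so the framework's
# own 'runtime' distribution cannot be confused with a user-defined one.
metrics = defaultdict(list)

def record(namespace, step, name, value):
    metrics[(namespace, step, name)].append(value)

record("loadtest", "Measure time", "runtime", 12.5)
record("user_ns", "ParDo(my_fn)", "runtime", 3.0)  # same name, other namespace

# Only the framework-namespaced entry is treated as the runtime metric.
assert metrics[("loadtest", "Measure time", "runtime")] == [12.5]
```

Keying on the full triple is what prevents every same-named distribution from being lumped into `runtime`.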
[jira] [Work logged] (BEAM-7528) Save correctly Python Load Tests metrics according to its namespace
[ https://issues.apache.org/jira/browse/BEAM-7528?focusedWorklogId=288185&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288185 ] ASF GitHub Bot logged work on BEAM-7528: Author: ASF GitHub Bot Created on: 02/Aug/19 19:11 Start Date: 02/Aug/19 19:11 Worklog Time Spent: 10m Work Description: pabloem commented on pull request #8941: [BEAM-7528] Save load test metrics according to distribution name URL: https://github.com/apache/beam/pull/8941#discussion_r310263173 ## File path: sdks/python/apache_beam/testing/load_tests/load_test_metrics_utils.py ## @@ -138,8 +143,25 @@ def as_dict(self): class CounterMetric(Metric): def __init__(self, counter_dict, submit_timestamp, metric_id): super(CounterMetric, self).__init__(submit_timestamp, metric_id) -self.value = counter_dict.committed self.label = str(counter_dict.key.metric.name) +self.value = counter_dict.committed + + +class DistributionMetrics(Metric): Review comment: I am okay with either 1 or 2. Do we have full metric names (step+namespace+metric)? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 288185) Time Spent: 6h 20m (was: 6h 10m) > Save correctly Python Load Tests metrics according to its namespace > > > Key: BEAM-7528 > URL: https://issues.apache.org/jira/browse/BEAM-7528 > Project: Beam > Issue Type: Bug > Components: testing > Reporter: Kasia Kucharczyk > Assignee: Kasia Kucharczyk > Priority: Major > Time Spent: 6h 20m > Remaining Estimate: 0h -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Comment Edited] (BEAM-3645) Support multi-process execution on the FnApiRunner
[ https://issues.apache.org/jira/browse/BEAM-3645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16896575#comment-16896575 ] Hannah Jiang edited comment on BEAM-3645 at 8/2/19 7:03 PM: Direct runner can now process map tasks across multiple workers. Depending on the running environment, these workers run in multithreading or multiprocessing mode. *How to run with multiprocessing mode*: {code:python} # using subprocess runner p = beam.Pipeline(options=pipeline_options, runner=fn_api_runner.FnApiRunner( default_environment=beam_runner_api_pb2.Environment( urn=python_urns.SUBPROCESS_SDK, payload=b'%s -m apache_beam.runners.worker.sdk_worker_main' % sys.executable.encode('ascii')))) {code} *How to run with multithreading mode:* {code:python} # using in memory embedded runner p = beam.Pipeline(options=pipeline_options) {code} {code:python} # using embedded grpc runner p = beam.Pipeline(options=pipeline_options, runner=fn_api_runner.FnApiRunner( default_environment=beam_runner_api_pb2.Environment( urn=python_urns.EMBEDDED_PYTHON_GRPC, payload=b'1'))) # payload is the number of threads of each worker.{code} The *--direct_num_workers* option controls the number of workers. The default value is 1. {code:python} # an example to pass it from the CLI. python wordcount.py --input xx --output xx --direct_num_workers 2 # an example to set it with PipelineOptions. from apache_beam.options.pipeline_options import PipelineOptions pipeline_options = PipelineOptions(['--direct_num_workers', '2']) # an example adding it to existing pipeline options. from apache_beam.options.pipeline_options import DirectOptions pipeline_options = xxx pipeline_options.view_as(DirectOptions).direct_num_workers = 2{code} was (Author: hannahjiang): (an earlier revision of the same comment) > Support multi-process execution on the FnApiRunner > > > Key: BEAM-3645 > URL: https://issues.apache.org/jira/browse/BEAM-3645 > Project: Beam > Issue Type: Improvement > Components: sdk-py-core > Affects Versions: 2.2.0, 2.3.0 > Reporter: Charles Chen > Assignee: Hannah Jiang > Priority: Major > Fix For: 2.15.0 > > Time Spent: 35h 20m > Remaining Estimate: 0h > > https://issues.apache.org/jira/browse/BEAM-3644 gave us a 15x performance gain over the previous DirectRunner. We can do even better in multi-core environments by supporting multi-process execution in the FnApiRunner, to scale past Python GIL limitations. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
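As a rough illustration of how a *--direct_num_workers*-style flag behaves, the sketch below uses plain `argparse`; it is not Beam's actual DirectOptions plumbing, just the parsing pattern with the same default of 1:

```python
import argparse

# Minimal stand-in for parsing a --direct_num_workers flag. The real option
# lives in apache_beam's pipeline options machinery, not shown here.
parser = argparse.ArgumentParser()
parser.add_argument("--direct_num_workers", type=int, default=1,
                    help="number of local workers (threads or processes)")

# parse_known_args tolerates pipeline-specific flags it does not know about,
# mirroring how SDK option parsers pass unknown args through.
opts, _ = parser.parse_known_args(
    ["--input", "xx", "--output", "xx", "--direct_num_workers", "2"])
assert opts.direct_num_workers == 2

# When the flag is omitted, the default of 1 applies.
defaults, _ = parser.parse_known_args([])
assert defaults.direct_num_workers == 1
```

The default of 1 keeps single-worker behavior unless a user opts in to parallelism.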
[jira] [Work logged] (BEAM-7060) Design Py3-compatible typehints annotation support in Beam 3.
[ https://issues.apache.org/jira/browse/BEAM-7060?focusedWorklogId=288173&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288173 ] ASF GitHub Bot logged work on BEAM-7060: Author: ASF GitHub Bot Created on: 02/Aug/19 18:55 Start Date: 02/Aug/19 18:55 Worklog Time Spent: 10m Work Description: tvalentyn commented on issue #9223: [BEAM-7060] Introduce Python3-only test modules URL: https://github.com/apache/beam/pull/9223#issuecomment-517809174 A similar change was reverted because it broke postcommits: https://github.com/apache/beam/pull/8505#issuecomment-498441270 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 288173) Time Spent: 7h 20m (was: 7h 10m) > Design Py3-compatible typehints annotation support in Beam 3. > > > Key: BEAM-7060 > URL: https://issues.apache.org/jira/browse/BEAM-7060 > Project: Beam > Issue Type: Sub-task > Components: sdk-py-core > Reporter: Valentyn Tymofieiev > Assignee: Udi Meiri > Priority: Major > Time Spent: 7h 20m > Remaining Estimate: 0h -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7060) Design Py3-compatible typehints annotation support in Beam 3.
[ https://issues.apache.org/jira/browse/BEAM-7060?focusedWorklogId=288171&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288171 ] ASF GitHub Bot logged work on BEAM-7060: Author: ASF GitHub Bot Created on: 02/Aug/19 18:53 Start Date: 02/Aug/19 18:53 Worklog Time Spent: 10m Work Description: udim commented on issue #9223: [BEAM-7060] Introduce Python3-only test modules URL: https://github.com/apache/beam/pull/9223#issuecomment-517808582 @tvalentyn I believe pre-commits cover all the changes in this PR. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 288171) Time Spent: 7h 10m (was: 7h) > Design Py3-compatible typehints annotation support in Beam 3. > > > Key: BEAM-7060 > URL: https://issues.apache.org/jira/browse/BEAM-7060 > Project: Beam > Issue Type: Sub-task > Components: sdk-py-core > Reporter: Valentyn Tymofieiev > Assignee: Udi Meiri > Priority: Major > Time Spent: 7h 10m > Remaining Estimate: 0h -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7476) Datastore write failures with "[Errno 32] Broken pipe"
[ https://issues.apache.org/jira/browse/BEAM-7476?focusedWorklogId=288170&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288170 ] ASF GitHub Bot logged work on BEAM-7476: Author: ASF GitHub Bot Created on: 02/Aug/19 18:51 Start Date: 02/Aug/19 18:51 Worklog Time Spent: 10m Work Description: udim commented on pull request #8346: [BEAM-7476] Retry Datastore writes on [Errno 32] Broken pipe URL: https://github.com/apache/beam/pull/8346 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 288170) Time Spent: 1.5h (was: 1h 20m) > Datastore write failures with "[Errno 32] Broken pipe" > > > Key: BEAM-7476 > URL: https://issues.apache.org/jira/browse/BEAM-7476 > Project: Beam > Issue Type: Bug > Components: io-python-gcp > Affects Versions: 2.11.0 > Environment: dataflow python 2.7 > Reporter: Dmytro Sadovnychyi > Assignee: Dmytro Sadovnychyi > Priority: Minor > Labels: pull-request-available > Time Spent: 1.5h > Remaining Estimate: 0h > > We are getting lots of Broken pipe errors, and it's only a matter of luck for a write to succeed. This has been happening for months. 
> Partial stack trace: > File > "/usr/local/lib/python2.7/dist-packages/apache_beam/io/gcp/datastore/v1/helper.py", > line 225, in commit > response = datastore.commit(request) > File > "/usr/local/lib/python2.7/dist-packages/googledatastore/connection.py", line > 140, in commit > datastore_pb2.CommitResponse) > File > "/usr/local/lib/python2.7/dist-packages/googledatastore/connection.py", line > 199, in _call_method > method='POST', body=payload, headers=headers) > File "/usr/local/lib/python2.7/dist-packages/oauth2client/transport.py", > line 169, in new_request > redirections, connection_type) > File "/usr/local/lib/python2.7/dist-packages/httplib2/__init__.py", line > 1609, in request > (response, content) = self._request(conn, authority, uri, request_uri, > method, body, headers, redirections, cachekey) > File "/usr/local/lib/python2.7/dist-packages/httplib2/__init__.py", line > 1351, in _request > (response, content) = self._conn_request(conn, request_uri, method, body, > headers) > File "/usr/local/lib/python2.7/dist-packages/httplib2/__init__.py", line > 1273, in _conn_request > conn.request(method, request_uri, body, headers) > File "/usr/lib/python2.7/httplib.py", line 1042, in request > self._send_request(method, url, body, headers) > File "/usr/lib/python2.7/httplib.py", line 1082, in _send_request > self.endheaders(body) > File "/usr/lib/python2.7/httplib.py", line 1038, in endheaders > self._send_output(message_body) > File "/usr/lib/python2.7/httplib.py", line 882, in _send_output > self.send(msg) > File "/usr/lib/python2.7/httplib.py", line 858, in send > self.sock.sendall(data) > File "/usr/lib/python2.7/ssl.py", line 753, in sendall > v = self.send(data[count:]) > File "/usr/lib/python2.7/ssl.py", line 719, in send > v = self._sslobj.write(data) > RuntimeError: error: [Errno 32] Broken pipe [while running 'Groups to > datastore/Write Mutation to Datastore'] > Workaround: https://github.com/apache/beam/pull/8346 -- This message was sent by 
Atlassian JIRA (v7.6.14#76016)
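The workaround PR above retries the write; its actual mechanism may differ, but the general shape of retrying a commit on EPIPE with exponential backoff can be sketched as follows (all names here are hypothetical, not the PR's code):

```python
import errno
import time

def with_retries(fn, attempts=3, base_delay=0.01):
    """Retry fn on EPIPE ('[Errno 32] Broken pipe'), the failure mode above."""
    for attempt in range(attempts):
        try:
            return fn()
        except OSError as e:
            # Re-raise non-EPIPE errors and the final failed attempt.
            if e.errno != errno.EPIPE or attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff

# Simulated commit that breaks the pipe twice before succeeding.
calls = []

def flaky_commit():
    calls.append(1)
    if len(calls) < 3:
        raise OSError(errno.EPIPE, "Broken pipe")
    return "committed"

assert with_retries(flaky_commit) == "committed"
assert len(calls) == 3
```

Retrying only EPIPE (rather than any exception) avoids masking genuine request errors.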
[jira] [Commented] (BEAM-7884) Run pylint in Python 3
[ https://issues.apache.org/jira/browse/BEAM-7884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16899137#comment-16899137 ] Udi Meiri commented on BEAM-7884: - Optional step 0: Since files ending with py3.py (introduced in https://github.com/apache/beam/pull/9223) are currently excluded from run_pylint.sh checks (which only run on Python 2), the highest priority would be to test those files. > Run pylint in Python 3 > > > Key: BEAM-7884 > URL: https://issues.apache.org/jira/browse/BEAM-7884 > Project: Beam > Issue Type: Sub-task > Components: sdk-py-core > Reporter: Udi Meiri > Priority: Major > > Currently run_pylint.sh only runs in the tox environment py27-lint. > py35-lint uses another script named run_mini_py3lint.sh. > 1. Make run_pylint.sh run and pass for py35-lint. Might have to add some ignores to .pylintrc (such as useless-object-inheritance), and might have to rename some deprecated functions (such as s/assertRaisesRegexp/assertRaisesRegex/). > 2. Make sure all tests in run_mini_py3lint.sh are in run_pylint.sh and remove run_mini_py3lint.sh. > 3. (optional) Remove py27-lint? -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7577) Allow the use of ValueProviders in datastore.v1new.datastoreio.ReadFromDatastore query
[ https://issues.apache.org/jira/browse/BEAM-7577?focusedWorklogId=288165&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288165 ] ASF GitHub Bot logged work on BEAM-7577: Author: ASF GitHub Bot Created on: 02/Aug/19 18:45 Start Date: 02/Aug/19 18:45 Worklog Time Spent: 10m Work Description: udim commented on pull request #8950: [BEAM-7577] Allow ValueProviders in Datastore Query filters URL: https://github.com/apache/beam/pull/8950 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 288165) Time Spent: 6h 40m (was: 6.5h) > Allow the use of ValueProviders in > datastore.v1new.datastoreio.ReadFromDatastore query > -- > > Key: BEAM-7577 > URL: https://issues.apache.org/jira/browse/BEAM-7577 > Project: Beam > Issue Type: New Feature > Components: io-python-gcp >Affects Versions: 2.13.0 >Reporter: EDjur >Assignee: EDjur >Priority: Minor > Time Spent: 6h 40m > Remaining Estimate: 0h > > The current implementation of ReadFromDatastore does not support specifying > the query parameter at runtime. This could potentially be fixed through the > usage of a ValueProvider to specify and build the Datastore query. > Allowing specifying the query at runtime makes it easier to use dynamic > queries in Dataflow templates. Currently, there is no way to have a Dataflow > template that includes a dynamic query (such as filtering by a timestamp or > similar). -- This message was sent by Atlassian JIRA (v7.6.14#76016)
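The deferred-evaluation pattern behind this feature request can be sketched without Beam at all. The class below is an illustrative stand-in for a ValueProvider, not Beam's actual API; the point is that the query filter is bound at runtime, after the template is built:

```python
class DeferredValue:
    """Illustrative stand-in for a ValueProvider-style deferred value."""

    def __init__(self):
        self._value = None
        self._bound = False

    def set(self, value):
        self._value = value
        self._bound = True

    def get(self):
        if not self._bound:
            raise RuntimeError("value requested before runtime binding")
        return self._value

# Hypothetical runtime parameter for a Datastore-style query filter.
timestamp_filter = DeferredValue()

def build_query():
    # .get() runs when the pipeline executes, not when the template is built,
    # which is what makes dynamic queries in templates possible.
    return "SELECT * FROM Kind WHERE ts > %s" % timestamp_filter.get()

timestamp_filter.set("2019-08-02")
assert build_query().endswith("2019-08-02")
```

Calling `build_query()` before `set()` raises, which mirrors why template construction must avoid touching runtime-only values.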
[jira] [Work logged] (BEAM-7060) Design Py3-compatible typehints annotation support in Beam 3.
[ https://issues.apache.org/jira/browse/BEAM-7060?focusedWorklogId=288163&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288163 ] ASF GitHub Bot logged work on BEAM-7060: Author: ASF GitHub Bot Created on: 02/Aug/19 18:43 Start Date: 02/Aug/19 18:43 Worklog Time Spent: 10m Work Description: udim commented on pull request #9223: [BEAM-7060] Introduce Python3-only test modules URL: https://github.com/apache/beam/pull/9223#discussion_r310254106 ## File path: sdks/python/scripts/run_pylint.sh ## @@ -60,23 +60,34 @@ EXCLUDED_GENERATED_FILES=( apache_beam/portability/api/*pb2*.py ) +PYTHON_MAJOR=$(python -c 'import sys; print(sys.version_info[0])') +if [[ "${PYTHON_MAJOR}" == 2 ]]; then + EXCLUDED_PY3_FILES=$(find ${MODULE} | grep 'py3\.py$') + echo -e "Excluding Py3 files:\n${EXCLUDED_PY3_FILES}" +else + EXCLUDED_PY3_FILES="" +fi + FILES_TO_IGNORE="" -for file in "${EXCLUDED_GENERATED_FILES[@]}"; do +for file in "${EXCLUDED_GENERATED_FILES[@]}" ${EXCLUDED_PY3_FILES}; do Review comment: Yes, see: https://scans.gradle.com/s/544uyirsdn6n4/console-log#L206-L209 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 288163) Time Spent: 7h (was: 6h 50m) > Design Py3-compatible typehints annotation support in Beam 3. 
> > Key: BEAM-7060 > URL: https://issues.apache.org/jira/browse/BEAM-7060 > Project: Beam > Issue Type: Sub-task > Components: sdk-py-core > Reporter: Valentyn Tymofieiev > Assignee: Udi Meiri > Priority: Major > Time Spent: 7h > Remaining Estimate: 0h -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7060) Design Py3-compatible typehints annotation support in Beam 3.
[ https://issues.apache.org/jira/browse/BEAM-7060?focusedWorklogId=288157&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288157 ] ASF GitHub Bot logged work on BEAM-7060: Author: ASF GitHub Bot Created on: 02/Aug/19 18:36 Start Date: 02/Aug/19 18:36 Worklog Time Spent: 10m Work Description: udim commented on pull request #9223: [BEAM-7060] Introduce Python3-only test modules URL: https://github.com/apache/beam/pull/9223#discussion_r310251805 ## File path: sdks/python/scripts/run_pylint.sh ## @@ -60,23 +60,34 @@ EXCLUDED_GENERATED_FILES=( apache_beam/portability/api/*pb2*.py ) +PYTHON_MAJOR=$(python -c 'import sys; print(sys.version_info[0])') Review comment: done: https://issues.apache.org/jira/browse/BEAM-7884 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 288157) Time Spent: 6h 50m (was: 6h 40m) > Design Py3-compatible typehints annotation support in Beam 3. > > > Key: BEAM-7060 > URL: https://issues.apache.org/jira/browse/BEAM-7060 > Project: Beam > Issue Type: Sub-task > Components: sdk-py-core > Reporter: Valentyn Tymofieiev > Assignee: Udi Meiri > Priority: Major > Time Spent: 6h 50m > Remaining Estimate: 0h -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (BEAM-7884) Run pylint in Python 3
Udi Meiri created BEAM-7884: --- Summary: Run pylint in Python 3 Key: BEAM-7884 URL: https://issues.apache.org/jira/browse/BEAM-7884 Project: Beam Issue Type: Sub-task Components: sdk-py-core Reporter: Udi Meiri Currently run_pylint.sh only runs in tox environment py27-lint. py35-lint uses another script named run_mini_py3lint.sh. 1. Make run_pylint.sh run and pass for py35-lint. Might have to add some ignores to .pylintrc (such as useless-object-inheritance), and might have to rename some deprecated functions (such as s/assertRaisesRegexp/assertRaisesRegex/). 2. Make sure all tests in run_mini_py3lint.sh are in run_pylint.sh and remove run_mini_py3lint.sh. 3. (optional) Remove py27-lint? -- This message was sent by Atlassian JIRA (v7.6.14#76016)
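Step 1 above mentions renaming deprecated unittest aliases such as assertRaisesRegexp. As a small, hypothetical illustration (not code from the Beam repo), this is the Python 3 spelling the ticket asks to migrate to:

```python
import unittest

class RenameExample(unittest.TestCase):
    def test_regex_assertion(self):
        # assertRaisesRegex is the Python 3 name; assertRaisesRegexp is
        # the deprecated Python 2-era alias that a py3 lint run would flag.
        with self.assertRaisesRegex(ValueError, r"bad .* input"):
            raise ValueError("bad user input")

if __name__ == "__main__":
    unittest.main(exit=False, verbosity=0)
```

The alias still works on Python 3 but emits a DeprecationWarning, which is why a simple search-and-replace is usually sufficient.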
[jira] [Work logged] (BEAM-7868) Hidden Flink Runner parameters are dropped in python pipelines
[ https://issues.apache.org/jira/browse/BEAM-7868?focusedWorklogId=288156&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288156 ] ASF GitHub Bot logged work on BEAM-7868: Author: ASF GitHub Bot Created on: 02/Aug/19 18:34 Start Date: 02/Aug/19 18:34 Worklog Time Spent: 10m Work Description: angoenka commented on pull request #9211: [BEAM-7868] optionally skip hidden pipeline options URL: https://github.com/apache/beam/pull/9211 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 288156) Time Spent: 0.5h (was: 20m) > Hidden Flink Runner parameters are dropped in python pipelines > -- > > Key: BEAM-7868 > URL: https://issues.apache.org/jira/browse/BEAM-7868 > Project: Beam > Issue Type: Bug > Components: runner-flink >Reporter: Ankur Goenka >Assignee: Ankur Goenka >Priority: Major > Time Spent: 0.5h > Remaining Estimate: 0h > > Hidden pipeline options for Portable flink runner are not interpreted by > Python SDK. > An example of this is > ManualDockerEnvironmentOptions.getRetainDockerContainers which is not > interpreted in python sdk. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7060) Design Py3-compatible typehints annotation support in Beam 3.
[ https://issues.apache.org/jira/browse/BEAM-7060?focusedWorklogId=288155&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288155 ] ASF GitHub Bot logged work on BEAM-7060: Author: ASF GitHub Bot Created on: 02/Aug/19 18:29 Start Date: 02/Aug/19 18:29 Worklog Time Spent: 10m Work Description: tvalentyn commented on issue #9223: [BEAM-7060] Introduce Python3-only test modules URL: https://github.com/apache/beam/pull/9223#issuecomment-517801554 Run Python 2 PostCommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 288155) Time Spent: 6h 40m (was: 6.5h) -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7060) Design Py3-compatible typehints annotation support in Beam 3.
[ https://issues.apache.org/jira/browse/BEAM-7060?focusedWorklogId=288154&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288154 ] ASF GitHub Bot logged work on BEAM-7060: Author: ASF GitHub Bot Created on: 02/Aug/19 18:28 Start Date: 02/Aug/19 18:28 Worklog Time Spent: 10m Work Description: tvalentyn commented on pull request #9223: [BEAM-7060] Introduce Python3-only test modules URL: https://github.com/apache/beam/pull/9223#discussion_r310248802 ## File path: sdks/python/scripts/run_pylint.sh ## @@ -60,23 +60,34 @@ EXCLUDED_GENERATED_FILES=( apache_beam/portability/api/*pb2*.py ) +PYTHON_MAJOR=$(python -c 'import sys; print(sys.version_info[0])') +if [[ "${PYTHON_MAJOR}" == 2 ]]; then + EXCLUDED_PY3_FILES=$(find ${MODULE} | grep 'py3\.py$') + echo -e "Excluding Py3 files:\n${EXCLUDED_PY3_FILES}" +else + EXCLUDED_PY3_FILES="" +fi + FILES_TO_IGNORE="" -for file in "${EXCLUDED_GENERATED_FILES[@]}"; do +for file in "${EXCLUDED_GENERATED_FILES[@]}" ${EXCLUDED_PY3_FILES}; do Review comment: Does this loop work as intended when $EXCLUDED_PY3_FILES has more than one file? Alternative solution: https://github.com/apache/beam/pull/8505/files#diff-d91ac373b8d2235683a85576912b5db5R66 which requires: https://github.com/apache/beam/pull/8505/files#diff-d91ac373b8d2235683a85576912b5db5R87 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 288154) Time Spent: 6.5h (was: 6h 20m) -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Comment Edited] (BEAM-3645) Support multi-process execution on the FnApiRunner
[ https://issues.apache.org/jira/browse/BEAM-3645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16896575#comment-16896575 ] Hannah Jiang edited comment on BEAM-3645 at 8/2/19 6:24 PM: The direct runner can now process map tasks across multiple workers. Depending on the running environment, these workers run in multithreading or multiprocessing mode. *How to run with multiprocessing mode*: {code:java} # using the subprocess runner p = beam.Pipeline(options=pipeline_options, runner=fn_api_runner.FnApiRunner( default_environment=beam_runner_api_pb2.Environment( urn=python_urns.SUBPROCESS_SDK, payload=b'%s -m apache_beam.runners.worker.sdk_worker_main' % sys.executable.encode('ascii')))) {code} *How to run with multithreading mode:* {code:java} # using the in-memory embedded runner p = beam.Pipeline(options=pipeline_options) {code} {code:java} # using the embedded grpc runner p = beam.Pipeline(options=pipeline_options, runner=fn_api_runner.FnApiRunner( default_environment=beam_runner_api_pb2.Environment( urn=python_urns.EMBEDDED_PYTHON_GRPC, payload=b'1'))) # payload is the number of threads per worker.{code} The *--direct_num_workers* option controls the number of workers; the default value is 1. {code:java} # an example of passing it from the CLI. python wordcount.py --input xx --output xx --direct_num_workers 2 # an example of setting it with pipeline options directly. from apache_beam.options.pipeline_options import PipelineOptions pipeline_options = PipelineOptions(['--direct_num_workers', '2']) # an example of adding it to existing pipeline options. pipeline_options = xxx pipeline_options.view_as(DirectOptions).direct_num_workers = 2{code} was (Author: hannahjiang): Direct runner can now process map tasks across multiple workers. Depending on running environment, these workers are running in multithreading or multiprocessing mode.
*How to run with multiprocessing mode*: {code:java} # using the subprocess runner p = beam.Pipeline(options=pipeline_options, runner=fn_api_runner.FnApiRunner( default_environment=beam_runner_api_pb2.Environment( urn=python_urns.SUBPROCESS_SDK, payload=b'%s -m apache_beam.runners.worker.sdk_worker_main' % sys.executable.encode('ascii')))) {code} *How to run with multithreading mode:* {code:java} # using the in-memory embedded runner p = beam.Pipeline(options=pipeline_options) {code} {code:java} # using the embedded grpc runner p = beam.Pipeline(options=pipeline_options, runner=fn_api_runner.FnApiRunner( default_environment=beam_runner_api_pb2.Environment( urn=python_urns.EMBEDDED_PYTHON_GRPC, payload=b'1'))) # payload is the number of threads per worker.{code} The *--direct_num_workers* option controls the number of workers; the default value is 1. {code:java} # an example of passing --direct_num_workers to a job. python wordcount.py --input xx --output xx --direct_num_workers 2 # an example of setting it with pipeline options directly. from apache_beam.options.pipeline_options import PipelineOptions pipeline_options = PipelineOptions(['--direct_num_workers', '2']) # an example of adding it to existing pipeline options. pipeline_options = xxx pipeline_options.view_as(DirectOptions).direct_num_workers = 2{code} > Support multi-process execution on the FnApiRunner > -- > > Key: BEAM-3645 > URL: https://issues.apache.org/jira/browse/BEAM-3645 > Project: Beam > Issue Type: Improvement > Components: sdk-py-core >Affects Versions: 2.2.0, 2.3.0 >Reporter: Charles Chen >Assignee: Hannah Jiang >Priority: Major > Fix For: 2.15.0 > > Time Spent: 35h 20m > Remaining Estimate: 0h > > https://issues.apache.org/jira/browse/BEAM-3644 gave us a 15x performance > gain over the previous DirectRunner. We can do even better in multi-core > environments by supporting multi-process execution in the FnApiRunner, to > scale past Python GIL limitations. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=288150&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288150 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 02/Aug/19 18:17 Start Date: 02/Aug/19 18:17 Worklog Time Spent: 10m Work Description: zfraa commented on pull request #9144: [BEAM-7013] Integrating ZetaSketch's HLL++ algorithm with Beam URL: https://github.com/apache/beam/pull/9144#discussion_r310244911 ## File path: sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCount.java ## @@ -0,0 +1,350 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.beam.sdk.extensions.zetasketch; + +import com.google.zetasketch.HyperLogLogPlusPlus; +import org.apache.beam.sdk.annotations.Experimental; +import org.apache.beam.sdk.transforms.Combine; +import org.apache.beam.sdk.transforms.DoFn; +import org.apache.beam.sdk.transforms.PTransform; +import org.apache.beam.sdk.transforms.ParDo; +import org.apache.beam.sdk.values.KV; +import org.apache.beam.sdk.values.PCollection; + +/** + * {@code PTransform}s to compute HyperLogLogPlusPlus (HLL++) sketches on data streams based on the + * <a href="https://github.com/google/zetasketch">ZetaSketch</a> implementation. + * + * HLL++ is an algorithm implemented by Google that estimates the count of distinct elements in a + * data stream. HLL++ requires significantly less memory than the linear memory needed for exact + * computation, at the cost of a small error. Cardinalities of arbitrary breakdowns can be computed + * using the HLL++ sketch. See this <a href="http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/40671.pdf">published + * paper</a> for details about the algorithm. + * + * HLL++ functions are also supported in <a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/hll_functions">Google Cloud + * BigQuery</a>. Using the {@code HllCount PTransform}s makes the interoperation with BigQuery + * easier. + * + * For the detailed design of this class, see https://s.apache.org/hll-in-beam.
+ * + * Examples + * + * Example 1: Create long-type sketch for a {@code PCollection} and specify precision + * + * {@code + * PCollection input = ...; + * int p = ...; + * PCollection sketch = input.apply(HllCount.Init.longSketch().withPrecision(p).globally()); + * } + * + * Example 2: Create bytes-type sketch for a {@code PCollection>} + * + * {@code + * PCollection> input = ...; + * PCollection> sketch = input.apply(HllCount.Init.bytesSketch().perKey()); + * } + * + * Example 3: Merge existing sketches in a {@code PCollection} into a new one + * + * {@code + * PCollection sketches = ...; + * PCollection mergedSketch = sketches.apply(HllCount.MergePartial.globally()); + * } + * + * Example 4: Estimate the count of distinct elements in a {@code PCollection} + * + * {@code + * PCollection input = ...; + * PCollection countDistinct = + * input.apply(HllCount.Init.stringSketch().globally()).apply(HllCount.Extract.globally()); + * } + * + * Note: Currently HllCount does not work on FnAPI workers. See <a href="https://issues.apache.org/jira/browse/BEAM-7879">Jira ticket [BEAM-7879]</a>. + */ +@Experimental +public final class HllCount { + + public static final int MINIMUM_PRECISION = HyperLogLogPlusPlus.MINIMUM_PRECISION; Review comment: Ah, got it! But the Javadoc does not expand the actual numbers, or does it? Maybe leave a comment with "constants replicated here for documentation purposes / to make them easy to find for users"? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 288150) Time Spent: 9h 40m (was: 9.5h) > A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation
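The Javadoc above describes HLL++ at a high level. As a rough illustration of the underlying idea only — this is a toy plain-HyperLogLog sketch, not Google's HLL++/ZetaSketch implementation, and its output is not BigQuery-compatible — the distinct-count estimation can be sketched in Python:

```python
import hashlib
import math

def stable_hash64(value):
    # Deterministic 64-bit hash; any well-mixed hash function would do.
    return int.from_bytes(hashlib.md5(str(value).encode()).digest()[:8], "big")

class ToyHyperLogLog:
    """Plain HyperLogLog: estimates distinct counts using O(2^p) registers."""

    def __init__(self, p=10):
        self.p = p              # precision: 2^p single-byte registers
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, value):
        x = stable_hash64(value)
        index = x >> (64 - self.p)               # first p bits pick a register
        rest = x & ((1 << (64 - self.p)) - 1)    # remaining 64 - p bits
        # rank = 1-based position of the leftmost 1-bit in `rest`
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        if raw <= 2.5 * self.m:                  # small-range correction
            zeros = self.registers.count(0)
            if zeros:
                return self.m * math.log(self.m / zeros)
        return raw
```

With p=10 the standard error is roughly 1.04/sqrt(1024) ≈ 3%. HLL++ layers bias correction and a sparse representation on top of this idea, which is what makes the ZetaSketch/BigQuery variant both more accurate and interoperable.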
[jira] [Work logged] (BEAM-7060) Design Py3-compatible typehints annotation support in Beam 3.
[ https://issues.apache.org/jira/browse/BEAM-7060?focusedWorklogId=288145&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288145 ] ASF GitHub Bot logged work on BEAM-7060: Author: ASF GitHub Bot Created on: 02/Aug/19 18:15 Start Date: 02/Aug/19 18:15 Worklog Time Spent: 10m Work Description: chamikaramj commented on pull request #9223: [BEAM-7060] Introduce Python3-only test modules URL: https://github.com/apache/beam/pull/9223#discussion_r310242868 ## File path: sdks/python/scripts/run_pylint.sh ## @@ -60,23 +60,34 @@ EXCLUDED_GENERATED_FILES=( apache_beam/portability/api/*pb2*.py ) +PYTHON_MAJOR=$(python -c 'import sys; print(sys.version_info[0])') Review comment: Please add a JIRA to do lint checks for *py3.py files. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 288145) Time Spent: 6h 20m (was: 6h 10m) -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (BEAM-7874) FnApi only supports up to 10 workers
[ https://issues.apache.org/jira/browse/BEAM-7874?focusedWorklogId=288141&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288141 ] ASF GitHub Bot logged work on BEAM-7874: Author: ASF GitHub Bot Created on: 02/Aug/19 18:13 Start Date: 02/Aug/19 18:13 Worklog Time Spent: 10m Work Description: Hannah-Jiang commented on issue #9218: [BEAM-7874], [BEAM-7873] FnApi bugfixs URL: https://github.com/apache/beam/pull/9218#issuecomment-517416968 R: @robertwb Can you please take a look and merge if it looks good? Thank you. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 288141) Time Spent: 50m (was: 40m) > FnApi only supports up to 10 workers > > > Key: BEAM-7874 > URL: https://issues.apache.org/jira/browse/BEAM-7874 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Hannah Jiang >Assignee: Hannah Jiang >Priority: Major > Fix For: 2.15.0 > > Time Spent: 50m > Remaining Estimate: 0h > > Because max_workers of the grpc servers is hardcoded to 10, only up to 10 > workers are supported, and if direct_num_workers is greater than 10, the > pipeline hangs, because not all workers get connected to the runner. > [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/fn_api_runner.py#L1141] -- This message was sent by Atlassian JIRA (v7.6.14#76016)
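The hang described in BEAM-7874 is a thread-starvation pattern: every worker holds a long-lived connection, so the server's thread pool must be at least as large as the worker count. A hypothetical, Beam-free sketch of the mechanism (concurrent.futures standing in for the gRPC server's thread pool):

```python
import threading
import concurrent.futures as cf

NUM_WORKERS = 12  # analogue of direct_num_workers

# Each "worker connection" blocks until every worker has checked in,
# mimicking SDK workers that all hold an open stream to the runner.
barrier = threading.Barrier(NUM_WORKERS)

def worker_connection(i):
    barrier.wait(timeout=30)  # raises BrokenBarrierError if starved
    return i

# The fix amounts to sizing the pool to the worker count. With the old
# hardcoded max_workers=10 and 12 blocking connections, two connections
# would never get a thread, the barrier would never be satisfied, and
# everything would hang -- the "pipeline hangs" symptom from the issue.
with cf.ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
    results = list(pool.map(worker_connection, range(NUM_WORKERS)))
```

In gRPC Python terms, the pool corresponds to the `futures.ThreadPoolExecutor` passed to `grpc.server(...)`; the sketch only models the sizing constraint, not Beam's actual server code.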
[jira] [Work logged] (BEAM-7874) FnApi only supports up to 10 workers
[ https://issues.apache.org/jira/browse/BEAM-7874?focusedWorklogId=288143&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288143 ] ASF GitHub Bot logged work on BEAM-7874: Author: ASF GitHub Bot Created on: 02/Aug/19 18:13 Start Date: 02/Aug/19 18:13 Worklog Time Spent: 10m Work Description: Hannah-Jiang commented on pull request #9218: [BEAM-7874], [BEAM-7873] FnApi bugfixs URL: https://github.com/apache/beam/pull/9218#discussion_r310240952 ## File path: sdks/python/apache_beam/runners/portability/local_job_service.py ## @@ -200,10 +202,13 @@ def run(self): if self._worker_id: env_dict['WORKER_ID'] = self._worker_id -p = subprocess.Popen( -self._worker_command_line, -shell=True, -env=env_dict) +# deadlock would happen with py2, so add a lock here. +# no problem with py3. +with SubprocessSdkWorker._lock: + p = subprocess.Popen( Review comment: It turns out that deadlock happens with Popen(), rather than with p.wait(). Ran 1500 times of wordcount.py with 3 workers and confirmed deadlock doesn't happen. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 288143) Time Spent: 1h (was: 50m) > FnApi only supports up to 10 workers > > > Key: BEAM-7874 > URL: https://issues.apache.org/jira/browse/BEAM-7874 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Hannah Jiang >Assignee: Hannah Jiang >Priority: Major > Fix For: 2.15.0 > > Time Spent: 1h > Remaining Estimate: 0h > > Because max_workers of grpc servers are hardcoded to 10, it only supports up > to 10 workers, and if we pass more direct_num_workers greater than 10, > pipeline hangs, because not all workers get connected to the runner. 
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/fn_api_runner.py#L1141]
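The workaround discussed in the review comment above can be sketched standalone: a class-level lock that serializes `Popen()` calls across threads. The class name here is illustrative, not Beam's actual `SubprocessSdkWorker`:

```python
import subprocess
import threading

class SubprocessWorkerSketch:
    # One lock shared by all instances, so concurrent worker launches
    # serialize their Popen() calls. Python 2's subprocess module did
    # no locking around fork/exec, which could deadlock under threads;
    # Python 3 locks internally, making this guard redundant there.
    _lock = threading.Lock()

    def start(self, command_line):
        with SubprocessWorkerSketch._lock:
            return subprocess.Popen(command_line, shell=True)

proc = SubprocessWorkerSketch().start('exit 0')
print(proc.wait())  # 0
```

Note the lock covers only process creation, matching the finding that the deadlock happens in `Popen()` itself rather than in `p.wait()`.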
[jira] [Work logged] (BEAM-7060) Design Py3-compatible typehints annotation support in Beam 3.
[ https://issues.apache.org/jira/browse/BEAM-7060?focusedWorklogId=288140&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288140 ] ASF GitHub Bot logged work on BEAM-7060: Author: ASF GitHub Bot Created on: 02/Aug/19 18:12 Start Date: 02/Aug/19 18:12 Worklog Time Spent: 10m Work Description: chamikaramj commented on pull request #9223: [BEAM-7060] Introduce Python3-only test modules URL: https://github.com/apache/beam/pull/9223#discussion_r310242868 ## File path: sdks/python/scripts/run_pylint.sh ## @@ -60,23 +60,34 @@ EXCLUDED_GENERATED_FILES=( apache_beam/portability/api/*pb2*.py ) +PYTHON_MAJOR=$(python -c 'import sys; print(sys.version_info[0])') Review comment: Please add a JIRA to do lint checks for Python 3. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 288140) Time Spent: 6h 10m (was: 6h) > Design Py3-compatible typehints annotation support in Beam 3. > - > > Key: BEAM-7060 > URL: https://issues.apache.org/jira/browse/BEAM-7060 > Project: Beam > Issue Type: Sub-task > Components: sdk-py-core >Reporter: Valentyn Tymofieiev >Assignee: Udi Meiri >Priority: Major > Time Spent: 6h 10m > Remaining Estimate: 0h > > Existing [Typehints implementaiton in > Beam|[https://github.com/apache/beam/blob/master/sdks/python/apache_beam/typehints/ > ] heavily relies on internal details of CPython implementation, and some of > the assumptions of this implementation broke as of Python 3.6, see for > example: https://issues.apache.org/jira/browse/BEAM-6877, which makes > typehints support unusable on Python 3.6 as of now. 
[Python 3 Kanban > Board|https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=245&view=detail] > lists several specific typehints-related breakages, prefixed with "TypeHints > Py3 Error". > We need to decide whether to: > - Deprecate the in-house typehints implementation. > - Continue to support the in-house implementation, which at this point is stale > code and has other known issues. > - Attempt to use some off-the-shelf libraries for supporting > type annotations, such as Pytype, Mypy, or PyAnnotate. > With regard to this decision, we also need to plan immediate next steps to unblock > adoption of Beam for Python 3.6+ users. One potential option may be to have > the Beam SDK ignore any typehint annotations on Py 3.6+. > cc: [~udim], [~altay], [~robertwb].
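The `PYTHON_MAJOR` gate added to run_pylint.sh in the diff above can be mirrored in Python. A hypothetical helper (the `_py3.py` suffix is assumed here for illustration; the actual naming convention for the Python3-only test modules is defined by the PR):

```python
import sys

def files_to_lint(paths, major=None):
    # Skip Python3-only modules (illustrated by a hypothetical
    # "_py3.py" suffix) when linting under Python 2, mirroring the
    # shell-side version gate in run_pylint.sh.
    if major is None:
        major = sys.version_info[0]
    if major >= 3:
        return list(paths)
    return [p for p in paths if not p.endswith('_py3.py')]

print(files_to_lint(['a.py', 'b_py3.py'], major=2))  # ['a.py']
```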
[jira] [Work logged] (BEAM-7874) FnApi only supports up to 10 workers
[ https://issues.apache.org/jira/browse/BEAM-7874?focusedWorklogId=288137&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288137 ] ASF GitHub Bot logged work on BEAM-7874: Author: ASF GitHub Bot Created on: 02/Aug/19 18:06 Start Date: 02/Aug/19 18:06 Worklog Time Spent: 10m Work Description: Hannah-Jiang commented on pull request #9218: [BEAM-7874], [BEAM-7873] FnApi bugfixs URL: https://github.com/apache/beam/pull/9218#discussion_r310240952 ## File path: sdks/python/apache_beam/runners/portability/local_job_service.py ## @@ -200,10 +202,13 @@ def run(self): if self._worker_id: env_dict['WORKER_ID'] = self._worker_id -p = subprocess.Popen( -self._worker_command_line, -shell=True, -env=env_dict) +# deadlock would happen with py2, so add a lock here. +# no problem with py3. +with SubprocessSdkWorker._lock: + p = subprocess.Popen( Review comment: It turns out that deadlock happens with Popen(), rather than with p.wait(). Ran 1500 of wordcount.py with 3 workers and confirmed deadlock doesn't happen. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 288137) Time Spent: 40m (was: 0.5h) > FnApi only supports up to 10 workers > > > Key: BEAM-7874 > URL: https://issues.apache.org/jira/browse/BEAM-7874 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Hannah Jiang >Assignee: Hannah Jiang >Priority: Major > Fix For: 2.15.0 > > Time Spent: 40m > Remaining Estimate: 0h > > Because max_workers of grpc servers are hardcoded to 10, it only supports up > to 10 workers, and if we pass more direct_num_workers greater than 10, > pipeline hangs, because not all workers get connected to the runner. 
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/fn_api_runner.py#L1141]
[jira] [Updated] (BEAM-7873) FnApi with Subprocess runner hangs frequently when running with multi workers with py2
[ https://issues.apache.org/jira/browse/BEAM-7873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hannah Jiang updated BEAM-7873: --- Description: Pipeline hangs at [subprocess.Popen()|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/local_job_service.py#L203] when shutting it down. I looked into the source code of the subprocess lib. [py27|https://github.com/enthought/Python-2.7.3/blob/master/Lib/subprocess.py#L1286] doesn't do any locking, while [py3|https://github.com/python/cpython/blob/3.7/Lib/subprocess.py#L1592] locks when waiting. Py3 added locks at other places in Popen() as well; all the places left unlocked in py2 may contribute to the problem. We can add a lock when calling Popen() to prevent the deadlock. (was: Pipeline hangs at [subprocess.Popen()|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/local_job_service.py#L203]when shut it down. I looked into source code of subprocess lib. [py27|https://github.com/enthought/Python-2.7.3/blob/master/Lib/subprocess.py#L1286] doesn't do any lock while [py3|https://github.com/python/cpython/blob/3.7/Lib/subprocess.py#L1592] locks when waiting. Py3 added locks at other places of Popen() as well, all unlocked places with py2 may contribute to the problem. We can add a lock when calling Popen() to prevent deadlock. 
{{[subprocess.Popen|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/local_job_service.py#L203]}} {{[subprocess.Popen()|http://atlassian.com/]}} ) > FnApi with Subprocess runner hangs frequently when running with multi workers > with py2 > -- > > Key: BEAM-7873 > URL: https://issues.apache.org/jira/browse/BEAM-7873 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Hannah Jiang >Assignee: Hannah Jiang >Priority: Major > Fix For: 2.15.0 > > > Pipeline hangs at > [subprocess.Popen()|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/local_job_service.py#L203] > when shut it down. I looked into source code of subprocess lib. > [py27|https://github.com/enthought/Python-2.7.3/blob/master/Lib/subprocess.py#L1286] > doesn't do any lock while > [py3|https://github.com/python/cpython/blob/3.7/Lib/subprocess.py#L1592] > locks when waiting. Py3 added locks at other places of Popen() as well, all > unlocked places with py2 may contribute to the problem. We can add a lock > when calling Popen() to prevent the deadlock. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
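The py2/py3 difference described in the issue is observable from the outside: CPython 3 guards `waitpid()` with a per-Popen lock, which CPython 2 lacked. The attribute probed below is a private implementation detail of CPython 3 (not a stable API), shown only to make the point concrete:

```python
import subprocess

# Launch any short-lived command; ['true'] assumes a POSIX system.
p = subprocess.Popen(['true'])
p.wait()

# CPython 3.x serializes wait() internally via this private lock;
# CPython 2's Popen had no such guard, hence the explicit lock
# workaround applied in local_job_service.py.
print(hasattr(p, '_waitpid_lock'))  # True on CPython 3.x
```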
[jira] [Updated] (BEAM-7873) FnApi with Subprocess runner hangs frequently when running with multi workers with py2
[ https://issues.apache.org/jira/browse/BEAM-7873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hannah Jiang updated BEAM-7873: --- Description: Pipeline hangs at [Popen()|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/local_job_service.py#L203] when shutting it down. I looked into the source code of the subprocess lib. [py27|https://github.com/enthought/Python-2.7.3/blob/master/Lib/subprocess.py#L1286] doesn't do any locking, while [py3|https://github.com/python/cpython/blob/3.7/Lib/subprocess.py#L1592] locks when waiting. Py3 added locks at other places in Popen() as well; all the places left unlocked in py2 may contribute to the problem. We can add a lock when calling Popen() to prevent the deadlock. was: Pipeline hangs at [Popen()|[https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/local_job_service.py#L203]]when shut it down. I looked into source code of subprocess lib. [py27|https://github.com/enthought/Python-2.7.3/blob/master/Lib/subprocess.py#L1286] doesn't do any lock while [py3|https://github.com/python/cpython/blob/3.7/Lib/subprocess.py#L1592] locks when waiting. Py3 added locks at other places of Popen() as well, all unlocked places with py2 may contribute to the problem. We can add a lock when calling Popen() to prevent deadlock. 
{{[subprocess.Popen|[https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/local_job_service.py#L203]]}} > FnApi with Subprocess runner hangs frequently when running with multi workers > with py2 > -- > > Key: BEAM-7873 > URL: https://issues.apache.org/jira/browse/BEAM-7873 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Hannah Jiang >Assignee: Hannah Jiang >Priority: Major > Fix For: 2.15.0 > > > Pipeline hangs at > [Popen()|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/local_job_service.py#L203] when > shutting it down. I looked into the source code of the subprocess lib. > [py27|https://github.com/enthought/Python-2.7.3/blob/master/Lib/subprocess.py#L1286] > doesn't do any locking, while > [py3|https://github.com/python/cpython/blob/3.7/Lib/subprocess.py#L1592] > locks when waiting. Py3 added locks at other places in Popen() as well; all > the places left unlocked in py2 may contribute to the problem. We can add a lock > when calling Popen() to prevent the deadlock. > -- This message was sent by Atlassian JIRA (v7.6.14#76016)