[jira] [Work logged] (BEAM-7060) Design Py3-compatible typehints annotation support in Beam 3.

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7060?focusedWorklogId=288365&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288365
 ]

ASF GitHub Bot logged work on BEAM-7060:


Author: ASF GitHub Bot
Created on: 03/Aug/19 02:21
Start Date: 03/Aug/19 02:21
Worklog Time Spent: 10m 
  Work Description: udim commented on issue #9239: [BEAM-7060] Fix 
:sdks:python:apache_beam:testing:load_tests:run breakage
URL: https://github.com/apache/beam/pull/9239#issuecomment-517886877
 
 
   Run Python Load Tests Smoke
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 288365)
Time Spent: 9h  (was: 8h 50m)

> Design Py3-compatible typehints annotation support in Beam 3.
> -
>
> Key: BEAM-7060
> URL: https://issues.apache.org/jira/browse/BEAM-7060
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Udi Meiri
>Priority: Major
>  Time Spent: 9h
>  Remaining Estimate: 0h
>
> Existing [Typehints implementation in 
> Beam|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/typehints/] 
> heavily relies on internal details of the CPython implementation, and some of 
> the assumptions of this implementation broke as of Python 3.6; see, for 
> example, https://issues.apache.org/jira/browse/BEAM-6877, which makes 
> typehints support unusable on Python 3.6 as of now. The [Python 3 Kanban 
> Board|https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=245&view=detail]
>  lists several specific typehints-related breakages, prefixed with "TypeHints 
> Py3 Error".
> We need to decide whether to:
> - Deprecate the in-house typehints implementation.
> - Continue to support the in-house implementation, which at this point is 
> stale code and has other known issues.
> - Attempt to use off-the-shelf libraries for supporting type annotations, 
> such as Pytype, Mypy, or PyAnnotate.
> With respect to this decision, we also need to plan immediate next steps to 
> unblock adoption of Beam for Python 3.6+ users. One potential option may be 
> to have the Beam SDK ignore any typehint annotations on Py 3.6+.
> cc: [~udim], [~altay], [~robertwb].
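The CPython-internals dependence described in the quoted issue can be illustrated with a small, hedged sketch (not Beam code): taking a parameterized `typing` hint apart relies on dunder attributes whose shape changed between CPython releases.

```python
import typing

# Illustrative sketch, not Beam code: introspecting a parameterized hint.
hint = typing.List[int]

# The type arguments are exposed via __args__ (typing.List[int] -> (int,)).
args = getattr(hint, '__args__', ())

# __origin__ changed shape across releases: on Python 3.6 it was
# typing.List, on 3.7+ it is the builtin list. Code that hard-coded one of
# those shapes broke when CPython changed it, which is the class of
# breakage the issue describes.
origin = getattr(hint, '__origin__', None)
```

Off-the-shelf tools such as Mypy track these version-to-version changes, which is part of the appeal of the third option listed above.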



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7060) Design Py3-compatible typehints annotation support in Beam 3.

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7060?focusedWorklogId=288366&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288366
 ]

ASF GitHub Bot logged work on BEAM-7060:


Author: ASF GitHub Bot
Created on: 03/Aug/19 02:21
Start Date: 03/Aug/19 02:21
Worklog Time Spent: 10m 
  Work Description: udim commented on issue #9239: [BEAM-7060] Fix 
:sdks:python:apache_beam:testing:load_tests:run breakage
URL: https://github.com/apache/beam/pull/9239#issuecomment-517886896
 
 
   Run Java Load Tests Smoke
 



Issue Time Tracking
---

Worklog Id: (was: 288366)
Time Spent: 9h 10m  (was: 9h)



[jira] [Work logged] (BEAM-7060) Design Py3-compatible typehints annotation support in Beam 3.

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7060?focusedWorklogId=288363&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288363
 ]

ASF GitHub Bot logged work on BEAM-7060:


Author: ASF GitHub Bot
Created on: 03/Aug/19 02:20
Start Date: 03/Aug/19 02:20
Worklog Time Spent: 10m 
  Work Description: udim commented on issue #9239: [BEAM-7060] Fix 
:sdks:python:apache_beam:testing:load_tests:run breakage
URL: https://github.com/apache/beam/pull/9239#issuecomment-517886840
 
 
   R: @aaltay 
 



Issue Time Tracking
---

Worklog Id: (was: 288363)
Time Spent: 8h 40m  (was: 8.5h)



[jira] [Work logged] (BEAM-7060) Design Py3-compatible typehints annotation support in Beam 3.

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7060?focusedWorklogId=288364&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288364
 ]

ASF GitHub Bot logged work on BEAM-7060:


Author: ASF GitHub Bot
Created on: 03/Aug/19 02:20
Start Date: 03/Aug/19 02:20
Worklog Time Spent: 10m 
  Work Description: udim commented on issue #9239: [BEAM-7060] Fix 
:sdks:python:apache_beam:testing:load_tests:run breakage
URL: https://github.com/apache/beam/pull/9239#issuecomment-517886867
 
 
   Run Java Load Tests Smoke
 



Issue Time Tracking
---

Worklog Id: (was: 288364)
Time Spent: 8h 50m  (was: 8h 40m)



[jira] [Work logged] (BEAM-7060) Design Py3-compatible typehints annotation support in Beam 3.

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7060?focusedWorklogId=288362&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288362
 ]

ASF GitHub Bot logged work on BEAM-7060:


Author: ASF GitHub Bot
Created on: 03/Aug/19 02:18
Start Date: 03/Aug/19 02:18
Worklog Time Spent: 10m 
  Work Description: udim commented on issue #9239: [BEAM-7060] Fix 
:sdks:python:apache_beam:testing:load_tests:run breakage
URL: https://github.com/apache/beam/pull/9239#issuecomment-517886706
 
 
   run java precommit
 



Issue Time Tracking
---

Worklog Id: (was: 288362)
Time Spent: 8.5h  (was: 8h 20m)



[jira] [Work logged] (BEAM-7860) v1new ReadFromDatastore returns duplicates if keys are of mixed types

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7860?focusedWorklogId=288360&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288360
 ]

ASF GitHub Bot logged work on BEAM-7860:


Author: ASF GitHub Bot
Created on: 03/Aug/19 02:15
Start Date: 03/Aug/19 02:15
Worklog Time Spent: 10m 
  Work Description: udim commented on pull request #9240: [BEAM-7860] 
Python Datastore: fix key sort order
URL: https://github.com/apache/beam/pull/9240
 
 
   This is a regression from the v1 client.
   
   
   
   
   Thank you for your contribution! Follow this checklist to help us 
incorporate your contribution quickly and easily:
   
- [ ] [**Choose 
reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and 
mention them in a comment (`R: @username`).
- [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in 
ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA 
issue, if applicable. This will automatically link the pull request to the 
issue.
- [ ] If this contribution is large, please file an Apache [Individual 
Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
   
   Post-Commit Tests Status (on master branch)
   

   

[jira] [Work logged] (BEAM-7860) v1new ReadFromDatastore returns duplicates if keys are of mixed types

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7860?focusedWorklogId=288361&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288361
 ]

ASF GitHub Bot logged work on BEAM-7860:


Author: ASF GitHub Bot
Created on: 03/Aug/19 02:15
Start Date: 03/Aug/19 02:15
Worklog Time Spent: 10m 
  Work Description: udim commented on issue #9240: [BEAM-7860] Python 
Datastore: fix key sort order
URL: https://github.com/apache/beam/pull/9240#issuecomment-517886543
 
 
   Run Java Load Tests Smoke
 



Issue Time Tracking
---

Worklog Id: (was: 288361)
Time Spent: 20m  (was: 10m)

> v1new ReadFromDatastore returns duplicates if keys are of mixed types
> -
>
> Key: BEAM-7860
> URL: https://issues.apache.org/jira/browse/BEAM-7860
> Project: Beam
>  Issue Type: Bug
>  Components: io-python-gcp
>Affects Versions: 2.13.0
> Environment: Python 2.7
> Python 3.7
>Reporter: Niels Stender
>Assignee: Udi Meiri
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> In the presence of mixed-type keys, v1new ReadFromDatastore may return 
> duplicate items. The attached example returns 4 records, not the expected 3.
>  
> {code:python}
> from __future__ import unicode_literals
>
> import apache_beam as beam
> from apache_beam.io.gcp.datastore.v1new.types import Key, Entity, Query
> from apache_beam.io.gcp.datastore.v1new import datastoreio
>
> config = dict(project='your-google-project', namespace='test')
>
> def test_mixed():
>     keys = [
>         Key(['mixed', '10038260-iperm_eservice'], **config),
>         Key(['mixed', 4812224868188160], **config),
>         Key(['mixed', '99152975-pointshop'], **config),
>     ]
>     entities = map(lambda key: Entity(key=key), keys)
>     with beam.Pipeline() as p:
>         (p
>          | beam.Create(entities)
>          | datastoreio.WriteToDatastore(project=config['project']))
>     query = Query(kind='mixed', **config)
>     with beam.Pipeline() as p:
>         (p
>          | datastoreio.ReadFromDatastore(query=query, num_splits=4)
>          | beam.io.WriteToText('tmp.txt', num_shards=1,
>                                shard_name_template=''))
>     items = open('tmp.txt').read().strip().split('\n')
>     assert len(items) == 3, 'incorrect number of items'
> {code}
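For context on the sort-order fix the PR title mentions, here is a hypothetical sketch (the helper name is invented; the ordering rule is Cloud Datastore's documented behavior): Datastore orders key path elements with numeric ids before string names, so sorting mixed keys naively (e.g. by their string form) disagrees with the service and lets query splits overlap, which is the kind of mismatch that produces duplicates.

```python
# Hypothetical helper, not the actual Beam fix: order mixed id/name key
# path elements the way Cloud Datastore does -- numeric ids first, then
# names lexicographically.
def datastore_sort_key(id_or_name):
    if isinstance(id_or_name, int):
        return (0, id_or_name, '')
    return (1, 0, id_or_name)

keys = ['10038260-iperm_eservice', 4812224868188160, '99152975-pointshop']
ordered = sorted(keys, key=datastore_sort_key)
# The integer id sorts first; sorting by str(key) instead would put it in
# the middle, so split boundaries computed that way could overlap.
```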





[jira] [Work logged] (BEAM-7819) PubsubMessage message parsing is lacking non-attribute fields

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7819?focusedWorklogId=288357&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288357
 ]

ASF GitHub Bot logged work on BEAM-7819:


Author: ASF GitHub Bot
Created on: 03/Aug/19 01:49
Start Date: 03/Aug/19 01:49
Worklog Time Spent: 10m 
  Work Description: aaltay commented on pull request #9232: [BEAM-7819] 
Python - parse PubSub message_id into attributes property
URL: https://github.com/apache/beam/pull/9232#discussion_r310334401
 
 

 ##
 File path: sdks/python/apache_beam/io/gcp/pubsub.py
 ##
 @@ -127,6 +127,10 @@ def _from_message(msg):
 """
 # Convert ScalarMapContainer to dict.
 attributes = dict((key, msg.attributes[key]) for key in msg.attributes)
+# Parse the PubSub message_id and add to attributes
+if msg.message_id != None:
 
 Review comment:
   Could message_id be None?
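The review question is on point given proto3 semantics: scalar string fields on a parsed message are never None; an unset field reads back as the empty string. A minimal stand-in (a plain class, not a real protobuf message, though real protos behave the same way for string fields) shows the distinction:

```python
class FakeMessage(object):
    """Stand-in for a parsed PubsubMessage whose message_id is unset."""
    message_id = ''

msg = FakeMessage()

# Always true for proto3 string fields, so an `!= None` guard filters nothing:
is_not_none = msg.message_id is not None

# The meaningful test for "field was set" is truthiness (unset == ''):
has_id = bool(msg.message_id)
```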
 



Issue Time Tracking
---

Worklog Id: (was: 288357)
Time Spent: 0.5h  (was: 20m)

> PubsubMessage message parsing is lacking non-attribute fields
> -
>
> Key: BEAM-7819
> URL: https://issues.apache.org/jira/browse/BEAM-7819
> Project: Beam
>  Issue Type: Bug
>  Components: io-python-gcp
>Reporter: Ahmet Altay
>Assignee: Udi Meiri
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> User reported issue: 
> https://lists.apache.org/thread.html/139b0c15abc6471a2e2202d76d915c645a529a23ecc32cd9cfecd315@%3Cuser.beam.apache.org%3E
> """
> Looking at the source code, with my untrained python eyes, I think if the 
> intention is to include the message id and the publish time in the attributes 
> attribute of the PubSubMessage type, then the protobuf mapping is missing 
> something:-
> @staticmethod
> def _from_proto_str(proto_msg):
>     """Construct from serialized form of ``PubsubMessage``.
>
>     Args:
>       proto_msg: String containing a serialized protobuf of type
>         https://cloud.google.com/pubsub/docs/reference/rpc/google.pubsub.v1#google.pubsub.v1.PubsubMessage
>
>     Returns:
>       A new PubsubMessage object.
>     """
>     msg = pubsub.types.pubsub_pb2.PubsubMessage()
>     msg.ParseFromString(proto_msg)
>     # Convert ScalarMapContainer to dict.
>     attributes = dict((key, msg.attributes[key]) for key in msg.attributes)
>     return PubsubMessage(msg.data, attributes)
> The protobuf definition is here:
> https://cloud.google.com/pubsub/docs/reference/rpc/google.pubsub.v1#google.pubsub.v1.PubsubMessage
> and so it looks as if the message_id and publish_time are not being parsed, 
> as they are separate from the attributes. Perhaps the PubsubMessage class 
> needs expanding to include these as attributes, or they would need adding to 
> the dictionary of attributes. This would only need doing for 
> _from_proto_str, as obviously they would not need to be populated when 
> transmitting a message to PubSub.
> My python is not great; I'm assuming the latter option would need to look 
> something like this?
>     attributes = dict((key, msg.attributes[key]) for key in msg.attributes)
>     attributes.update({'message_id': msg.message_id,
>                        'publish_time': msg.publish_time})
>     return PubsubMessage(msg.data, attributes)
> """





[jira] [Created] (BEAM-7888) Test Multi Process Direct Runner With Largish Data

2019-08-02 Thread Ahmet Altay (JIRA)
Ahmet Altay created BEAM-7888:
-

 Summary: Test Multi Process Direct Runner With Largish Data
 Key: BEAM-7888
 URL: https://issues.apache.org/jira/browse/BEAM-7888
 Project: Beam
  Issue Type: Bug
  Components: sdk-py-core
Reporter: Ahmet Altay
Assignee: Hannah Jiang


Filing this as a tracker.

We can test the multiprocess runner with a largish amount of data, to the 
extent that we can do this on Jenkins. This will serve two purposes:
- Find issues related to multiprocessing. It is easier to find rare 
issues when running over non-trivial data.
- Serve as a baseline (if not a benchmark) to understand the limits of the 
multiprocess runner.





[jira] [Updated] (BEAM-7874) FnApi only supports up to 10 workers

2019-08-02 Thread yifan zou (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yifan zou updated BEAM-7874:

Priority: Blocker  (was: Major)

> FnApi only supports up to 10 workers
> 
>
> Key: BEAM-7874
> URL: https://issues.apache.org/jira/browse/BEAM-7874
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Hannah Jiang
>Assignee: Hannah Jiang
>Priority: Blocker
> Fix For: 2.15.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Because max_workers of the grpc servers is hardcoded to 10, FnApi only 
> supports up to 10 workers; if we pass a direct_num_workers greater than 10, 
> the pipeline hangs, because not all workers get connected to the runner.
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/fn_api_runner.py#L1141]
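The failure mode can be sketched in stdlib terms (names like `make_pool` are illustrative, not Beam's): a gRPC server backed by a ThreadPoolExecutor with `max_workers=10` can service at most 10 long-lived worker connections, so the 11th connection just queues, which presents as a hang. One plausible shape of a fix is sizing the pool from the configured worker count:

```python
from concurrent import futures

def make_pool(direct_num_workers, floor=10):
    # Keep the historical default of 10 threads, but grow the pool when
    # more workers are configured, so every worker can hold a connection.
    return futures.ThreadPoolExecutor(
        max_workers=max(floor, direct_num_workers))

# With 12 workers configured, the pool no longer caps out at 10.
big = make_pool(direct_num_workers=12)
small = make_pool(direct_num_workers=4)
```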





[jira] [Updated] (BEAM-7873) FnApi with Subprocess runner hangs frequently when running with multi workers with py2

2019-08-02 Thread yifan zou (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yifan zou updated BEAM-7873:

Priority: Blocker  (was: Major)

> FnApi with Subprocess runner hangs frequently when running with multi workers 
> with py2
> --
>
> Key: BEAM-7873
> URL: https://issues.apache.org/jira/browse/BEAM-7873
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Hannah Jiang
>Assignee: Hannah Jiang
>Priority: Blocker
> Fix For: 2.15.0
>
>
> Pipeline hangs at 
> [subprocess.Popen()|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/local_job_service.py#L203]
>  when shutting it down. I looked into the source code of the subprocess lib. 
> [py27|https://github.com/enthought/Python-2.7.3/blob/master/Lib/subprocess.py#L1286]
>  doesn't take any lock, while 
> [py3|https://github.com/python/cpython/blob/3.7/Lib/subprocess.py#L1592] 
> locks when waiting. Py3 added locks at other places in Popen() as well; any 
> of the places left unlocked in py2 may contribute to the problem. We can add 
> a lock when calling Popen() to prevent the deadlock.
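A hedged sketch of the proposed workaround (illustrative, not the actual patch): serialize all `Popen()` calls behind a module-level lock so that Python 2's non-thread-safe Popen internals are never entered concurrently. Python 3's subprocess already does equivalent locking internally, which matches the py2/py3 difference noted above.

```python
import subprocess
import sys
import threading

# One lock shared by all callers in the process.
_popen_lock = threading.Lock()

def locked_popen(*args, **kwargs):
    # Only the creation of the child is serialized; waiting on it is not.
    with _popen_lock:
        return subprocess.Popen(*args, **kwargs)

proc = locked_popen([sys.executable, '-c', 'print("ok")'],
                    stdout=subprocess.PIPE)
out, _ = proc.communicate()
```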





[jira] [Work logged] (BEAM-5878) Support DoFns with Keyword-only arguments in Python 3.

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-5878?focusedWorklogId=288340&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288340
 ]

ASF GitHub Bot logged work on BEAM-5878:


Author: ASF GitHub Bot
Created on: 03/Aug/19 00:28
Start Date: 03/Aug/19 00:28
Worklog Time Spent: 10m 
  Work Description: aaltay commented on issue #9237: [BEAM-5878] support 
DoFns with Keyword-only arguments
URL: https://github.com/apache/beam/pull/9237#issuecomment-517878579
 
 
   R: @tvalentyn 
 



Issue Time Tracking
---

Worklog Id: (was: 288340)
Time Spent: 5h 20m  (was: 5h 10m)

> Support DoFns with Keyword-only arguments in Python 3.
> --
>
> Key: BEAM-5878
> URL: https://issues.apache.org/jira/browse/BEAM-5878
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: yoshiki obata
>Priority: Minor
> Fix For: 2.16.0
>
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> Python 3.0 [adds a possibility|https://www.python.org/dev/peps/pep-3102/] to 
> define functions with keyword-only arguments. 
> Currently Beam does not handle them correctly. [~ruoyu] pointed out [one 
> place|https://github.com/apache/beam/blob/a56ce43109c97c739fa08adca45528c41e3c925c/sdks/python/apache_beam/typehints/decorators.py#L118]
>  in our codebase that we should fix: in Python 3, inspect.getargspec() 
> will fail on functions with keyword-only arguments, but a newer method, 
> [inspect.getfullargspec()|https://docs.python.org/3/library/inspect.html#inspect.getfullargspec],
>  supports them.
> There may be implications for our (best-effort) type-hints machinery.
> We should also add Py3-only unit tests that cover DoFns with keyword-only 
> arguments once the Beam Python 3 tests are in good shape.
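The difference described above can be shown directly (the `process` signature below is an illustrative DoFn-style method, not Beam code): `inspect.getfullargspec()` understands keyword-only arguments, while the legacy `inspect.getargspec()` raised on them (and was removed entirely in Python 3.11).

```python
import inspect

# A DoFn-style process method using PEP 3102 keyword-only arguments.
def process(element, *, window=None, timestamp=None):
    return element

# getfullargspec reports keyword-only arguments separately from
# positional ones, so introspection-based machinery can handle them.
spec = inspect.getfullargspec(process)
```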





[jira] [Work logged] (BEAM-7678) typehints with_output_types annotation doesn't work for stateful DoFn

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7678?focusedWorklogId=288339&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288339
 ]

ASF GitHub Bot logged work on BEAM-7678:


Author: ASF GitHub Bot
Created on: 03/Aug/19 00:21
Start Date: 03/Aug/19 00:21
Worklog Time Spent: 10m 
  Work Description: ecanzonieri commented on pull request #9238: 
[BEAM-7678] Fixes bug in output element_type generation in Kv PipelineVisitor
URL: https://github.com/apache/beam/pull/9238
 
 
   This PR tries to address BEAM-7678. When we use typehints on the output, 
the `element_type` is already defined, so in theory there is no need to try 
to infer it. Some types cannot be inferred by the function 
`infer_output_type`; in these cases the typehint should be used to bypass 
the inferred type.
   
   I'm not sure how to test this code. I can reproduce the issue in a manual 
test, using as the type a class that, for some reason, `infer_output_type` 
is not able to infer.
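The rule the PR describes can be sketched as follows (the helper name is invented and this is not Beam's actual visitor code): a declared output typehint wins, and type inference is only the fallback.

```python
# Hypothetical resolution rule: prefer the declared hint over inference.
def resolve_output_type(declared_hint, inferred_type):
    return declared_hint if declared_hint is not None else inferred_type

class OpaqueType(object):
    """A type that an inference function might fail to recover."""

# With a hint declared, the (possibly wrong) inferred type is bypassed;
# without one, the inferred type is used as before.
chosen = resolve_output_type(OpaqueType, object)
fallback = resolve_output_type(None, int)
```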
   
   
   
   Thank you for your contribution! Follow this checklist to help us 
incorporate your contribution quickly and easily:
   
- [x] [**Choose 
reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and 
mention them in a comment (`R: @username`).
- [x] Format the pull request title like `[BEAM-XXX] Fixes bug in 
ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA 
issue, if applicable. This will automatically link the pull request to the 
issue.
- [ ] If this contribution is large, please file an Apache [Individual 
Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
   
   Post-Commit Tests Status (on master branch)
   

   
   Lang | SDK | Apex | Dataflow | Flink | Gearpump | Samza | Spark
   --- | --- | --- | --- | --- | --- | --- | ---
   Go | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Go/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Go/lastCompletedBuild/)
 | --- | --- | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Go_VR_Flink/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Go_VR_Flink/lastCompletedBuild/)
 | --- | --- | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Go_VR_Spark/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Go_VR_Spark/lastCompletedBuild/)
   Java | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Apex/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Apex/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink/lastCompletedBuild/)[![Build
 
Status](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Batch/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Batch/lastCompletedBuild/)[![Build
 
Status](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Streaming/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Streaming/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Gearpump/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Gearpump/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Samza/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Samza/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/lastCompletedBuild/)[![Build
 
Status](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Spark_Batch/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Spark_Batch/lastCompletedBuild/)

[jira] [Work logged] (BEAM-5878) Support DoFns with Keyword-only arguments in Python 3.

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-5878?focusedWorklogId=288338=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288338
 ]

ASF GitHub Bot logged work on BEAM-5878:


Author: ASF GitHub Bot
Created on: 03/Aug/19 00:19
Start Date: 03/Aug/19 00:19
Worklog Time Spent: 10m 
  Work Description: lazylynx commented on pull request #9237: [BEAM-5878] 
support DoFns with Keyword-only arguments
URL: https://github.com/apache/beam/pull/9237
 
 
   Support DoFns with keyword-only arguments in Python 3, and re-add the test 
reverted in #8750 in a form that causes no syntax error in Python 2.
   
   Test conditions were fixed as follows due to errors:
   
   # test_side_input_keyword_only_args
   
   ```
   result2 = pcol | 'compute2' >> beam.FlatMap(
       sort_with_side_inputs,
       beam.pvalue.AsIter(side))
   ```
   to
   ```
   result2 = pcol | 'compute2' >> beam.FlatMap(
       sort_with_side_inputs,
       beam.pvalue.AsList(side))
   ```
   
   # test_combine_keyword_only_args
   
   ```
   assert_that(result2, equal_to([23]), label='assert2')
   ```
   to
   ```
   assert_that(result2, equal_to([49]), label='assert2')
   ```
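The Python 3 feature at issue can be seen with the standard `inspect` module: keyword-only parameters are reported in fields separate from positional ones, which is why argspec-based DoFn wrappers written for Python 2 miss them. The function below is an illustrative stand-in, not SDK code.

```python
import inspect

def pardo_fn(element, *, side=None, reverse=False):
    """A DoFn-style callable with keyword-only parameters (Py3 syntax)."""
    items = sorted(element, reverse=reverse)
    return items if side is None else items + sorted(side)

spec = inspect.getfullargspec(pardo_fn)
# Positional and keyword-only parameters live in separate fields;
# code that only inspects spec.args never sees 'side' or 'reverse'.
assert spec.args == ['element']
assert spec.kwonlyargs == ['side', 'reverse']
assert spec.kwonlydefaults == {'side': None, 'reverse': False}
```

Any SDK code that binds side inputs by walking `spec.args` therefore needs to consult `spec.kwonlyargs` as well on Python 3.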
   
   
   Thank you for your contribution! Follow this checklist to help us 
incorporate your contribution quickly and easily:
   
- [ ] [**Choose 
reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and 
mention them in a comment (`R: @username`).
- [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in 
ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA 
issue, if applicable. This will automatically link the pull request to the 
issue.
- [ ] If this contribution is large, please file an Apache [Individual 
Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
   
[jira] [Work logged] (BEAM-7667) report GCS throttling time to Dataflow autoscaler

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7667?focusedWorklogId=288334=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288334
 ]

ASF GitHub Bot logged work on BEAM-7667:


Author: ASF GitHub Bot
Created on: 02/Aug/19 23:57
Start Date: 02/Aug/19 23:57
Worklog Time Spent: 10m 
  Work Description: chamikaramj commented on pull request #8973: 
[BEAM-7667] report GCS throttling time to Dataflow autoscaler
URL: https://github.com/apache/beam/pull/8973#discussion_r310327778
 
 

 ##
 File path: 
sdks/python/apache_beam/io/gcp/internal/clients/storage/storage_v1_client.py
 ##
 @@ -20,10 +20,17 @@
 
 from __future__ import absolute_import
 
+import logging
+import time
 
 Review comment:
   I believe this class is auto-generated, so this change might get reverted 
the next time we update the client.
   
   We should at least add a test that prevents this, and move most of the 
changes here to another file (probably everything other than `retry_func=xxx`).
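One way to follow this suggestion is to keep the retry/throttle bookkeeping in a separate module and hand only a `retry_func` hook to the generated client. This is an illustrative sketch only: the class, method, and exception names here are hypothetical, not Beam's or the storage client's actual API.

```python
import functools
import time

class ThrottlingTracker:
    """Accumulates time spent backing off on throttled calls, so it
    can later be reported as a metric (hypothetical helper)."""

    def __init__(self):
        self.throttled_secs = 0.0

    def retry_func(self, fn, max_attempts=3, base_delay=0.01):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return fn(*args, **kwargs)
                except RuntimeError:  # stand-in for an HTTP 429 error
                    if attempt == max_attempts - 1:
                        raise
                    delay = base_delay * (2 ** attempt)
                    self.throttled_secs += delay  # record throttled time
                    time.sleep(delay)
        return wrapper

tracker = ThrottlingTracker()
calls = {'n': 0}

@tracker.retry_func
def flaky():
    """Fails twice with a 'throttled' error, then succeeds."""
    calls['n'] += 1
    if calls['n'] < 3:
        raise RuntimeError('throttled')
    return 'ok'

assert flaky() == 'ok'
assert calls['n'] == 3
assert tracker.throttled_secs > 0
```

Because only the wrapper function is passed into the generated file, regenerating the client would leave the bookkeeping code untouched.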
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 288334)
Time Spent: 1.5h  (was: 1h 20m)

> report GCS throttling time to Dataflow autoscaler
> -
>
> Key: BEAM-7667
> URL: https://issues.apache.org/jira/browse/BEAM-7667
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-dataflow
>Reporter: Heejong Lee
>Assignee: Heejong Lee
>Priority: Major
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> report GCS throttling time to Dataflow autoscaler.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Comment Edited] (BEAM-7860) v1new ReadFromDatastore returns duplicates if keys are of mixed types

2019-08-02 Thread Udi Meiri (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-7860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899271#comment-16899271
 ] 

Udi Meiri edited comment on BEAM-7860 at 8/2/19 11:51 PM:
--

Verified that this happens in both v1 and v1new.

Edit: not verified in v1 after all (I thought I could reproduce in a unit test, 
but I had a bug).
I did recreate the issue in v1new using the sample code.


was (Author: udim):
Verified that this happens in both v1 and v1new.

> v1new ReadFromDatastore returns duplicates if keys are of mixed types
> -
>
> Key: BEAM-7860
> URL: https://issues.apache.org/jira/browse/BEAM-7860
> Project: Beam
>  Issue Type: Bug
>  Components: io-python-gcp
>Affects Versions: 2.13.0
> Environment: Python 2.7
> Python 3.7
>Reporter: Niels Stender
>Assignee: Udi Meiri
>Priority: Major
>
> In the presence of mixed type keys, v1new ReadFromDatastore may return 
> duplicate items. The attached example returns 4 records, not the expected 3.
>  
> {code:python}
> from __future__ import unicode_literals
> import apache_beam as beam
> from apache_beam.io.gcp.datastore.v1new.types import Key, Entity, Query
> from apache_beam.io.gcp.datastore.v1new import datastoreio
>
> config = dict(project='your-google-project', namespace='test')
>
> def test_mixed():
>     keys = [
>         Key(['mixed', '10038260-iperm_eservice'], **config),
>         Key(['mixed', 4812224868188160], **config),
>         Key(['mixed', '99152975-pointshop'], **config)
>     ]
>     entities = map(lambda key: Entity(key=key), keys)
>     with beam.Pipeline() as p:
>         (p
>          | beam.Create(entities)
>          | datastoreio.WriteToDatastore(project=config['project'])
>         )
>
>     query = Query(kind='mixed', **config)
>     with beam.Pipeline() as p:
>         (p
>          | datastoreio.ReadFromDatastore(query=query, num_splits=4)
>          | beam.io.WriteToText('tmp.txt', num_shards=1,
>                                shard_name_template='')
>         )
>
>     items = open('tmp.txt').read().strip().split('\n')
>     assert len(items) == 3, 'incorrect number of items'
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7877) Change the log level when deleting unknown temporary files in FileBasedSink

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7877?focusedWorklogId=288333=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288333
 ]

ASF GitHub Bot logged work on BEAM-7877:


Author: ASF GitHub Bot
Created on: 02/Aug/19 23:51
Start Date: 02/Aug/19 23:51
Worklog Time Spent: 10m 
  Work Description: aaltay commented on pull request #9227: [BEAM-7877] 
Change the log level when deleting unknown temporary files
URL: https://github.com/apache/beam/pull/9227
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 288333)
Time Spent: 1h  (was: 50m)

> Change the log level when deleting unknown temporary files in FileBasedSink
> ---
>
> Key: BEAM-7877
> URL: https://issues.apache.org/jira/browse/BEAM-7877
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-files
>Reporter: Heejong Lee
>Assignee: Heejong Lee
>Priority: Major
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Currently the log level is INFO. The proposed new level is WARNING, since 
> deleting unknown temporary files is a bad sign and sometimes leads to data 
> loss.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7744) LTS backport: Temporary directory for WriteOperation may not be unique in FileBaseSink

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7744?focusedWorklogId=288330=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288330
 ]

ASF GitHub Bot logged work on BEAM-7744:


Author: ASF GitHub Bot
Created on: 02/Aug/19 23:50
Start Date: 02/Aug/19 23:50
Worklog Time Spent: 10m 
  Work Description: aaltay commented on issue #9071: [BEAM-7744] LTS 
backport: Temporary directory for WriteOperation may not be unique in 
FileBaseSink
URL: https://github.com/apache/beam/pull/9071#issuecomment-517874112
 
 
   I am guessing the other error, for the JavaPortabilityApi test, is fine 
because the error is "14:13:05 Task 'javaPreCommitPortabilityApi' not found in 
root project 'beam'." It does not look like the test exists in this branch.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 288330)
Time Spent: 3h  (was: 2h 50m)

> LTS backport: Temporary directory for WriteOperation may not be unique in 
> FileBaseSink
> --
>
> Key: BEAM-7744
> URL: https://issues.apache.org/jira/browse/BEAM-7744
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-files
>Reporter: Heejong Lee
>Assignee: Heejong Lee
>Priority: Major
> Fix For: 2.7.1
>
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> Tracking BEAM-7689 LTS backport for 2.7.1 release



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7744) LTS backport: Temporary directory for WriteOperation may not be unique in FileBaseSink

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7744?focusedWorklogId=288332=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288332
 ]

ASF GitHub Bot logged work on BEAM-7744:


Author: ASF GitHub Bot
Created on: 02/Aug/19 23:50
Start Date: 02/Aug/19 23:50
Worklog Time Spent: 10m 
  Work Description: aaltay commented on pull request #9071: [BEAM-7744] LTS 
backport: Temporary directory for WriteOperation may not be unique in 
FileBaseSink
URL: https://github.com/apache/beam/pull/9071
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 288332)
Time Spent: 3h 10m  (was: 3h)

> LTS backport: Temporary directory for WriteOperation may not be unique in 
> FileBaseSink
> --
>
> Key: BEAM-7744
> URL: https://issues.apache.org/jira/browse/BEAM-7744
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-files
>Reporter: Heejong Lee
>Assignee: Heejong Lee
>Priority: Major
> Fix For: 2.7.1
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> Tracking BEAM-7689 LTS backport for 2.7.1 release



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (BEAM-7860) v1new ReadFromDatastore returns duplicates if keys are of mixed types

2019-08-02 Thread Udi Meiri (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-7860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899273#comment-16899273
 ] 

Udi Meiri commented on BEAM-7860:
-

It should be possible to work around this by setting num_splits=1 in 
ReadFromDatastore. You'll get a SplitNotPossibleError and performance will 
suffer (depending on the size of the result), but it'll be correct.
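Until the bug is fixed, deduplicating downstream by entity key also restores correctness. Below is a plain-Python sketch of the idea only — a real pipeline would do this in a keyed transform; this is not Beam API code.

```python
def dedup_by_key(records):
    """Keep the first value seen for each key, preserving input order.

    `records` is an iterable of (key, value) pairs, standing in for
    (entity key, entity) pairs read from Datastore.
    """
    seen = {}
    for key, value in records:
        seen.setdefault(key, value)  # first occurrence wins
    return list(seen.items())

# Duplicates as produced by overlapping splits:
records = [('a', 1), ('b', 2), ('a', 1), ('c', 3)]
assert dedup_by_key(records) == [('a', 1), ('b', 2), ('c', 3)]
```

Unlike `num_splits=1`, this keeps parallel reads but pays a shuffle to group by key.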

> v1new ReadFromDatastore returns duplicates if keys are of mixed types
> -
>
> Key: BEAM-7860
> URL: https://issues.apache.org/jira/browse/BEAM-7860
> Project: Beam
>  Issue Type: Bug
>  Components: io-python-gcp
>Affects Versions: 2.13.0
> Environment: Python 2.7
> Python 3.7
>Reporter: Niels Stender
>Assignee: Udi Meiri
>Priority: Major
>
> In the presence of mixed type keys, v1new ReadFromDatastore may return 
> duplicate items. The attached example returns 4 records, not the expected 3.
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Comment Edited] (BEAM-7860) v1new ReadFromDatastore returns duplicates if keys are of mixed types

2019-08-02 Thread Udi Meiri (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-7860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899273#comment-16899273
 ] 

Udi Meiri edited comment on BEAM-7860 at 8/2/19 11:18 PM:
--

It should be possible to work around this by setting num_splits=1 in 
ReadFromDatastore. You'll get a SplitNotPossibleError and performance will 
suffer (depending on the size of the result), but it'll be correct.


was (Author: udim):
I should be possible to work around this by setting num_splits=1 in 
ReadFromDatastore. You'll get a SplitNotPossibleError and performance will 
suffer (depending on the size of the result), but it'll be correct.

> v1new ReadFromDatastore returns duplicates if keys are of mixed types
> -
>
> Key: BEAM-7860
> URL: https://issues.apache.org/jira/browse/BEAM-7860
> Project: Beam
>  Issue Type: Bug
>  Components: io-python-gcp
>Affects Versions: 2.13.0
> Environment: Python 2.7
> Python 3.7
>Reporter: Niels Stender
>Assignee: Udi Meiri
>Priority: Major
>
> In the presence of mixed type keys, v1new ReadFromDatastore may return 
> duplicate items. The attached example returns 4 records, not the expected 3.
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (BEAM-7860) v1new ReadFromDatastore returns duplicates if keys are of mixed types

2019-08-02 Thread Udi Meiri (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-7860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899271#comment-16899271
 ] 

Udi Meiri commented on BEAM-7860:
-

Verified that this happens in both v1 and v1new.

> v1new ReadFromDatastore returns duplicates if keys are of mixed types
> -
>
> Key: BEAM-7860
> URL: https://issues.apache.org/jira/browse/BEAM-7860
> Project: Beam
>  Issue Type: Bug
>  Components: io-python-gcp
>Affects Versions: 2.13.0
> Environment: Python 2.7
> Python 3.7
>Reporter: Niels Stender
>Assignee: Udi Meiri
>Priority: Major
>
> In the presence of mixed type keys, v1new ReadFromDatastore may return 
> duplicate items. The attached example returns 4 records, not the expected 3.
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (BEAM-7887) Release automation improvement

2019-08-02 Thread Mark Liu (JIRA)
Mark Liu created BEAM-7887:
--

 Summary: Release automation improvement
 Key: BEAM-7887
 URL: https://issues.apache.org/jira/browse/BEAM-7887
 Project: Beam
  Issue Type: Improvement
  Components: build-system, testing, website
Reporter: Mark Liu


This is an epic JIRA to track improvement work for Beam release automation.

Main goals:
- Address pain points across the full release cycle with automation.
- Make existing automation tools more user-friendly.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-5820) Vendor Calcite

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-5820?focusedWorklogId=288319=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288319
 ]

ASF GitHub Bot logged work on BEAM-5820:


Author: ASF GitHub Bot
Created on: 02/Aug/19 23:03
Start Date: 02/Aug/19 23:03
Worklog Time Spent: 10m 
  Work Description: apilloud commented on pull request #9189: [BEAM-5820] 
[PoC] vendor calcite
URL: https://github.com/apache/beam/pull/9189#discussion_r310321188
 
 

 ##
 File path: vendor/calcite-1_19_0/build.gradle
 ##
 @@ -0,0 +1,63 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * License); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+plugins { id 'org.apache.beam.vendor-java' }
+
+description = "Apache Beam :: Vendored Dependencies :: Calcite 1.19.0"
+
+group = "org.apache.beam"
+version = "0.1"
+
+def calcite_version = "1.19.0"
+def avatica_version = "1.15.0"
+def prefix = "org.apache.beam.vendor.calcite.v1_19_0"
+
+vendorJava(
+  dependencies: [
+"org.apache.calcite:calcite-core:$calcite_version",
+"org.apache.calcite:calcite-linq4j:$calcite_version",
+"org.apache.calcite.avatica:avatica-core:$avatica_version",
+  ],
+  relocations: [
+"org.apache.calcite": "${prefix}.org.apache.calcite",
+
+// Calcite has Guava on its API surface
+"com.google.common": 
"org.apache.beam.vendor.guava.v26_0_jre.com.google.thirdparty",
 
 Review comment:
   I don't understand this. Why are we moving "com.google.common" to 
"com.google.thirdparty"? This does not match guava: 
https://github.com/apache/beam/blob/f7cbf88f550c8918b99a13af4182d6efa07cd2b5/vendor/guava-26_0-jre/build.gradle#L29
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 288319)
Time Spent: 20m  (was: 10m)

> Vendor Calcite
> --
>
> Key: BEAM-5820
> URL: https://issues.apache.org/jira/browse/BEAM-5820
> Project: Beam
>  Issue Type: Sub-task
>  Components: dsl-sql
>Reporter: Kenneth Knowles
>Assignee: Kai Jiang
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7060) Design Py3-compatible typehints annotation support in Beam 3.

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7060?focusedWorklogId=288318=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288318
 ]

ASF GitHub Bot logged work on BEAM-7060:


Author: ASF GitHub Bot
Created on: 02/Aug/19 23:01
Start Date: 02/Aug/19 23:01
Worklog Time Spent: 10m 
  Work Description: udim commented on pull request #9223: [BEAM-7060] 
Introduce Python3-only test modules
URL: https://github.com/apache/beam/pull/9223
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 288318)
Time Spent: 8h 20m  (was: 8h 10m)

> Design Py3-compatible typehints annotation support in Beam 3.
> -
>
> Key: BEAM-7060
> URL: https://issues.apache.org/jira/browse/BEAM-7060
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Udi Meiri
>Priority: Major
>  Time Spent: 8h 20m
>  Remaining Estimate: 0h
>
> Existing [Typehints implementation in 
> Beam|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/typehints/]
>  heavily relies on internal details of the CPython implementation, and some of 
> the assumptions of that implementation broke as of Python 3.6 (see, for 
> example, https://issues.apache.org/jira/browse/BEAM-6877), which makes 
> typehints support unusable on Python 3.6 as of now. The [Python 3 Kanban 
> Board|https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=245=detail]
>  lists several specific typehints-related breakages, prefixed with "TypeHints 
> Py3 Error".
> We need to decide whether to:
> - Deprecate in-house typehints implementation.
> - Continue to support in-house implementation, which at this point is a stale 
> code and has other known issues.
> - Attempt to use some off-the-shelf libraries for supporting 
> type-annotations, like  Pytype, Mypy, PyAnnotate.
> WRT to this decision we also need to plan on immediate next steps to unblock 
> adoption of Beam for  Python 3.6+ users. One potential option may be to have 
> Beam SDK ignore any typehint annotations on Py 3.6+.
> cc: [~udim], [~altay], [~robertwb].



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7832) ZetaSQL Dialect

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7832?focusedWorklogId=288314=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288314
 ]

ASF GitHub Bot logged work on BEAM-7832:


Author: ASF GitHub Bot
Created on: 02/Aug/19 22:57
Start Date: 02/Aug/19 22:57
Worklog Time Spent: 10m 
  Work Description: apilloud commented on issue #9210: [BEAM-7832] Add 
ZetaSQL as a dialect in BeamSQL
URL: https://github.com/apache/beam/pull/9210#issuecomment-517866543
 
 
   The grpc issue might be a ZetaSQL bug; I'll work with you to get an isolated 
repro on Monday. As for moving this to a separate package, I would consider 
that a blocker to merging this. ZetaSQL currently only works on a limited 
subset of Linux machines, and we don't want to break other users and 
developers. I see #9189 is waiting for my review, so I'm going to do that now 
to help move this forward.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 288314)
Time Spent: 3h 10m  (was: 3h)

> ZetaSQL Dialect
> ---
>
> Key: BEAM-7832
> URL: https://issues.apache.org/jira/browse/BEAM-7832
> Project: Beam
>  Issue Type: New Feature
>  Components: dsl-sql
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> We can support the ZetaSQL (https://github.com/google/zetasql) dialect in BeamSQL.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7060) Design Py3-compatible typehints annotation support in Beam 3.

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7060?focusedWorklogId=288313=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288313
 ]

ASF GitHub Bot logged work on BEAM-7060:


Author: ASF GitHub Bot
Created on: 02/Aug/19 22:55
Start Date: 02/Aug/19 22:55
Worklog Time Spent: 10m 
  Work Description: chamikaramj commented on issue #9223: [BEAM-7060] 
Introduce Python3-only test modules
URL: https://github.com/apache/beam/pull/9223#issuecomment-517866303
 
 
   LGTM. Thanks.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 288313)
Time Spent: 8h 10m  (was: 8h)

> Design Py3-compatible typehints annotation support in Beam 3.
> -
>
> Key: BEAM-7060
> URL: https://issues.apache.org/jira/browse/BEAM-7060
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Udi Meiri
>Priority: Major
>  Time Spent: 8h 10m
>  Remaining Estimate: 0h
>
> Existing [Typehints implementation in 
> Beam|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/typehints/]
>  heavily relies on internal details of the CPython implementation, and some of 
> the assumptions of that implementation broke as of Python 3.6 (see, for 
> example, https://issues.apache.org/jira/browse/BEAM-6877), which makes 
> typehints support unusable on Python 3.6 as of now. The [Python 3 Kanban 
> Board|https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=245=detail]
>  lists several specific typehints-related breakages, prefixed with "TypeHints 
> Py3 Error".
> We need to decide whether to:
> - Deprecate in-house typehints implementation.
> - Continue to support in-house implementation, which at this point is a stale 
> code and has other known issues.
> - Attempt to use some off-the-shelf libraries for supporting 
> type-annotations, like  Pytype, Mypy, PyAnnotate.
> WRT to this decision we also need to plan on immediate next steps to unblock 
> adoption of Beam for  Python 3.6+ users. One potential option may be to have 
> Beam SDK ignore any typehint annotations on Py 3.6+.
> cc: [~udim], [~altay], [~robertwb].



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7060) Design Py3-compatible typehints annotation support in Beam 3.

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7060?focusedWorklogId=288311&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288311
 ]

ASF GitHub Bot logged work on BEAM-7060:


Author: ASF GitHub Bot
Created on: 02/Aug/19 22:51
Start Date: 02/Aug/19 22:51
Worklog Time Spent: 10m 
  Work Description: udim commented on issue #9223: [BEAM-7060] Introduce 
Python3-only test modules
URL: https://github.com/apache/beam/pull/9223#issuecomment-517865810
 
 
   Error in postcommit is a transient pip issue
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 288311)
Time Spent: 8h  (was: 7h 50m)

> Design Py3-compatible typehints annotation support in Beam 3.
> -
>
> Key: BEAM-7060
> URL: https://issues.apache.org/jira/browse/BEAM-7060
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Udi Meiri
>Priority: Major
>  Time Spent: 8h
>  Remaining Estimate: 0h
>
> Existing [Typehints implementation in 
> Beam|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/typehints/] 
> heavily relies on internal details of the CPython implementation, and some of 
> the assumptions of this implementation broke as of Python 3.6; see, for 
> example, https://issues.apache.org/jira/browse/BEAM-6877, which makes 
> typehints support unusable on Python 3.6 as of now. The [Python 3 Kanban 
> Board|https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=245&view=detail] 
> lists several specific typehints-related breakages, prefixed with "TypeHints 
> Py3 Error".
> We need to decide whether to:
> - Deprecate the in-house typehints implementation.
> - Continue to support the in-house implementation, which at this point is 
> stale code and has other known issues.
> - Attempt to use some off-the-shelf libraries for supporting 
> type annotations, such as Pytype, Mypy, or PyAnnotate.
> With respect to this decision, we also need to plan immediate next steps to 
> unblock adoption of Beam for Python 3.6+ users. One potential option may be 
> to have the Beam SDK ignore any typehint annotations on Py 3.6+.
> cc: [~udim], [~altay], [~robertwb].



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7744) LTS backport: Temporary directory for WriteOperation may not be unique in FileBaseSink

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7744?focusedWorklogId=288305&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288305
 ]

ASF GitHub Bot logged work on BEAM-7744:


Author: ASF GitHub Bot
Created on: 02/Aug/19 22:40
Start Date: 02/Aug/19 22:40
Worklog Time Spent: 10m 
  Work Description: ihji commented on issue #9071: [BEAM-7744] LTS 
backport: Temporary directory for WriteOperation may not be unique in 
FileBaseSink
URL: https://github.com/apache/beam/pull/9071#issuecomment-517863870
 
 
   @kennknowles @aaltay Java precommit passed. PTAL. Thanks!
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 288305)
Time Spent: 2h 50m  (was: 2h 40m)

> LTS backport: Temporary directory for WriteOperation may not be unique in 
> FileBaseSink
> --
>
> Key: BEAM-7744
> URL: https://issues.apache.org/jira/browse/BEAM-7744
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-files
>Reporter: Heejong Lee
>Assignee: Heejong Lee
>Priority: Major
> Fix For: 2.7.1
>
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> Tracking BEAM-7689 LTS backport for 2.7.1 release



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7877) Change the log level when deleting unknown temporary files in FileBasedSink

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7877?focusedWorklogId=288304&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288304
 ]

ASF GitHub Bot logged work on BEAM-7877:


Author: ASF GitHub Bot
Created on: 02/Aug/19 22:38
Start Date: 02/Aug/19 22:38
Worklog Time Spent: 10m 
  Work Description: ihji commented on issue #9227: [BEAM-7877] Change the 
log level when deleting unknown temporary files
URL: https://github.com/apache/beam/pull/9227#issuecomment-517863575
 
 
   run java precommit
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 288304)
Time Spent: 50m  (was: 40m)

> Change the log level when deleting unknown temporary files in FileBasedSink
> ---
>
> Key: BEAM-7877
> URL: https://issues.apache.org/jira/browse/BEAM-7877
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-files
>Reporter: Heejong Lee
>Assignee: Heejong Lee
>Priority: Major
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Currently the log level is info. The proposed new log level is warning, 
> since deleting unknown temporary files is a bad sign and sometimes leads to 
> data loss.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7832) ZetaSQL Dialect

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7832?focusedWorklogId=288303&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288303
 ]

ASF GitHub Bot logged work on BEAM-7832:


Author: ASF GitHub Bot
Created on: 02/Aug/19 22:30
Start Date: 02/Aug/19 22:30
Worklog Time Spent: 10m 
  Work Description: amaliujia commented on issue #9210: [BEAM-7832] Add 
ZetaSQL as a dialect in BeamSQL
URL: https://github.com/apache/beam/pull/9210#issuecomment-517862226
 
 
   During the process of moving the ZetaSQL planner to another package, I also 
ran into the missing gRPC dependency. 
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 288303)
Time Spent: 3h  (was: 2h 50m)

> ZetaSQL Dialect
> ---
>
> Key: BEAM-7832
> URL: https://issues.apache.org/jira/browse/BEAM-7832
> Project: Beam
>  Issue Type: New Feature
>  Components: dsl-sql
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> We can support the ZetaSQL (https://github.com/google/zetasql) dialect in BeamSQL. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (BEAM-7886) Make row coder a standard coder and implement in python

2019-08-02 Thread Brian Hulette (JIRA)
Brian Hulette created BEAM-7886:
---

 Summary: Make row coder a standard coder and implement in python
 Key: BEAM-7886
 URL: https://issues.apache.org/jira/browse/BEAM-7886
 Project: Beam
  Issue Type: Improvement
  Components: sdk-java-core, sdk-py-core
Reporter: Brian Hulette
Assignee: Brian Hulette






--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (BEAM-7886) Make row coder a standard coder and implement in python

2019-08-02 Thread Brian Hulette (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brian Hulette updated BEAM-7886:

Component/s: beam-model

> Make row coder a standard coder and implement in python
> ---
>
> Key: BEAM-7886
> URL: https://issues.apache.org/jira/browse/BEAM-7886
> Project: Beam
>  Issue Type: Improvement
>  Components: beam-model, sdk-java-core, sdk-py-core
>Reporter: Brian Hulette
>Assignee: Brian Hulette
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-6674) The JdbcIO source should produce schemas

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-6674?focusedWorklogId=288298&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288298
 ]

ASF GitHub Bot logged work on BEAM-6674:


Author: ASF GitHub Bot
Created on: 02/Aug/19 21:58
Start Date: 02/Aug/19 21:58
Worklog Time Spent: 10m 
  Work Description: terekete commented on issue #8725: [BEAM-6674] Add 
schema support to JdbcIO read
URL: https://github.com/apache/beam/pull/8725#issuecomment-517855951
 
 
   Hey all - I was wondering if there is an example or docs on how to use the 
new implementation to read rows using the Row Coder and Row Mapper and 
auto-infer the Beam schema?
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 288298)
Time Spent: 4h 10m  (was: 4h)

> The JdbcIO source should produce schemas
> 
>
> Key: BEAM-6674
> URL: https://issues.apache.org/jira/browse/BEAM-6674
> Project: Beam
>  Issue Type: Sub-task
>  Components: io-java-jdbc
>Reporter: Reuven Lax
>Assignee: Charith Ellawala
>Priority: Major
> Fix For: 2.14.0
>
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (BEAM-7788) Go modules breaking precommits?

2019-08-02 Thread Udi Meiri (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udi Meiri resolved BEAM-7788.
-
   Resolution: Fixed
Fix Version/s: Not applicable

> Go modules breaking precommits?
> ---
>
> Key: BEAM-7788
> URL: https://issues.apache.org/jira/browse/BEAM-7788
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-go, testing
>Reporter: Udi Meiri
>Priority: Major
> Fix For: Not applicable
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Lots of errors like:
> {code}
> 01:32:36  > git clean -fdx # timeout=10
> 01:32:38 FATAL: Command "git clean -fdx" returned status code 1:
> 01:32:38 stdout: 
> 01:32:38 stderr: warning: failed to remove 
> sdks/go/.gogradle/project_gopath/pkg/mod/go.opencensus.io@v0.22.0/zpages/example_test.go
> ... [lots of files listed]
> {code}
> Examples:
> https://builds.apache.org/job/beam_PreCommit_Python_PVR_Flink_Commit/3886/consoleFull
> https://builds.apache.org/job/beam_PreCommit_Python_PVR_Flink_Commit/3884/consoleFull
> https://builds.apache.org/job/beam_PreCommit_JavaPortabilityApi_Commit/3790/consoleFull
> https://builds.apache.org/job/beam_PreCommit_JavaPortabilityApi_Commit/3788/consoleFull
> cc: [~lostluck]
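The underlying failure mode is that Go writes module cache files read-only, so
a plain `git clean -fdx` cannot delete them. A minimal sketch of one common
workaround (restore write permission before cleaning); the `/tmp/fake_gopath`
path is a hypothetical stand-in for the Jenkins workspace's
`sdks/go/.gogradle/project_gopath/pkg/mod` tree:

```shell
#!/bin/sh
set -e

# Simulate Go's read-only module cache: build a tree and strip write
# permission from it, as `go mod download` does for extracted modules.
DIR=/tmp/fake_gopath                        # hypothetical stand-in path
mkdir -p "$DIR/pkg/mod/example.com"
touch "$DIR/pkg/mod/example.com/example_test.go"
chmod -R a-w "$DIR/pkg/mod"

# Workaround: restore write permission first, then the clean step
# (here `rm -rf`, on the worker `git clean -fdx`) can remove the tree
# instead of failing with "failed to remove ...".
chmod -R u+w "$DIR"
rm -rf "$DIR"
```

`go clean -modcache` is another option when the `go` tool itself is available
on the worker, since it knows how to delete its own cache.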



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7777) Stream Cost Model and RelNode Cost Estimation

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7777?focusedWorklogId=288274&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288274
 ]

ASF GitHub Bot logged work on BEAM-7777:


Author: ASF GitHub Bot
Created on: 02/Aug/19 21:29
Start Date: 02/Aug/19 21:29
Worklog Time Spent: 10m 
  Work Description: akedin commented on pull request #9217: [BEAM-7777] 
Beam cost model
URL: https://github.com/apache/beam/pull/9217#discussion_r310303783
 
 

 ##
 File path: 
sdks/java/extensions/sql/src/test/java/org/apache/beam/sdk/extensions/sql/impl/planner/BeamCostModelTest.java
 ##
 @@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.beam.sdk.extensions.sql.impl.planner;
+
+import org.junit.Assert;
+import org.junit.Test;
+
+/** Tests the behavior of BeamCostModel. */
+public class BeamCostModelTest {
+
+  // We should activate it when all the nodes are implementing the cost model.
+  //  @Test
 
 Review comment:
   What's left to implement?
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 288274)
Time Spent: 50m  (was: 40m)

> Stream Cost Model and RelNode Cost Estimation
> -
>
> Key: BEAM-7777
> URL: https://issues.apache.org/jira/browse/BEAM-7777
> Project: Beam
>  Issue Type: New Feature
>  Components: dsl-sql
>Reporter: Alireza Samadianzakaria
>Assignee: Alireza Samadianzakaria
>Priority: Major
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Currently the cost model is not suitable for streaming jobs: it uses row 
> count, and for estimating the output row count of each node it relies on 
> Calcite estimations that are only applicable to bounded data.
> We need to implement a new cost model and cost estimation for all of our 
> physical nodes.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7777) Stream Cost Model and RelNode Cost Estimation

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7777?focusedWorklogId=288273&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288273
 ]

ASF GitHub Bot logged work on BEAM-7777:


Author: ASF GitHub Bot
Created on: 02/Aug/19 21:26
Start Date: 02/Aug/19 21:26
Worklog Time Spent: 10m 
  Work Description: akedin commented on issue #9217: [BEAM-7777] Beam cost 
model
URL: https://github.com/apache/beam/pull/9217#issuecomment-517849128
 
 
   r: @kennknowles @kanterov 
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 288273)
Time Spent: 40m  (was: 0.5h)

> Stream Cost Model and RelNode Cost Estimation
> -
>
> Key: BEAM-7777
> URL: https://issues.apache.org/jira/browse/BEAM-7777
> Project: Beam
>  Issue Type: New Feature
>  Components: dsl-sql
>Reporter: Alireza Samadianzakaria
>Assignee: Alireza Samadianzakaria
>Priority: Major
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Currently the cost model is not suitable for streaming jobs: it uses row 
> count, and for estimating the output row count of each node it relies on 
> Calcite estimations that are only applicable to bounded data.
> We need to implement a new cost model and cost estimation for all of our 
> physical nodes.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7777) Stream Cost Model and RelNode Cost Estimation

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7777?focusedWorklogId=288272&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288272
 ]

ASF GitHub Bot logged work on BEAM-7777:


Author: ASF GitHub Bot
Created on: 02/Aug/19 21:26
Start Date: 02/Aug/19 21:26
Worklog Time Spent: 10m 
  Work Description: akedin commented on issue #9217: [BEAM-7777] Beam cost 
model
URL: https://github.com/apache/beam/pull/9217#issuecomment-517849012
 
 
   run sql postcommit
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 288272)
Time Spent: 0.5h  (was: 20m)

> Stream Cost Model and RelNode Cost Estimation
> -
>
> Key: BEAM-7777
> URL: https://issues.apache.org/jira/browse/BEAM-7777
> Project: Beam
>  Issue Type: New Feature
>  Components: dsl-sql
>Reporter: Alireza Samadianzakaria
>Assignee: Alireza Samadianzakaria
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Currently the cost model is not suitable for streaming jobs: it uses row 
> count, and for estimating the output row count of each node it relies on 
> Calcite estimations that are only applicable to bounded data.
> We need to implement a new cost model and cost estimation for all of our 
> physical nodes.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7060) Design Py3-compatible typehints annotation support in Beam 3.

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7060?focusedWorklogId=288270&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288270
 ]

ASF GitHub Bot logged work on BEAM-7060:


Author: ASF GitHub Bot
Created on: 02/Aug/19 21:25
Start Date: 02/Aug/19 21:25
Worklog Time Spent: 10m 
  Work Description: udim commented on issue #9223: [BEAM-7060] Introduce 
Python3-only test modules
URL: https://github.com/apache/beam/pull/9223#issuecomment-517848786
 
 
   Run Python Dataflow ValidatesContainer
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 288270)
Time Spent: 7h 50m  (was: 7h 40m)

> Design Py3-compatible typehints annotation support in Beam 3.
> -
>
> Key: BEAM-7060
> URL: https://issues.apache.org/jira/browse/BEAM-7060
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Udi Meiri
>Priority: Major
>  Time Spent: 7h 50m
>  Remaining Estimate: 0h
>
> Existing [Typehints implementation in 
> Beam|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/typehints/] 
> heavily relies on internal details of the CPython implementation, and some of 
> the assumptions of this implementation broke as of Python 3.6; see, for 
> example, https://issues.apache.org/jira/browse/BEAM-6877, which makes 
> typehints support unusable on Python 3.6 as of now. The [Python 3 Kanban 
> Board|https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=245&view=detail] 
> lists several specific typehints-related breakages, prefixed with "TypeHints 
> Py3 Error".
> We need to decide whether to:
> - Deprecate the in-house typehints implementation.
> - Continue to support the in-house implementation, which at this point is 
> stale code and has other known issues.
> - Attempt to use some off-the-shelf libraries for supporting 
> type annotations, such as Pytype, Mypy, or PyAnnotate.
> With respect to this decision, we also need to plan immediate next steps to 
> unblock adoption of Beam for Python 3.6+ users. One potential option may be 
> to have the Beam SDK ignore any typehint annotations on Py 3.6+.
> cc: [~udim], [~altay], [~robertwb].



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7744) LTS backport: Temporary directory for WriteOperation may not be unique in FileBaseSink

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7744?focusedWorklogId=288271&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288271
 ]

ASF GitHub Bot logged work on BEAM-7744:


Author: ASF GitHub Bot
Created on: 02/Aug/19 21:25
Start Date: 02/Aug/19 21:25
Worklog Time Spent: 10m 
  Work Description: ihji commented on issue #9071: [BEAM-7744] LTS 
backport: Temporary directory for WriteOperation may not be unique in 
FileBaseSink
URL: https://github.com/apache/beam/pull/9071#issuecomment-517845800
 
 
   Run JavaPortabilityApi PreCommit
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 288271)
Time Spent: 2h 40m  (was: 2.5h)

> LTS backport: Temporary directory for WriteOperation may not be unique in 
> FileBaseSink
> --
>
> Key: BEAM-7744
> URL: https://issues.apache.org/jira/browse/BEAM-7744
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-files
>Reporter: Heejong Lee
>Assignee: Heejong Lee
>Priority: Major
> Fix For: 2.7.1
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> Tracking BEAM-7689 LTS backport for 2.7.1 release



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7060) Design Py3-compatible typehints annotation support in Beam 3.

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7060?focusedWorklogId=288268&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288268
 ]

ASF GitHub Bot logged work on BEAM-7060:


Author: ASF GitHub Bot
Created on: 02/Aug/19 21:24
Start Date: 02/Aug/19 21:24
Worklog Time Spent: 10m 
  Work Description: udim commented on issue #9223: [BEAM-7060] Introduce 
Python3-only test modules
URL: https://github.com/apache/beam/pull/9223#issuecomment-517848618
 
 
   run python 2 postcommit
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 288268)
Time Spent: 7h 40m  (was: 7.5h)

> Design Py3-compatible typehints annotation support in Beam 3.
> -
>
> Key: BEAM-7060
> URL: https://issues.apache.org/jira/browse/BEAM-7060
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Udi Meiri
>Priority: Major
>  Time Spent: 7h 40m
>  Remaining Estimate: 0h
>
> Existing [Typehints implementation in 
> Beam|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/typehints/] 
> heavily relies on internal details of the CPython implementation, and some of 
> the assumptions of this implementation broke as of Python 3.6; see, for 
> example, https://issues.apache.org/jira/browse/BEAM-6877, which makes 
> typehints support unusable on Python 3.6 as of now. The [Python 3 Kanban 
> Board|https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=245&view=detail] 
> lists several specific typehints-related breakages, prefixed with "TypeHints 
> Py3 Error".
> We need to decide whether to:
> - Deprecate the in-house typehints implementation.
> - Continue to support the in-house implementation, which at this point is 
> stale code and has other known issues.
> - Attempt to use some off-the-shelf libraries for supporting 
> type annotations, such as Pytype, Mypy, or PyAnnotate.
> With respect to this decision, we also need to plan immediate next steps to 
> unblock adoption of Beam for Python 3.6+ users. One potential option may be 
> to have the Beam SDK ignore any typehint annotations on Py 3.6+.
> cc: [~udim], [~altay], [~robertwb].



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7744) LTS backport: Temporary directory for WriteOperation may not be unique in FileBaseSink

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7744?focusedWorklogId=288262&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288262
 ]

ASF GitHub Bot logged work on BEAM-7744:


Author: ASF GitHub Bot
Created on: 02/Aug/19 21:12
Start Date: 02/Aug/19 21:12
Worklog Time Spent: 10m 
  Work Description: ihji commented on issue #9071: [BEAM-7744] LTS 
backport: Temporary directory for WriteOperation may not be unique in 
FileBaseSink
URL: https://github.com/apache/beam/pull/9071#issuecomment-517845800
 
 
   Run JavaPortabilityApi PreCommit
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 288262)
Time Spent: 2.5h  (was: 2h 20m)

> LTS backport: Temporary directory for WriteOperation may not be unique in 
> FileBaseSink
> --
>
> Key: BEAM-7744
> URL: https://issues.apache.org/jira/browse/BEAM-7744
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-files
>Reporter: Heejong Lee
>Assignee: Heejong Lee
>Priority: Major
> Fix For: 2.7.1
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> Tracking BEAM-7689 LTS backport for 2.7.1 release



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7744) LTS backport: Temporary directory for WriteOperation may not be unique in FileBaseSink

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7744?focusedWorklogId=288261&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288261
 ]

ASF GitHub Bot logged work on BEAM-7744:


Author: ASF GitHub Bot
Created on: 02/Aug/19 21:12
Start Date: 02/Aug/19 21:12
Worklog Time Spent: 10m 
  Work Description: ihji commented on issue #9071: [BEAM-7744] LTS 
backport: Temporary directory for WriteOperation may not be unique in 
FileBaseSink
URL: https://github.com/apache/beam/pull/9071#issuecomment-517845765
 
 
   Run Java PreCommit
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 288261)
Time Spent: 2h 20m  (was: 2h 10m)

> LTS backport: Temporary directory for WriteOperation may not be unique in 
> FileBaseSink
> --
>
> Key: BEAM-7744
> URL: https://issues.apache.org/jira/browse/BEAM-7744
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-files
>Reporter: Heejong Lee
>Assignee: Heejong Lee
>Priority: Major
> Fix For: 2.7.1
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> Tracking BEAM-7689 LTS backport for 2.7.1 release



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-562) DoFn Reuse: Add new DoFn setup and teardown to python SDK

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-562?focusedWorklogId=288259&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288259
 ]

ASF GitHub Bot logged work on BEAM-562:
---

Author: ASF GitHub Bot
Created on: 02/Aug/19 21:11
Start Date: 02/Aug/19 21:11
Worklog Time Spent: 10m 
  Work Description: NikeNano commented on issue #7994: [BEAM-562] Add 
DoFn.setup and DoFn.teardown to Python SDK
URL: https://github.com/apache/beam/pull/7994#issuecomment-517845557
 
 
   Created an issue for DoFn.setup: 
https://issues.apache.org/jira/browse/BEAM-7885
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 288259)
Time Spent: 11.5h  (was: 11h 20m)

> DoFn Reuse: Add new DoFn setup and teardown to python SDK
> -
>
> Key: BEAM-562
> URL: https://issues.apache.org/jira/browse/BEAM-562
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-py-core
>Reporter: Ahmet Altay
>Assignee: Yifan Mai
>Priority: Major
>  Labels: sdk-consistency
> Fix For: 2.14.0
>
>  Time Spent: 11.5h
>  Remaining Estimate: 0h
>
> Java SDK added setup and teardown methods to the DoFns. This makes DoFns 
> reusable and provides performance improvements. The Python SDK should add 
> support for these new DoFn methods:
> Proposal doc: 
> https://docs.google.com/document/d/1LLQqggSePURt3XavKBGV7SZJYQ4NW8yCu63lBchzMRk/edit?ts=5771458f#



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7877) Change the log level when deleting unknown temporary files in FileBasedSink

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7877?focusedWorklogId=288258&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288258
 ]

ASF GitHub Bot logged work on BEAM-7877:


Author: ASF GitHub Bot
Created on: 02/Aug/19 21:10
Start Date: 02/Aug/19 21:10
Worklog Time Spent: 10m 
  Work Description: ihji commented on pull request #9227: [BEAM-7877] 
Change the log level when deleting unknown temporary files
URL: https://github.com/apache/beam/pull/9227#discussion_r310299085
 
 

 ##
 File path: 
sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileBasedSink.java
 ##
 @@ -791,7 +791,7 @@ final void removeTemporaryFiles(
   
FileSystems.match(Collections.singletonList(tempDir.toString() + "*")));
   for (Metadata matchResult : singleMatch.metadata()) {
 if (allMatches.add(matchResult.resourceId())) {
-  LOG.info("Will also remove unknown temporary file {}", 
matchResult.resourceId());
+  LOG.warn("Will also remove unknown temporary file {}", 
matchResult.resourceId());
 
 Review comment:
   Thanks for the comment. Updated the log message.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 288258)
Time Spent: 40m  (was: 0.5h)

> Change the log level when deleting unknown temporary files in FileBasedSink
> ---
>
> Key: BEAM-7877
> URL: https://issues.apache.org/jira/browse/BEAM-7877
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-files
>Reporter: Heejong Lee
>Assignee: Heejong Lee
>Priority: Major
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Currently the log level is info. The proposed new log level is warning, 
> since deleting unknown temporary files is a bad sign and sometimes leads to 
> data loss.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (BEAM-7885) DoFn.setup() doesn't run for streaming jobs.

2019-08-02 Thread niklas Hansson (JIRA)
niklas Hansson created BEAM-7885:


 Summary: DoFn.setup() doesn't run for streaming jobs. 
 Key: BEAM-7885
 URL: https://issues.apache.org/jira/browse/BEAM-7885
 Project: Beam
  Issue Type: Bug
  Components: sdk-py-core
Affects Versions: 2.14.0
 Environment: Python
Reporter: niklas Hansson


From version 2.14.0, Python has introduced setup and teardown for DoFn, which is
"Called to prepare an instance for processing bundles of elements. This is a
good place to initialize transient in-memory resources, such as network
connections."

However, when trying to use it for an unbounded job (Pub/Sub source), it seems
like DoFn.setup() is never called and the resources are never initialized.
Instead I get:

 

AttributeError: 'NoneType' object has no attribute 'predict' [while running 
'transform the data']


My source code: [https://github.com/NikeNano/DataflowSklearnStreaming]

 

I am happy to contribute example code for how to use setup as soon as I 
get it running. :)
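As context for the symptom above, here is a minimal plain-Python sketch of the lifecycle contract involved (no Beam dependency; `PredictDoFn`, `run_bundle`, and the stub model are hypothetical stand-ins, not Beam internals): if a runner never calls setup(), the lazily initialized model is still None and predict() fails with exactly this AttributeError.

```python
class PredictDoFn:
    """Toy DoFn-like class; the model is created in setup(), not __init__."""

    def __init__(self):
        self._model = None  # populated by setup(); stays None if setup() is skipped

    def setup(self):
        # Stands in for loading an expensive resource (e.g. a pickled model).
        class _StubModel:
            def predict(self, x):
                return x * 2

        self._model = _StubModel()

    def process(self, element):
        # Raises "'NoneType' object has no attribute 'predict'" if setup()
        # was never invoked -- the symptom reported above.
        yield self._model.predict(element)


def run_bundle(dofn, elements, call_setup=True):
    # Toy runner: a runner that forgets to call setup() hits the error.
    if call_setup:
        dofn.setup()
    out = []
    for e in elements:
        out.extend(dofn.process(e))
    return out
```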

 





[jira] [Work logged] (BEAM-7866) Python MongoDB IO performance and correctness issues

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7866?focusedWorklogId=288246=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288246
 ]

ASF GitHub Bot logged work on BEAM-7866:


Author: ASF GitHub Bot
Created on: 02/Aug/19 20:56
Start Date: 02/Aug/19 20:56
Worklog Time Spent: 10m 
  Work Description: y1chi commented on pull request #9233:  [BEAM-7866] Fix 
python ReadFromMongoDB potential data loss issue
URL: https://github.com/apache/beam/pull/9233#discussion_r310295208
 
 

 ##
 File path: sdks/python/apache_beam/io/mongodbio.py
 ##
 @@ -139,50 +143,77 @@ def __init__(self,
 self.filter = filter
 self.projection = projection
 self.spec = extra_client_params
-self.doc_count = self._get_document_count()
-self.avg_doc_size = self._get_avg_document_size()
-self.client = None
 
   def estimate_size(self):
-return self.avg_doc_size * self.doc_count
+with MongoClient(self.uri, **self.spec) as client:
+  size = client[self.db].command('collstats', self.coll).get('size')
+  if size is None or size <= 0:
+raise ValueError('Collection %s not found or total doc size is '
+ 'incorrect' % self.coll)
+  return size
 
   def split(self, desired_bundle_size, start_position=None, 
stop_position=None):
 # use document cursor index as the start and stop positions
 if start_position is None:
-  start_position = 0
+  epoch = datetime.datetime(1970, 1, 1)
+  start_position = objectid.ObjectId.from_datetime(epoch)
 if stop_position is None:
-  stop_position = self.doc_count
+  last_doc_id = self._get_last_document_id()
+  # add one sec to make sure the last document is not excluded
+  last_timestamp_plus_one_sec = (last_doc_id.generation_time +
+ datetime.timedelta(seconds=1))
+  stop_position = objectid.ObjectId.from_datetime(
+  last_timestamp_plus_one_sec)
 
-# get an estimate on how many documents should be included in a split batch
-desired_bundle_count = desired_bundle_size // self.avg_doc_size
+desired_bundle_size_in_mb = desired_bundle_size // 1024 // 1024
+split_keys = self._get_split_keys(desired_bundle_size_in_mb, 
start_position,
+  stop_position)
 
 bundle_start = start_position
-while bundle_start < stop_position:
-  bundle_end = min(stop_position, bundle_start + desired_bundle_count)
-  yield iobase.SourceBundle(weight=bundle_end - bundle_start,
+for split_key_id in split_keys:
+  bundle_end = min(stop_position, split_key_id)
+  if bundle_start is None and bundle_start < stop_position:
+return
+  yield iobase.SourceBundle(weight=desired_bundle_size_in_mb,
 source=self,
 start_position=bundle_start,
 stop_position=bundle_end)
   bundle_start = bundle_end
+# add range of last split_key to stop_position
+if bundle_start < stop_position:
+  yield iobase.SourceBundle(weight=desired_bundle_size_in_mb,
+source=self,
+start_position=bundle_start,
+stop_position=stop_position)
 
   def get_range_tracker(self, start_position, stop_position):
 if start_position is None:
-  start_position = 0
+  epoch = datetime.datetime(1970, 1, 1)
 
 Review comment:
   My understanding is when the first split happens.
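As a side note, the bundle-construction loop in this hunk can be sketched in isolation (plain Python; `make_bundles` is a hypothetical helper name, and plain integers stand in for ObjectIds, which compare the same way). This shows the intended [start, stop) partitioning, including the final bundle that covers the range after the last split key, rather than the exact guard conditions in the patch:

```python
def make_bundles(split_keys, start, stop):
    # Each split key closes the current bundle; a final bundle covers
    # [last_key, stop) so the tail of the key range is not dropped.
    bundles = []
    bundle_start = start
    for key in split_keys:
        bundle_end = min(stop, key)
        bundles.append((bundle_start, bundle_end))
        bundle_start = bundle_end
    if bundle_start < stop:
        bundles.append((bundle_start, stop))
    return bundles
```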
 



Issue Time Tracking
---

Worklog Id: (was: 288246)
Time Spent: 2h 20m  (was: 2h 10m)

> Python MongoDB IO performance and correctness issues
> 
>
> Key: BEAM-7866
> URL: https://issues.apache.org/jira/browse/BEAM-7866
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Eugene Kirpichov
>Assignee: Yichi Zhang
>Priority: Blocker
> Fix For: 2.15.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py
>  splits the query result by computing number of results in constructor, and 
> then in each reader re-executing the whole query and getting an index 
> sub-range of those results.
> This is broken in several critical ways:
> - The order of query results returned by find() is not necessarily 
> deterministic, so the idea of index ranges on it is meaningless: each shard 
> may basically get random, possibly overlapping subsets of the total results

[jira] [Work logged] (BEAM-7866) Python MongoDB IO performance and correctness issues

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7866?focusedWorklogId=288245=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288245
 ]

ASF GitHub Bot logged work on BEAM-7866:


Author: ASF GitHub Bot
Created on: 02/Aug/19 20:54
Start Date: 02/Aug/19 20:54
Worklog Time Spent: 10m 
  Work Description: y1chi commented on pull request #9233:  [BEAM-7866] Fix 
python ReadFromMongoDB potential data loss issue
URL: https://github.com/apache/beam/pull/9233#discussion_r310294780
 
 

 ##
 File path: sdks/python/apache_beam/io/mongodbio.py
 ##
 @@ -139,50 +143,77 @@ def __init__(self,
 self.filter = filter
 self.projection = projection
 self.spec = extra_client_params
-self.doc_count = self._get_document_count()
-self.avg_doc_size = self._get_avg_document_size()
-self.client = None
 
   def estimate_size(self):
-return self.avg_doc_size * self.doc_count
+with MongoClient(self.uri, **self.spec) as client:
+  size = client[self.db].command('collstats', self.coll).get('size')
+  if size is None or size <= 0:
+raise ValueError('Collection %s not found or total doc size is '
+ 'incorrect' % self.coll)
+  return size
 
   def split(self, desired_bundle_size, start_position=None, 
stop_position=None):
 # use document cursor index as the start and stop positions
 if start_position is None:
-  start_position = 0
+  epoch = datetime.datetime(1970, 1, 1)
 
 Review comment:
   Makes sense, thanks for the suggestion.
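For context on why `ObjectId.from_datetime` works as a range boundary here: the first 4 bytes of a BSON ObjectId are a big-endian POSIX timestamp, and `from_datetime` zeroes the remaining 8 bytes, so such an id sorts at the boundary of all ids generated at that second. A dependency-free sketch (the helper name is hypothetical, standing in for bson's real implementation):

```python
import datetime
import struct


def objectid_bytes_from_datetime(dt):
    # Mirrors bson's ObjectId.from_datetime: 4-byte big-endian seconds
    # since the epoch, followed by 8 zero bytes. Byte-wise (and hence
    # ObjectId) ordering then matches timestamp ordering.
    seconds = int((dt - datetime.datetime(1970, 1, 1)).total_seconds())
    return struct.pack('>I', seconds) + b'\x00' * 8


epoch_id = objectid_bytes_from_datetime(datetime.datetime(1970, 1, 1))
later_id = objectid_bytes_from_datetime(datetime.datetime(2019, 8, 2))
```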
 



Issue Time Tracking
---

Worklog Id: (was: 288245)
Time Spent: 2h 10m  (was: 2h)

> Python MongoDB IO performance and correctness issues
> 
>
> Key: BEAM-7866
> URL: https://issues.apache.org/jira/browse/BEAM-7866
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Eugene Kirpichov
>Assignee: Yichi Zhang
>Priority: Blocker
> Fix For: 2.15.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py
>  splits the query result by computing number of results in constructor, and 
> then in each reader re-executing the whole query and getting an index 
> sub-range of those results.
> This is broken in several critical ways:
> - The order of query results returned by find() is not necessarily 
> deterministic, so the idea of index ranges on it is meaningless: each shard 
> may basically get random, possibly overlapping subsets of the total results
> - Even if you add order by `_id`, the database may be changing concurrently 
> to reading and splitting. E.g. if the database contained documents with ids 
> 10 20 30 40 50, and this was split into shards 0..2 and 3..5 (under the 
> assumption that these shards would contain respectively 10 20 30, and 40 50), 
> and then suppose shard 10 20 30 is read and then document 25 is inserted - 
> then the 3..5 shard will read 30 40 50, i.e. document 30 is duplicated and 
> document 25 is lost.
> - Every shard re-executes the query and skips the first start_offset items, 
> which in total is quadratic complexity
> - The query is first executed in the constructor in order to count results, 
> which 1) means the constructor can be super slow and 2) it won't work at all 
> if the database is unavailable at the time the pipeline is constructed (e.g. 
> if this is a template).
> Unfortunately, none of these issues are caught by SourceTestUtils: this class 
> has extensive coverage with it, and the tests pass. This is because the tests 
> return the same results in the same order. I don't know how to catch this 
> automatically, and I don't know how to catch the performance issue 
> automatically, but these would all be important follow-up items after the 
> actual fix.
> CC: [~chamikara] as reviewer.
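The duplicate/loss scenario described above can be reproduced in a few lines of plain Python (sorted integers stand in for `_id`s; `read_shard` is a hypothetical stand-in for re-running the query and slicing by index):

```python
def read_shard(collection, start_idx, stop_idx):
    # Index-range sharding: re-run the "query" (sorted by _id) and slice by
    # position -- positions shift if the collection changes in the meantime.
    return sorted(collection)[start_idx:stop_idx]


docs = [10, 20, 30, 40, 50]
shard_a = read_shard(docs, 0, 3)   # reads [10, 20, 30]
docs.append(25)                    # concurrent insert between shard reads
shard_b = read_shard(docs, 3, 6)   # the shard planned as "3..5" now reads [30, 40, 50]

duplicated = set(shard_a) & set(shard_b)          # 30 is read twice
lost = set(docs) - (set(shard_a) | set(shard_b))  # 25 is never read
```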





[jira] [Work logged] (BEAM-562) DoFn Reuse: Add new DoFn setup and teardown to python SDK

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-562?focusedWorklogId=288243=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288243
 ]

ASF GitHub Bot logged work on BEAM-562:
---

Author: ASF GitHub Bot
Created on: 02/Aug/19 20:45
Start Date: 02/Aug/19 20:45
Worklog Time Spent: 10m 
  Work Description: NikeNano commented on issue #7994: [BEAM-562] Add 
DoFn.setup and DoFn.teardown to Python SDK
URL: https://github.com/apache/beam/pull/7994#issuecomment-517839074
 
 
   OK, thanks @aaltay. Will investigate further and file an issue if I don't 
get it to work. 
 



Issue Time Tracking
---

Worklog Id: (was: 288243)
Time Spent: 11h 20m  (was: 11h 10m)

> DoFn Reuse: Add new DoFn setup and teardown to python SDK
> -
>
> Key: BEAM-562
> URL: https://issues.apache.org/jira/browse/BEAM-562
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-py-core
>Reporter: Ahmet Altay
>Assignee: Yifan Mai
>Priority: Major
>  Labels: sdk-consistency
> Fix For: 2.14.0
>
>  Time Spent: 11h 20m
>  Remaining Estimate: 0h
>
> The Java SDK added setup and teardown methods to DoFns. This makes DoFns 
> reusable and provides performance improvements. The Python SDK should add support 
> for these new DoFn methods:
> Proposal doc: 
> https://docs.google.com/document/d/1LLQqggSePURt3XavKBGV7SZJYQ4NW8yCu63lBchzMRk/edit?ts=5771458f#
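A dependency-free sketch of the lifecycle ordering this feature adds (the toy runner and `ConnectionDoFn` below are illustrative stand-ins, not Beam's actual bundle processing): setup()/teardown() bracket the whole instance lifetime, while start_bundle()/finish_bundle() bracket each bundle, which is where the reuse-based performance win comes from.

```python
class ConnectionDoFn:
    """Toy DoFn recording the lifecycle hooks in the order they fire."""

    def __init__(self):
        self.events = []

    def setup(self):          # once per DoFn instance
        self.events.append('setup')

    def start_bundle(self):   # once per bundle
        self.events.append('start_bundle')

    def process(self, element):
        self.events.append('process')
        yield element

    def finish_bundle(self):  # once per bundle
        self.events.append('finish_bundle')

    def teardown(self):       # once when the instance is discarded
        self.events.append('teardown')


def run(dofn, bundles):
    # Toy runner: with DoFn reuse, setup()/teardown() run only once even
    # though several bundles are processed by the same instance.
    dofn.setup()
    for bundle in bundles:
        dofn.start_bundle()
        for element in bundle:
            list(dofn.process(element))
        dofn.finish_bundle()
    dofn.teardown()
    return dofn.events
```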





[jira] [Work logged] (BEAM-562) DoFn Reuse: Add new DoFn setup and teardown to python SDK

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-562?focusedWorklogId=288242=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288242
 ]

ASF GitHub Bot logged work on BEAM-562:
---

Author: ASF GitHub Bot
Created on: 02/Aug/19 20:44
Start Date: 02/Aug/19 20:44
Worklog Time Spent: 10m 
  Work Description: aaltay commented on issue #7994: [BEAM-562] Add 
DoFn.setup and DoFn.teardown to Python SDK
URL: https://github.com/apache/beam/pull/7994#issuecomment-517838714
 
 
   @NikeNano it should work with all DoFns regardless of the type of source. 
If that is not working, please file an issue.
 



Issue Time Tracking
---

Worklog Id: (was: 288242)
Time Spent: 11h 10m  (was: 11h)

> DoFn Reuse: Add new DoFn setup and teardown to python SDK
> -
>
> Key: BEAM-562
> URL: https://issues.apache.org/jira/browse/BEAM-562
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-py-core
>Reporter: Ahmet Altay
>Assignee: Yifan Mai
>Priority: Major
>  Labels: sdk-consistency
> Fix For: 2.14.0
>
>  Time Spent: 11h 10m
>  Remaining Estimate: 0h
>
> The Java SDK added setup and teardown methods to DoFns. This makes DoFns 
> reusable and provides performance improvements. The Python SDK should add support 
> for these new DoFn methods:
> Proposal doc: 
> https://docs.google.com/document/d/1LLQqggSePURt3XavKBGV7SZJYQ4NW8yCu63lBchzMRk/edit?ts=5771458f#





[jira] [Work logged] (BEAM-562) DoFn Reuse: Add new DoFn setup and teardown to python SDK

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-562?focusedWorklogId=288241=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288241
 ]

ASF GitHub Bot logged work on BEAM-562:
---

Author: ASF GitHub Bot
Created on: 02/Aug/19 20:40
Start Date: 02/Aug/19 20:40
Worklog Time Spent: 10m 
  Work Description: NikeNano commented on issue #7994: [BEAM-562] Add 
DoFn.setup and DoFn.teardown to Python SDK
URL: https://github.com/apache/beam/pull/7994#issuecomment-517837763
 
 
   @yifanmai is this also expected to work for unbounded sources? I can't get 
it to work when reading from Pub/Sub.
 



Issue Time Tracking
---

Worklog Id: (was: 288241)
Time Spent: 11h  (was: 10h 50m)

> DoFn Reuse: Add new DoFn setup and teardown to python SDK
> -
>
> Key: BEAM-562
> URL: https://issues.apache.org/jira/browse/BEAM-562
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-py-core
>Reporter: Ahmet Altay
>Assignee: Yifan Mai
>Priority: Major
>  Labels: sdk-consistency
> Fix For: 2.14.0
>
>  Time Spent: 11h
>  Remaining Estimate: 0h
>
> The Java SDK added setup and teardown methods to DoFns. This makes DoFns 
> reusable and provides performance improvements. The Python SDK should add support 
> for these new DoFn methods:
> Proposal doc: 
> https://docs.google.com/document/d/1LLQqggSePURt3XavKBGV7SZJYQ4NW8yCu63lBchzMRk/edit?ts=5771458f#





[jira] [Resolved] (BEAM-7878) Refine Spark runner dependencies

2019-08-02 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/BEAM-7878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismaël Mejía resolved BEAM-7878.

Resolution: Fixed

> Refine Spark runner dependencies
> 
>
> Key: BEAM-7878
> URL: https://issues.apache.org/jira/browse/BEAM-7878
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-spark
>Reporter: Ismaël Mejía
>Assignee: Ismaël Mejía
>Priority: Minor
> Fix For: 2.16.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> The Spark runner has more dependencies than it needs:
>  * The jackson_module_scala dependency is marked as compile but it is only 
> needed at runtime so it ends up leaking this module.
>  * The dropwizard, scala-library and commons-compress dependencies are 
> declared but unused.
>  * The Kafka dependency version is not aligned with the rest of Beam.
>  * The Kafka and zookeeper dependencies are used only for the tests but 
> currently are provided





[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=288236=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288236
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 02/Aug/19 20:35
Start Date: 02/Aug/19 20:35
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on pull request #9144: [BEAM-7013] 
Integrating ZetaSketch's HLL++ algorithm with Beam
URL: https://github.com/apache/beam/pull/9144#discussion_r310289052
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountInitFn.java
 ##
 @@ -0,0 +1,152 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.beam.sdk.extensions.zetasketch;
+
+import static 
org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.Preconditions.checkArgument;
+
+import com.google.zetasketch.HyperLogLogPlusPlus;
+import com.google.zetasketch.shaded.com.google.protobuf.ByteString;
+import org.apache.beam.sdk.coders.Coder;
+import org.apache.beam.sdk.coders.CoderRegistry;
+import org.apache.beam.sdk.transforms.Combine;
+
+/**
+ * {@link Combine.CombineFn} for the {@link HllCount.Init} combiner.
+ *
+ * @param <InputT> type of input values to the function
+ * @param <HllT> type of the HLL++ sketch to compute
+ */
+abstract class HllCountInitFn<InputT, HllT>
+    extends Combine.CombineFn<InputT, HyperLogLogPlusPlus<HllT>, byte[]> {
+
+  private int precision;
+
+  private HllCountInitFn() {
+setPrecision(HllCount.DEFAULT_PRECISION);
+  }
+
+  int getPrecision() {
+return precision;
+  }
+
+  void setPrecision(int precision) {
 
 Review comment:
   Done! 
 



Issue Time Tracking
---

Worklog Id: (was: 288236)
Time Spent: 10h 10m  (was: 10h)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 10h 10m
>  Remaining Estimate: 0h
>






[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=288235=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288235
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 02/Aug/19 20:34
Start Date: 02/Aug/19 20:34
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on pull request #9144: [BEAM-7013] 
Integrating ZetaSketch's HLL++ algorithm with Beam
URL: https://github.com/apache/beam/pull/9144#discussion_r310288860
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCount.java
 ##
 @@ -0,0 +1,350 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.beam.sdk.extensions.zetasketch;
+
+import com.google.zetasketch.HyperLogLogPlusPlus;
+import org.apache.beam.sdk.annotations.Experimental;
+import org.apache.beam.sdk.transforms.Combine;
+import org.apache.beam.sdk.transforms.DoFn;
+import org.apache.beam.sdk.transforms.PTransform;
+import org.apache.beam.sdk.transforms.ParDo;
+import org.apache.beam.sdk.values.KV;
+import org.apache.beam.sdk.values.PCollection;
+
+/**
+ * {@code PTransform}s to compute HyperLogLogPlusPlus (HLL++) sketches on data streams based on
+ * the <a href="https://github.com/google/zetasketch">ZetaSketch</a> implementation.
+ *
+ * <p>HLL++ is an algorithm implemented by Google that estimates the count of distinct elements
+ * in a data stream. HLL++ requires significantly less memory than the linear memory needed for
+ * exact computation, at the cost of a small error. Cardinalities of arbitrary breakdowns can be
+ * computed using the HLL++ sketch. See this <a
+ * href="http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/40671.pdf">published
+ * paper</a> for details about the algorithm.
+ *
+ * <p>HLL++ functions are also supported in <a
+ * href="https://cloud.google.com/bigquery/docs/reference/standard-sql/hll_functions">Google
+ * Cloud BigQuery</a>. Using the {@code HllCount PTransform}s makes the interoperation with
+ * BigQuery easier.
+ *
+ * <p>For detailed design of this class, see https://s.apache.org/hll-in-beam.
+ *
+ * <h3>Examples</h3>
+ *
+ * <p>Example 1: Create long-type sketch for a {@code PCollection<Long>} and specify precision
+ *
+ * <pre>{@code
+ * PCollection<Long> input = ...;
+ * int p = ...;
+ * PCollection<byte[]> sketch = input.apply(HllCount.Init.longSketch().withPrecision(p).globally());
+ * }</pre>
+ *
+ * <p>Example 2: Create bytes-type sketch for a {@code PCollection<KV<String, byte[]>>}
+ *
+ * <pre>{@code
+ * PCollection<KV<String, byte[]>> input = ...;
+ * PCollection<KV<String, byte[]>> sketch = input.apply(HllCount.Init.bytesSketch().perKey());
+ * }</pre>
+ *
+ * <p>Example 3: Merge existing sketches in a {@code PCollection<byte[]>} into a new one
+ *
+ * <pre>{@code
+ * PCollection<byte[]> sketches = ...;
+ * PCollection<byte[]> mergedSketch = sketches.apply(HllCount.MergePartial.globally());
+ * }</pre>
+ *
+ * <p>Example 4: Estimate the count of distinct elements in a {@code PCollection<String>}
+ *
+ * <pre>{@code
+ * PCollection<String> input = ...;
+ * PCollection<Long> countDistinct =
+ *     input.apply(HllCount.Init.stringSketch().globally()).apply(HllCount.Extract.globally());
+ * }</pre>
+ *
+ * <p>Note: Currently HllCount does not work on FnAPI workers. See <a
+ * href="https://issues.apache.org/jira/browse/BEAM-7879">Jira ticket [BEAM-7879]</a>.
+ */
+@Experimental
+public final class HllCount {
+
+  public static final int MINIMUM_PRECISION = HyperLogLogPlusPlus.MINIMUM_PRECISION;
 
 Review comment:
   Ah, thanks for catching this. Now I have set up javadoc correctly such that 
the actual numbers can be seen, and I think it is clear now that those are for 
users.
 



Issue Time Tracking
---

Worklog Id: (was: 288235)
Time Spent: 10h  (was: 9h 50m)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 

[jira] [Commented] (BEAM-7049) BeamUnionRel should work on multiple input

2019-08-02 Thread sridhar Reddy (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-7049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899194#comment-16899194
 ] 

sridhar Reddy commented on BEAM-7049:
-

[~amaliujia] Thanks for the update. I will start working on this.

> BeamUnionRel should work on multiple input 
> --
>
> Key: BEAM-7049
> URL: https://issues.apache.org/jira/browse/BEAM-7049
> Project: Beam
>  Issue Type: Improvement
>  Components: dsl-sql
>Reporter: Rui Wang
>Assignee: sridhar Reddy
>Priority: Major
>
> BeamUnionRel assumes inputs are two and rejects more. So `a UNION b UNION c` 
> will have to be created as UNION(a, UNION(b, c)) and have two shuffles. If 
> BeamUnionRel can handle multiple inputs, we will have only one shuffle





[jira] [Reopened] (BEAM-7878) Refine Spark runner dependencies

2019-08-02 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/BEAM-7878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismaël Mejía reopened BEAM-7878:


> Refine Spark runner dependencies
> 
>
> Key: BEAM-7878
> URL: https://issues.apache.org/jira/browse/BEAM-7878
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-spark
>Reporter: Ismaël Mejía
>Assignee: Ismaël Mejía
>Priority: Minor
> Fix For: 2.16.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> The Spark runner has more dependencies than it needs:
>  * The jackson_module_scala dependency is marked as compile but it is only 
> needed at runtime so it ends up leaking this module.
>  * The dropwizard, scala-library and commons-compress dependencies are 
> declared but unused.
>  * The Kafka dependency version is not aligned with the rest of Beam.
>  * The Kafka and zookeeper dependencies are used only for the tests but 
> currently are provided





[jira] [Updated] (BEAM-7878) Refine Spark runner dependencies

2019-08-02 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/BEAM-7878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismaël Mejía updated BEAM-7878:
---
Status: Open  (was: Triage Needed)

> Refine Spark runner dependencies
> 
>
> Key: BEAM-7878
> URL: https://issues.apache.org/jira/browse/BEAM-7878
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-spark
>Reporter: Ismaël Mejía
>Assignee: Ismaël Mejía
>Priority: Minor
> Fix For: 2.16.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> The Spark runner has more dependencies than it needs:
>  * The jackson_module_scala dependency is marked as compile but it is only 
> needed at runtime so it ends up leaking this module.
>  * The dropwizard, scala-library and commons-compress dependencies are 
> declared but unused.
>  * The Kafka dependency version is not aligned with the rest of Beam.
>  * The Kafka and zookeeper dependencies are used only for the tests but 
> currently are provided





[jira] [Comment Edited] (BEAM-7049) BeamUnionRel should work on multiple input

2019-08-02 Thread Rui Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-7049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899191#comment-16899191
 ] 

Rui Wang edited comment on BEAM-7049 at 8/2/19 8:29 PM:


[~sridharG]

A good starting query is
 "SELECT 1 UNION SELECT 2 UNION SELECT 3".

Good entry points are:

https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rule/BeamUnionRule.java#L43
https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamUnionRel.java#L66
https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamSetOperatorRelBase.java#L61


The expected resolution is that you can use multiple tags to CoGBK multiple 
PCollections through the same shuffle: 
https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamSetOperatorRelBase.java#L85

And then you can improve the binary implementation to make it handle more than 
two PCollections here: 
https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/transform/BeamSetOperatorsTransforms.java#L60


was (Author: amaliujia):
A good starting query is
 "SELECT 1 UNION SELECT 2 UNION SELECT 3".

Good entry points are:

https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rule/BeamUnionRule.java#L43
https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamUnionRel.java#L66
https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamSetOperatorRelBase.java#L61


The expected resolution is that you can use multiple tags to CoGBK multiple 
PCollections through the same shuffle: 
https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamSetOperatorRelBase.java#L85

And then you can improve the binary implementation to make it handle more than 
two PCollections here: 
https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/transform/BeamSetOperatorsTransforms.java#L60

> BeamUnionRel should work on multiple input 
> --
>
> Key: BEAM-7049
> URL: https://issues.apache.org/jira/browse/BEAM-7049
> Project: Beam
>  Issue Type: Improvement
>  Components: dsl-sql
>Reporter: Rui Wang
>Assignee: sridhar Reddy
>Priority: Major
>
> BeamUnionRel assumes inputs are two and rejects more. So `a UNION b UNION c` 
> will have to be created as UNION(a, UNION(b, c)) and have two shuffles. If 
> BeamUnionRel can handle multiple inputs, we will have only one shuffle





[jira] [Commented] (BEAM-7049) BeamUnionRel should work on multiple input

2019-08-02 Thread Rui Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-7049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899191#comment-16899191
 ] 

Rui Wang commented on BEAM-7049:


A good starting query is
 "SELECT 1 UNION SELECT 2 UNION SELECT 3".

Good entry points are:

https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rule/BeamUnionRule.java#L43
https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamUnionRel.java#L66
https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamSetOperatorRelBase.java#L61


The expected resolution is that you can use multiple tags to CoGBK multiple 
PCollections through the same shuffle: 
https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamSetOperatorRelBase.java#L85

And then you can improve the binary implementation to make it handle more than 
two PCollections here: 
https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/transform/BeamSetOperatorsTransforms.java#L60
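The single-shuffle idea in this comment can be sketched in plain Python (a dict-based grouping with per-input tags stands in for one CoGBK shuffle; UNION here means UNION DISTINCT):

```python
from collections import defaultdict


def union_all_via_single_shuffle(tagged_inputs):
    # Tag every input collection, then send all rows through ONE grouping
    # step (standing in for a single CoGBK shuffle). Emitting each distinct
    # row once gives UNION semantics over N inputs with a single shuffle,
    # instead of N-1 chained binary unions.
    tags_per_row = defaultdict(set)
    for tag, rows in tagged_inputs.items():
        for row in rows:
            tags_per_row[row].add(tag)
    return sorted(tags_per_row)
```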

> BeamUnionRel should work on multiple input 
> --
>
> Key: BEAM-7049
> URL: https://issues.apache.org/jira/browse/BEAM-7049
> Project: Beam
>  Issue Type: Improvement
>  Components: dsl-sql
>Reporter: Rui Wang
>Assignee: sridhar Reddy
>Priority: Major
>
> BeamUnionRel assumes there are exactly two inputs and rejects more, so `a UNION b UNION c` 
> has to be built as `UNION(a, UNION(b, c))`, which requires two shuffles. If 
> BeamUnionRel can handle multiple inputs, we will need only one shuffle.





[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=288231&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288231
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 02/Aug/19 20:27
Start Date: 02/Aug/19 20:27
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on pull request #9144: [BEAM-7013] 
Integrating ZetaSketch's HLL++ algorithm with Beam
URL: https://github.com/apache/beam/pull/9144#discussion_r310286718
 
 

 ##
 File path: 
sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java
 ##
 @@ -200,6 +200,11 @@
  *
  * }
  *
+ * Consider using the {@code HllCount.Init} transform in the {@code zetasketch}
+ * extension module if you need to create sketches in a format compatible with
+ * Google Cloud BigQuery. For more details
 
 Review comment:
   Done!
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 288231)
Time Spent: 9h 50m  (was: 9h 40m)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 9h 50m
>  Remaining Estimate: 0h
>






[jira] [Assigned] (BEAM-7049) BeamUnionRel should work on multiple inputs

2019-08-02 Thread Rui Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Wang reassigned BEAM-7049:
--

Assignee: sridhar Reddy

> BeamUnionRel should work on multiple inputs 
> --
>
> Key: BEAM-7049
> URL: https://issues.apache.org/jira/browse/BEAM-7049
> Project: Beam
>  Issue Type: Improvement
>  Components: dsl-sql
>Reporter: Rui Wang
>Assignee: sridhar Reddy
>Priority: Major
>
> BeamUnionRel assumes there are exactly two inputs and rejects more, so `a UNION b UNION c` 
> has to be built as `UNION(a, UNION(b, c))`, which requires two shuffles. If 
> BeamUnionRel can handle multiple inputs, we will need only one shuffle.





[jira] [Work logged] (BEAM-7846) add test for BEAM-7689

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7846?focusedWorklogId=288226&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288226
 ]

ASF GitHub Bot logged work on BEAM-7846:


Author: ASF GitHub Bot
Created on: 02/Aug/19 20:18
Start Date: 02/Aug/19 20:18
Worklog Time Spent: 10m 
  Work Description: ihji commented on issue #9228: [BEAM-7846] add test for 
BEAM-7689
URL: https://github.com/apache/beam/pull/9228#issuecomment-517831685
 
 
   run java precommit
 



Issue Time Tracking
---

Worklog Id: (was: 288226)
Time Spent: 0.5h  (was: 20m)

> add test for BEAM-7689
> --
>
> Key: BEAM-7846
> URL: https://issues.apache.org/jira/browse/BEAM-7846
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-files, io-python-files
>Reporter: Heejong Lee
>Assignee: Heejong Lee
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Add a test for BEAM-7689, and also a Python counterpart, to make sure that it 
> won't come back :)





[jira] [Work logged] (BEAM-7866) Python MongoDB IO performance and correctness issues

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7866?focusedWorklogId=288222&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288222
 ]

ASF GitHub Bot logged work on BEAM-7866:


Author: ASF GitHub Bot
Created on: 02/Aug/19 20:12
Start Date: 02/Aug/19 20:12
Worklog Time Spent: 10m 
  Work Description: jkff commented on pull request #9233:  [BEAM-7866] Fix 
python ReadFromMongoDB potential data loss issue
URL: https://github.com/apache/beam/pull/9233#discussion_r310281527
 
 

 ##
 File path: sdks/python/apache_beam/io/mongodbio_test.py
 ##
 @@ -30,38 +32,102 @@
 from apache_beam.io.mongodbio import _BoundedMongoSource
 from apache_beam.io.mongodbio import _GenerateObjectIdFn
 from apache_beam.io.mongodbio import _MongoSink
+from apache_beam.io.mongodbio import _ObjectIdRangeTracker
 from apache_beam.io.mongodbio import _WriteMongoFn
 from apache_beam.testing.test_pipeline import TestPipeline
 from apache_beam.testing.util import assert_that
 from apache_beam.testing.util import equal_to
 
 
+class _MockMongoColl(object):
+  def __init__(self, docs):
+    self.docs = docs
+
+  def _filter(self, filter):
+    match = []
+    if not filter:
+      return self
+    start = filter['_id'].get('$gte')
+    end = filter['_id'].get('$lt')
+    for doc in self.docs:
+      if start and doc['_id'] < start:
+        continue
+      if end and doc['_id'] >= end:
+        continue
+      match.append(doc)
+    return match
+
+  def find(self, filter=None, **kwargs):
+    return _MockMongoColl(self._filter(filter))
+
+  def sort(self, sort_items):
+    key, order = sort_items[0]
+    self.docs = sorted(self.docs,
+                       key=lambda x: x[key],
+                       reverse=(order != ASCENDING))
+    return self
+
+  def limit(self, num):
+    return _MockMongoColl(self.docs[0:num])
+
+  def count_documents(self, filter):
+    return len(self._filter(filter))
+
+  def __getitem__(self, item):
+    return self.docs[item]
+
+
 class MongoSourceTest(unittest.TestCase):
-  @mock.patch('apache_beam.io.mongodbio._BoundedMongoSource'
-              '._get_document_count')
-  @mock.patch('apache_beam.io.mongodbio._BoundedMongoSource'
-              '._get_avg_document_size')
-  def setUp(self, mock_size, mock_count):
-    mock_size.return_value = 10
-    mock_count.return_value = 5
+  @mock.patch('apache_beam.io.mongodbio.MongoClient')
+  def setUp(self, mock_client):
+    mock_client.return_value.__enter__.return_value.__getitem__ \
+      .return_value.command.return_value = {'size': 5, 'avgSize': 1}
+    self._ids = [
+        objectid.ObjectId.from_datetime(
+            datetime.datetime(year=2020, month=i + 1, day=i + 1))
+        for i in range(5)
+    ]
+    self._docs = [{'_id': self._ids[i], 'x': i} for i in range(len(self._ids))]
+
     self.mongo_source = _BoundedMongoSource('mongodb://test', 'testdb',
                                             'testcoll')
 
-  def test_estimate_size(self):
-    self.assertEqual(self.mongo_source.estimate_size(), 50)
+  def get_split(self, command, ns, min, max, maxChunkSize, **kwargs):
 
 Review comment:
   * What does this do? It's not quite clear from the method body
   * maxChunkSize should probably be called max_chunk_size
 



Issue Time Tracking
---

Worklog Id: (was: 288222)
Time Spent: 1h 50m  (was: 1h 40m)

> Python MongoDB IO performance and correctness issues
> 
>
> Key: BEAM-7866
> URL: https://issues.apache.org/jira/browse/BEAM-7866
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Eugene Kirpichov
>Assignee: Yichi Zhang
>Priority: Blocker
> Fix For: 2.15.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py
>  splits the query result by computing number of results in constructor, and 
> then in each reader re-executing the whole query and getting an index 
> sub-range of those results.
> This is broken in several critical ways:
> - The order of query results returned by find() is not necessarily 
> deterministic, so the idea of index ranges on it is meaningless: each shard 
> may basically get random, possibly overlapping subsets of the total results
> - Even if you add order by `_id`, the database may be changing concurrently 
> to reading and splitting. E.g. if the database contained documents with ids 
> 10 

[jira] [Work logged] (BEAM-7866) Python MongoDB IO performance and correctness issues

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7866?focusedWorklogId=288221&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288221
 ]

ASF GitHub Bot logged work on BEAM-7866:


Author: ASF GitHub Bot
Created on: 02/Aug/19 20:12
Start Date: 02/Aug/19 20:12
Worklog Time Spent: 10m 
  Work Description: jkff commented on pull request #9233:  [BEAM-7866] Fix 
python ReadFromMongoDB potential data loss issue
URL: https://github.com/apache/beam/pull/9233#discussion_r310280167
 
 

 ##
 File path: sdks/python/apache_beam/io/mongodbio.py
 ##
 @@ -194,18 +225,72 @@ def display_data(self):
 res['mongo_client_spec'] = self.spec
 return res
 
-  def _get_avg_document_size(self):
+  def _get_split_keys(self, desired_chunk_size, start_pos, end_pos):
+    # if desired chunk size smaller than 1mb, use mongodb default split size of
+    # 1mb
+    if desired_chunk_size < 1:
+      desired_chunk_size = 1
+    if start_pos >= end_pos:
+      # single document not splittable
+      return []
     with MongoClient(self.uri, **self.spec) as client:
-      size = client[self.db].command('collstats', self.coll).get('avgObjSize')
-      if size is None or size <= 0:
-        raise ValueError(
-            'Collection %s not found or average doc size is '
-            'incorrect', self.coll)
-      return size
-
-  def _get_document_count(self):
+      name_space = '%s.%s' % (self.db, self.coll)
+      return (client[self.db].command(
+          'splitVector',
+          name_space,
+          keyPattern={'_id': 1},
+          min={'_id': start_pos},
+          max={'_id': end_pos},
+          maxChunkSize=desired_chunk_size)['splitKeys'])
+
+  def _get_last_document_id(self):
     with MongoClient(self.uri, **self.spec) as client:
-      return max(client[self.db][self.coll].count_documents(self.filter), 0)
+      cursor = client[self.db][self.coll].find(filter={}, projection=[]).sort([
+          ('_id', DESCENDING)
+      ]).limit(1)
+      try:
+        return cursor[0]['_id']
+      except IndexError:
+        raise ValueError('Empty Mongodb collection')
+
+
+class _ObjectIdRangeTracker(OrderedPositionRangeTracker):
+  """RangeTracker for tracking mongodb _id of bson ObjectId type."""
+
+  def _id_to_int(self, id):
+    # id object is bytes type with size of 12
+    ints = struct.unpack('>III', id.binary)
+    return 2**64 * ints[0] + 2**32 * ints[1] + ints[2]
+
+  def _int_to_id(self, numbers):
 
 Review comment:
   This kind of bignum arithmetic is extremely hard to get right, please make 
sure the unit tests cover enough corner cases. I would recommend to add some 
randomized property testing.
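To make the reviewer's suggestion concrete, here is a standalone sketch of the 96-bit conversion with a randomized round-trip property test. The function names mirror `_id_to_int`/`_int_to_id` but the bodies are illustrative (they operate on the raw 12 bytes rather than a bson ObjectId), not the PR's code:

```python
import random
import struct


def id_to_int(id_bytes):
    # A 12-byte ObjectId viewed as three big-endian uint32 words.
    a, b, c = struct.unpack('>III', id_bytes)
    return (a << 64) + (b << 32) + c


def int_to_id(number):
    # Inverse mapping; valid ObjectIds occupy exactly 96 bits.
    if not 0 <= number < 2 ** 96:
        raise ValueError(
            'number %d out of range for a 12-byte ObjectId' % number)
    return struct.pack('>III',
                       (number >> 64) & 0xFFFFFFFF,
                       (number >> 32) & 0xFFFFFFFF,
                       number & 0xFFFFFFFF)


# Randomized round-trip property test, as the review suggests: every
# 96-bit integer must survive int -> bytes -> int unchanged.
rng = random.Random(42)
for _ in range(1000):
    n = rng.randrange(2 ** 96)
    assert id_to_int(int_to_id(n)) == n
```

The boundary cases (0, 2**96 - 1, and values straddling the 32- and 64-bit word edges) are exactly where hand-rolled bignum code tends to break, so they belong in the unit tests too.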
 



Issue Time Tracking
---

Worklog Id: (was: 288221)
Time Spent: 1h 40m  (was: 1.5h)

> Python MongoDB IO performance and correctness issues
> 
>
> Key: BEAM-7866
> URL: https://issues.apache.org/jira/browse/BEAM-7866
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Eugene Kirpichov
>Assignee: Yichi Zhang
>Priority: Blocker
> Fix For: 2.15.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py
>  splits the query result by computing number of results in constructor, and 
> then in each reader re-executing the whole query and getting an index 
> sub-range of those results.
> This is broken in several critical ways:
> - The order of query results returned by find() is not necessarily 
> deterministic, so the idea of index ranges on it is meaningless: each shard 
> may basically get random, possibly overlapping subsets of the total results
> - Even if you add order by `_id`, the database may be changing concurrently 
> to reading and splitting. E.g. if the database contained documents with ids 
> 10 20 30 40 50, and this was split into shards 0..2 and 3..5 (under the 
> assumption that these shards would contain respectively 10 20 30, and 40 50), 
> and then suppose shard 10 20 30 is read and then document 25 is inserted - 
> then the 3..5 shard will read 30 40 50, i.e. document 30 is duplicated and 
> document 25 is lost.
> - Every shard re-executes the query and skips the first start_offset items, 
> which in total is quadratic complexity
> - The query is first executed in the constructor in order to count results, 
> which 1) means the constructor can be super slow and 2) it won't 

[jira] [Work logged] (BEAM-7866) Python MongoDB IO performance and correctness issues

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7866?focusedWorklogId=288218&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288218
 ]

ASF GitHub Bot logged work on BEAM-7866:


Author: ASF GitHub Bot
Created on: 02/Aug/19 20:12
Start Date: 02/Aug/19 20:12
Worklog Time Spent: 10m 
  Work Description: jkff commented on pull request #9233:  [BEAM-7866] Fix 
python ReadFromMongoDB potential data loss issue
URL: https://github.com/apache/beam/pull/9233#discussion_r310278529
 
 

 ##
 File path: sdks/python/apache_beam/io/mongodbio.py
 ##
 @@ -139,50 +143,77 @@ def __init__(self,
     self.filter = filter
     self.projection = projection
     self.spec = extra_client_params
-    self.doc_count = self._get_document_count()
-    self.avg_doc_size = self._get_avg_document_size()
-    self.client = None
 
   def estimate_size(self):
-    return self.avg_doc_size * self.doc_count
+    with MongoClient(self.uri, **self.spec) as client:
+      size = client[self.db].command('collstats', self.coll).get('size')
+      if size is None or size <= 0:
+        raise ValueError('Collection %s not found or total doc size is '
+                         'incorrect' % self.coll)
+      return size
 
   def split(self, desired_bundle_size, start_position=None,
             stop_position=None):
     # use document cursor index as the start and stop positions
     if start_position is None:
-      start_position = 0
+      epoch = datetime.datetime(1970, 1, 1)
+      start_position = objectid.ObjectId.from_datetime(epoch)
     if stop_position is None:
-      stop_position = self.doc_count
+      last_doc_id = self._get_last_document_id()
+      # add one sec to make sure the last document is not excluded
+      last_timestamp_plus_one_sec = (last_doc_id.generation_time +
+                                     datetime.timedelta(seconds=1))
+      stop_position = objectid.ObjectId.from_datetime(
+          last_timestamp_plus_one_sec)
 
-    # get an estimate on how many documents should be included in a split batch
-    desired_bundle_count = desired_bundle_size // self.avg_doc_size
+    desired_bundle_size_in_mb = desired_bundle_size // 1024 // 1024
+    split_keys = self._get_split_keys(desired_bundle_size_in_mb,
+                                      start_position, stop_position)
 
     bundle_start = start_position
-    while bundle_start < stop_position:
-      bundle_end = min(stop_position, bundle_start + desired_bundle_count)
-      yield iobase.SourceBundle(weight=bundle_end - bundle_start,
+    for split_key_id in split_keys:
+      bundle_end = min(stop_position, split_key_id)
+      if bundle_start is None and bundle_start < stop_position:
+        return
+      yield iobase.SourceBundle(weight=desired_bundle_size_in_mb,
                                 source=self,
                                 start_position=bundle_start,
                                 stop_position=bundle_end)
       bundle_start = bundle_end
+    # add range of last split_key to stop_position
+    if bundle_start < stop_position:
+      yield iobase.SourceBundle(weight=desired_bundle_size_in_mb,
+                                source=self,
+                                start_position=bundle_start,
+                                stop_position=stop_position)
 
   def get_range_tracker(self, start_position, stop_position):
     if start_position is None:
-      start_position = 0
+      epoch = datetime.datetime(1970, 1, 1)
 
 Review comment:
   When can start_position and stop_position be None in get_range_tracker?
 



Issue Time Tracking
---

Worklog Id: (was: 288218)
Time Spent: 1.5h  (was: 1h 20m)

> Python MongoDB IO performance and correctness issues
> 
>
> Key: BEAM-7866
> URL: https://issues.apache.org/jira/browse/BEAM-7866
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Eugene Kirpichov
>Assignee: Yichi Zhang
>Priority: Blocker
> Fix For: 2.15.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py
>  splits the query result by computing number of results in constructor, and 
> then in each reader re-executing the whole query and getting an index 
> sub-range of those results.
> This is broken in several critical ways:
> - The order of query results returned by find() is not necessarily 
> 

[jira] [Work logged] (BEAM-7866) Python MongoDB IO performance and correctness issues

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7866?focusedWorklogId=288223&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288223
 ]

ASF GitHub Bot logged work on BEAM-7866:


Author: ASF GitHub Bot
Created on: 02/Aug/19 20:12
Start Date: 02/Aug/19 20:12
Worklog Time Spent: 10m 
  Work Description: jkff commented on pull request #9233:  [BEAM-7866] Fix 
python ReadFromMongoDB potential data loss issue
URL: https://github.com/apache/beam/pull/9233#discussion_r310281025
 
 

 ##
 File path: sdks/python/apache_beam/io/mongodbio_test.py
 ##
 @@ -30,38 +32,102 @@
 from apache_beam.io.mongodbio import _BoundedMongoSource
 from apache_beam.io.mongodbio import _GenerateObjectIdFn
 from apache_beam.io.mongodbio import _MongoSink
+from apache_beam.io.mongodbio import _ObjectIdRangeTracker
 from apache_beam.io.mongodbio import _WriteMongoFn
 from apache_beam.testing.test_pipeline import TestPipeline
 from apache_beam.testing.util import assert_that
 from apache_beam.testing.util import equal_to
 
 
+class _MockMongoColl(object):
+  def __init__(self, docs):
+    self.docs = docs
+
+  def _filter(self, filter):
+    match = []
+    if not filter:
+      return self
+    start = filter['_id'].get('$gte')
+    end = filter['_id'].get('$lt')
+    for doc in self.docs:
+      if start and doc['_id'] < start:
+        continue
+      if end and doc['_id'] >= end:
+        continue
+      match.append(doc)
+    return match
+
+  def find(self, filter=None, **kwargs):
+    return _MockMongoColl(self._filter(filter))
+
+  def sort(self, sort_items):
+    key, order = sort_items[0]
+    self.docs = sorted(self.docs,
+                       key=lambda x: x[key],
+                       reverse=(order != ASCENDING))
+    return self
+
+  def limit(self, num):
+    return _MockMongoColl(self.docs[0:num])
+
+  def count_documents(self, filter):
+    return len(self._filter(filter))
+
+  def __getitem__(self, item):
+    return self.docs[item]
+
+
 class MongoSourceTest(unittest.TestCase):
-  @mock.patch('apache_beam.io.mongodbio._BoundedMongoSource'
-              '._get_document_count')
-  @mock.patch('apache_beam.io.mongodbio._BoundedMongoSource'
-              '._get_avg_document_size')
-  def setUp(self, mock_size, mock_count):
-    mock_size.return_value = 10
-    mock_count.return_value = 5
+  @mock.patch('apache_beam.io.mongodbio.MongoClient')
+  def setUp(self, mock_client):
+    mock_client.return_value.__enter__.return_value.__getitem__ \
 
 Review comment:
   It's hard to follow what this line is doing, could you add a comment? Ditto 
for other similar statements.
   I'm wondering if mocking is even the right approach here. Maybe create a 
fake instead? Seems like that could be much more readable and less fragile 
w.r.t. precise order of method calls.
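To illustrate the fake the reviewer has in mind, here is a minimal in-memory stand-in for the pieces of `MongoClient` the source touches. All class names are hypothetical; this is a sketch of the pattern, not code from the PR:

```python
class FakeDatabase(object):
  """In-memory stand-in for pymongo's Database, just enough for tests."""

  def __init__(self, collections, collstats):
    self._collections = collections
    self._collstats = collstats

  def command(self, name, *args, **kwargs):
    # Only the 'collstats' command is needed by _BoundedMongoSource.
    if name == 'collstats':
      return self._collstats
    raise NotImplementedError(name)

  def __getitem__(self, coll_name):
    return self._collections[coll_name]


class FakeMongoClient(object):
  """Context-manager fake that could replace MongoClient in tests."""

  def __init__(self, dbs):
    self._dbs = dbs

  def __enter__(self):
    return self

  def __exit__(self, *unused_exc_info):
    return False

  def __getitem__(self, db_name):
    return self._dbs[db_name]


client = FakeMongoClient(
    {'testdb': FakeDatabase({'testcoll': ['doc']},
                            {'size': 5, 'avgObjSize': 1})})
with client as c:
  assert c['testdb'].command('collstats', 'testcoll')['size'] == 5
```

Unlike a `mock.patch` chain, the fake does not care in what order `command`, `__getitem__`, or the context-manager methods are called, which is what makes it less fragile.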
 



Issue Time Tracking
---

Worklog Id: (was: 288223)
Time Spent: 2h  (was: 1h 50m)

> Python MongoDB IO performance and correctness issues
> 
>
> Key: BEAM-7866
> URL: https://issues.apache.org/jira/browse/BEAM-7866
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Eugene Kirpichov
>Assignee: Yichi Zhang
>Priority: Blocker
> Fix For: 2.15.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py
>  splits the query result by computing number of results in constructor, and 
> then in each reader re-executing the whole query and getting an index 
> sub-range of those results.
> This is broken in several critical ways:
> - The order of query results returned by find() is not necessarily 
> deterministic, so the idea of index ranges on it is meaningless: each shard 
> may basically get random, possibly overlapping subsets of the total results
> - Even if you add order by `_id`, the database may be changing concurrently 
> to reading and splitting. E.g. if the database contained documents with ids 
> 10 20 30 40 50, and this was split into shards 0..2 and 3..5 (under the 
> assumption that these shards would contain respectively 10 20 30, and 40 50), 
> and then suppose shard 10 20 30 is read and then document 25 is inserted - 
> then the 3..5 shard will read 30 40 50, i.e. document 30 is duplicated and 
> document 25 is lost.
> - Every shard re-executes the query and skips the first start_offset items, 
> which in total is quadratic 

[jira] [Work logged] (BEAM-7866) Python MongoDB IO performance and correctness issues

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7866?focusedWorklogId=288217&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288217
 ]

ASF GitHub Bot logged work on BEAM-7866:


Author: ASF GitHub Bot
Created on: 02/Aug/19 20:12
Start Date: 02/Aug/19 20:12
Worklog Time Spent: 10m 
  Work Description: jkff commented on pull request #9233:  [BEAM-7866] Fix 
python ReadFromMongoDB potential data loss issue
URL: https://github.com/apache/beam/pull/9233#discussion_r310278966
 
 

 ##
 File path: sdks/python/apache_beam/io/mongodbio.py
 ##
 @@ -139,50 +143,77 @@ def __init__(self,
     self.filter = filter
     self.projection = projection
     self.spec = extra_client_params
-    self.doc_count = self._get_document_count()
-    self.avg_doc_size = self._get_avg_document_size()
-    self.client = None
 
   def estimate_size(self):
-    return self.avg_doc_size * self.doc_count
+    with MongoClient(self.uri, **self.spec) as client:
+      size = client[self.db].command('collstats', self.coll).get('size')
+      if size is None or size <= 0:
+        raise ValueError('Collection %s not found or total doc size is '
+                         'incorrect' % self.coll)
+      return size
 
   def split(self, desired_bundle_size, start_position=None,
             stop_position=None):
     # use document cursor index as the start and stop positions
     if start_position is None:
-      start_position = 0
+      epoch = datetime.datetime(1970, 1, 1)
+      start_position = objectid.ObjectId.from_datetime(epoch)
     if stop_position is None:
-      stop_position = self.doc_count
+      last_doc_id = self._get_last_document_id()
+      # add one sec to make sure the last document is not excluded
+      last_timestamp_plus_one_sec = (last_doc_id.generation_time +
+                                     datetime.timedelta(seconds=1))
+      stop_position = objectid.ObjectId.from_datetime(
+          last_timestamp_plus_one_sec)
 
-    # get an estimate on how many documents should be included in a split batch
-    desired_bundle_count = desired_bundle_size // self.avg_doc_size
+    desired_bundle_size_in_mb = desired_bundle_size // 1024 // 1024
+    split_keys = self._get_split_keys(desired_bundle_size_in_mb,
+                                      start_position, stop_position)
 
     bundle_start = start_position
-    while bundle_start < stop_position:
-      bundle_end = min(stop_position, bundle_start + desired_bundle_count)
-      yield iobase.SourceBundle(weight=bundle_end - bundle_start,
+    for split_key_id in split_keys:
+      bundle_end = min(stop_position, split_key_id)
+      if bundle_start is None and bundle_start < stop_position:
+        return
+      yield iobase.SourceBundle(weight=desired_bundle_size_in_mb,
                                 source=self,
                                 start_position=bundle_start,
                                 stop_position=bundle_end)
       bundle_start = bundle_end
+    # add range of last split_key to stop_position
+    if bundle_start < stop_position:
+      yield iobase.SourceBundle(weight=desired_bundle_size_in_mb,
+                                source=self,
+                                start_position=bundle_start,
+                                stop_position=stop_position)
 
   def get_range_tracker(self, start_position, stop_position):
     if start_position is None:
-      start_position = 0
+      epoch = datetime.datetime(1970, 1, 1)
+      start_position = objectid.ObjectId.from_datetime(epoch)
     if stop_position is None:
-      stop_position = self.doc_count
-    return OffsetRangeTracker(start_position, stop_position)
+      last_doc_id = self._get_last_document_id()
+      # add one sec to make sure the last document is not excluded
+      last_timestamp_plus_one_sec = (last_doc_id.generation_time +
+                                     datetime.timedelta(seconds=1))
+      stop_position = objectid.ObjectId.from_datetime(
+          last_timestamp_plus_one_sec)
+    return _ObjectIdRangeTracker(start_position, stop_position)
 
   def read(self, range_tracker):
     with MongoClient(self.uri, **self.spec) as client:
-      # docs is a MongoDB Cursor
-      docs = client[self.db][self.coll].find(
-          filter=self.filter, projection=self.projection
-      )[range_tracker.start_position():range_tracker.stop_position()]
-      for index in range(range_tracker.start_position(),
-                         range_tracker.stop_position()):
-        if not range_tracker.try_claim(index):
+      all_filters = self.filter
+      all_filters.update({
+          '_id': {
+              '$gte': range_tracker.start_position(),
 
 Review comment:
   What if the original filter already has gte/lt conditions on _id? I think 
it's fine to prohibit it (at construction time); but you can also take it into 
account and infer the 

[jira] [Work logged] (BEAM-7866) Python MongoDB IO performance and correctness issues

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7866?focusedWorklogId=288224&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288224
 ]

ASF GitHub Bot logged work on BEAM-7866:


Author: ASF GitHub Bot
Created on: 02/Aug/19 20:12
Start Date: 02/Aug/19 20:12
Worklog Time Spent: 10m 
  Work Description: jkff commented on pull request #9233:  [BEAM-7866] Fix 
python ReadFromMongoDB potential data loss issue
URL: https://github.com/apache/beam/pull/9233#discussion_r310279232
 
 

 ##
 File path: sdks/python/apache_beam/io/mongodbio.py
 ##
 @@ -194,18 +225,72 @@ def display_data(self):
 res['mongo_client_spec'] = self.spec
 return res
 
-  def _get_avg_document_size(self):
+  def _get_split_keys(self, desired_chunk_size, start_pos, end_pos):
+    # if desired chunk size smaller than 1mb, use mongodb default split size of
+    # 1mb
+    if desired_chunk_size < 1:
 
 Review comment:
   What are the units of desired_chunk_size? The comment implies mb?
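The surrounding code suggests an answer to the units question: `desired_bundle_size` arrives in bytes, while `splitVector`'s `maxChunkSize` is expressed in megabytes, and the guard falls back to 1 MB below that. A tiny sketch of the conversion, with a hypothetical helper name:

```python
def to_split_chunk_size_mb(desired_bundle_size_bytes):
    # desired_bundle_size arrives in bytes; splitVector's maxChunkSize is
    # expressed in megabytes, and the code treats 1 MB as the floor.
    size_mb = desired_bundle_size_bytes // (1024 * 1024)
    return max(size_mb, 1)


print(to_split_chunk_size_mb(64 * 1024 * 1024))  # 64
print(to_split_chunk_size_mb(100))               # 1
```

Naming the parameter with its unit (as the reviewer hints) would make this conversion self-documenting.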
 



Issue Time Tracking
---

Worklog Id: (was: 288224)
Time Spent: 2h  (was: 1h 50m)

> Python MongoDB IO performance and correctness issues
> 
>
> Key: BEAM-7866
> URL: https://issues.apache.org/jira/browse/BEAM-7866
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Eugene Kirpichov
>Assignee: Yichi Zhang
>Priority: Blocker
> Fix For: 2.15.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py
>  splits the query result by computing number of results in constructor, and 
> then in each reader re-executing the whole query and getting an index 
> sub-range of those results.
> This is broken in several critical ways:
> - The order of query results returned by find() is not necessarily 
> deterministic, so the idea of index ranges on it is meaningless: each shard 
> may basically get random, possibly overlapping subsets of the total results
> - Even if you add order by `_id`, the database may be changing concurrently 
> to reading and splitting. E.g. if the database contained documents with ids 
> 10 20 30 40 50, and this was split into shards 0..2 and 3..5 (under the 
> assumption that these shards would contain respectively 10 20 30, and 40 50), 
> and then suppose shard 10 20 30 is read and then document 25 is inserted - 
> then the 3..5 shard will read 30 40 50, i.e. document 30 is duplicated and 
> document 25 is lost.
> - Every shard re-executes the query and skips the first start_offset items, 
> which in total is quadratic complexity
> - The query is first executed in the constructor in order to count results, 
> which 1) means the constructor can be super slow and 2) it won't work at all 
> if the database is unavailable at the time the pipeline is constructed (e.g. 
> if this is a template).
> Unfortunately, none of these issues are caught by SourceTestUtils: this class 
> has extensive coverage with it, and the tests pass. This is because the tests 
> return the same results in the same order. I don't know how to catch this 
> automatically, and I don't know how to catch the performance issue 
> automatically, but these would all be important follow-up items after the 
> actual fix.
> CC: [~chamikara] as reviewer.





[jira] [Work logged] (BEAM-7866) Python MongoDB IO performance and correctness issues

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7866?focusedWorklogId=288219&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288219
 ]

ASF GitHub Bot logged work on BEAM-7866:


Author: ASF GitHub Bot
Created on: 02/Aug/19 20:12
Start Date: 02/Aug/19 20:12
Worklog Time Spent: 10m 
  Work Description: jkff commented on pull request #9233:  [BEAM-7866] Fix 
python ReadFromMongoDB potential data loss issue
URL: https://github.com/apache/beam/pull/9233#discussion_r310280815
 
 

 ##
 File path: sdks/python/apache_beam/io/mongodbio_test.py
 ##
 @@ -30,38 +32,102 @@
 from apache_beam.io.mongodbio import _BoundedMongoSource
 from apache_beam.io.mongodbio import _GenerateObjectIdFn
 from apache_beam.io.mongodbio import _MongoSink
+from apache_beam.io.mongodbio import _ObjectIdRangeTracker
 from apache_beam.io.mongodbio import _WriteMongoFn
 from apache_beam.testing.test_pipeline import TestPipeline
 from apache_beam.testing.util import assert_that
 from apache_beam.testing.util import equal_to
 
 
+class _MockMongoColl(object):
+  def __init__(self, docs):
+    self.docs = docs
+
+  def _filter(self, filter):
+    match = []
+    if not filter:
+      return self
+    start = filter['_id'].get('$gte')
 
 Review comment:
   Please assert that start, end are both not None
 



Issue Time Tracking
---

Worklog Id: (was: 288219)

> Python MongoDB IO performance and correctness issues
> 
>
> Key: BEAM-7866
> URL: https://issues.apache.org/jira/browse/BEAM-7866
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Eugene Kirpichov
>Assignee: Yichi Zhang
>Priority: Blocker
> Fix For: 2.15.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py
>  splits the query result by computing the number of results in the 
> constructor, and then having each reader re-execute the whole query and take 
> an index sub-range of those results.
> This is broken in several critical ways:
> - The order of query results returned by find() is not necessarily 
> deterministic, so the idea of index ranges on it is meaningless: each shard 
> may basically get random, possibly overlapping subsets of the total results
> - Even if you add order by `_id`, the database may be changing concurrently 
> to reading and splitting. E.g. if the database contained documents with ids 
> 10 20 30 40 50, and this was split into shards 0..2 and 3..5 (under the 
> assumption that these shards would contain respectively 10 20 30, and 40 50), 
> and then suppose shard 10 20 30 is read and then document 25 is inserted - 
> then the 3..5 shard will read 30 40 50, i.e. document 30 is duplicated and 
> document 25 is lost.
> - Every shard re-executes the query and skips the first start_offset items, 
> which in total is quadratic complexity
> - The query is first executed in the constructor in order to count results, 
> which 1) means the constructor can be super slow and 2) it won't work at all 
> if the database is unavailable at the time the pipeline is constructed (e.g. 
> if this is a template).
> Unfortunately, none of these issues are caught by SourceTestUtils: this class 
> has extensive coverage with it, and the tests pass. This is because the tests 
> return the same results in the same order. I don't know how to catch this 
> automatically, and I don't know how to catch the performance issue 
> automatically, but these would all be important follow-up items after the 
> actual fix.
> CC: [~chamikara] as reviewer.
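The duplication-and-loss scenario described above can be reproduced with a short self-contained simulation (plain Python, no MongoDB required; the document ids and inclusive shard ranges mirror the example in the report):

```python
def read_shard(docs, start, end):
  # Re-execute the whole query (sorted by _id) and take the inclusive
  # index range [start, end], mimicking the splitting scheme in mongodbio.
  return sorted(docs)[start:end + 1]

docs = [10, 20, 30, 40, 50]        # document ids at split time
shards = [(0, 2), (3, 5)]          # index ranges fixed up front

first = read_shard(docs, *shards[0])   # reads [10, 20, 30]
docs.append(25)                        # concurrent insert between reads
second = read_shard(docs, *shards[1])  # reads [30, 40, 50]

# Document 30 is read by both shards; document 25 is read by neither.
print(first, second)
```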



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7866) Python MongoDB IO performance and correctness issues

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7866?focusedWorklogId=288220&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288220
 ]

ASF GitHub Bot logged work on BEAM-7866:


Author: ASF GitHub Bot
Created on: 02/Aug/19 20:12
Start Date: 02/Aug/19 20:12
Worklog Time Spent: 10m 
  Work Description: jkff commented on pull request #9233:  [BEAM-7866] Fix 
python ReadFromMongoDB potential data loss issue
URL: https://github.com/apache/beam/pull/9233#discussion_r310264499
 
 

 ##
 File path: sdks/python/apache_beam/io/mongodbio.py
 ##
 @@ -139,50 +143,77 @@ def __init__(self,
     self.filter = filter
     self.projection = projection
     self.spec = extra_client_params
-    self.doc_count = self._get_document_count()
-    self.avg_doc_size = self._get_avg_document_size()
-    self.client = None
 
   def estimate_size(self):
-    return self.avg_doc_size * self.doc_count
+    with MongoClient(self.uri, **self.spec) as client:
+      size = client[self.db].command('collstats', self.coll).get('size')
+      if size is None or size <= 0:
+        raise ValueError('Collection %s not found or total doc size is '
+                         'incorrect' % self.coll)
+      return size
 
   def split(self, desired_bundle_size, start_position=None,
             stop_position=None):
     # use document cursor index as the start and stop positions
     if start_position is None:
-      start_position = 0
+      epoch = datetime.datetime(1970, 1, 1)
 
 Review comment:
   * Please add a comment that this is an object id smaller than any possible 
actual object id
   * Does Mongo actually guarantee this property? Maybe it makes sense to 
explicitly query for the smallest and largest object id in the database, if it 
can be done quickly?
   * Using start and end position that are far removed from the actual ids of 
present objects risks having most of the splits be empty, or at least having a 
couple of splits at the edges that are difficult to liquid-shard. This is a 
known issue with e.g. bigtable and shuffle sources in Dataflow. In this sense 
too, querying for smallest and largest id would be better.
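For the first bullet, the epoch-based lower bound rests on ObjectId's internal layout. A small sketch, under the assumption (per the BSON spec) that an ObjectId's first 4 bytes are a big-endian Unix timestamp followed by 8 machine/counter bytes; plain byte strings stand in for real ObjectIds here:

```python
import datetime
import struct

# Assumption: ObjectId = 4-byte big-endian Unix timestamp + 8 other bytes.
# Zeroing the trailing 8 bytes yields an id that sorts at or before any id
# generated at that timestamp, so the epoch gives a global lower bound --
# the property the review comment asks about.
def min_object_id_bytes(dt):
  seconds = int((dt - datetime.datetime(1970, 1, 1)).total_seconds())
  return struct.pack('>I', seconds) + b'\x00' * 8

epoch_id = min_object_id_bytes(datetime.datetime(1970, 1, 1))
later_id = min_object_id_bytes(datetime.datetime(2019, 8, 2))
assert len(epoch_id) == 12 and epoch_id < later_id
```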
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 288220)

> Python MongoDB IO performance and correctness issues
> 
>
> Key: BEAM-7866
> URL: https://issues.apache.org/jira/browse/BEAM-7866
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Eugene Kirpichov
>Assignee: Yichi Zhang
>Priority: Blocker
> Fix For: 2.15.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py
>  splits the query result by computing the number of results in the 
> constructor, and then having each reader re-execute the whole query and take 
> an index sub-range of those results.
> This is broken in several critical ways:
> - The order of query results returned by find() is not necessarily 
> deterministic, so the idea of index ranges on it is meaningless: each shard 
> may basically get random, possibly overlapping subsets of the total results
> - Even if you add order by `_id`, the database may be changing concurrently 
> to reading and splitting. E.g. if the database contained documents with ids 
> 10 20 30 40 50, and this was split into shards 0..2 and 3..5 (under the 
> assumption that these shards would contain respectively 10 20 30, and 40 50), 
> and then suppose shard 10 20 30 is read and then document 25 is inserted - 
> then the 3..5 shard will read 30 40 50, i.e. document 30 is duplicated and 
> document 25 is lost.
> - Every shard re-executes the query and skips the first start_offset items, 
> which in total is quadratic complexity
> - The query is first executed in the constructor in order to count results, 
> which 1) means the constructor can be super slow and 2) it won't work at all 
> if the database is unavailable at the time the pipeline is constructed (e.g. 
> if this is a template).
> Unfortunately, none of these issues are caught by SourceTestUtils: this class 
> has extensive coverage with it, and the tests pass. This is because the tests 
> return the same results in the same order. I don't know how to catch this 
> automatically, and I don't know how to catch the performance issue 
> automatically, but these would all be important follow-up items after the 
> actual fix.
> CC: [~chamikara] as reviewer.



--

[jira] [Work logged] (BEAM-7060) Design Py3-compatible typehints annotation support in Beam 3.

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7060?focusedWorklogId=288205&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288205
 ]

ASF GitHub Bot logged work on BEAM-7060:


Author: ASF GitHub Bot
Created on: 02/Aug/19 19:57
Start Date: 02/Aug/19 19:57
Worklog Time Spent: 10m 
  Work Description: udim commented on issue #9223: [BEAM-7060] Introduce 
Python3-only test modules
URL: https://github.com/apache/beam/pull/9223#issuecomment-517826104
 
 
   run python precommit
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 288205)
Time Spent: 7.5h  (was: 7h 20m)

> Design Py3-compatible typehints annotation support in Beam 3.
> -
>
> Key: BEAM-7060
> URL: https://issues.apache.org/jira/browse/BEAM-7060
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Udi Meiri
>Priority: Major
>  Time Spent: 7.5h
>  Remaining Estimate: 0h
>
> Existing [Typehints implementation in 
> Beam|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/typehints/]
> heavily relies on internal details of the CPython implementation, and some of 
> the assumptions of this implementation broke as of Python 3.6; see, for 
> example, https://issues.apache.org/jira/browse/BEAM-6877, which makes 
> typehints support unusable on Python 3.6 as of now. The [Python 3 Kanban 
> Board|https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=245&view=detail]
> lists several specific typehints-related breakages, prefixed with "TypeHints 
> Py3 Error".
> We need to decide whether to:
> - Deprecate the in-house typehints implementation.
> - Continue to support the in-house implementation, which at this point is 
> stale code and has other known issues.
> - Attempt to use some off-the-shelf libraries for supporting type 
> annotations, such as Pytype, Mypy, or PyAnnotate.
> With respect to this decision, we also need to plan immediate next steps to 
> unblock adoption of Beam for Python 3.6+ users. One potential option may be 
> to have the Beam SDK ignore any typehint annotations on Py 3.6+.
> cc: [~udim], [~altay], [~robertwb].



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (BEAM-7724) Codegen should cast(null) to a type to match exact function signature

2019-08-02 Thread sridhar Reddy (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sridhar Reddy resolved BEAM-7724.
-
   Resolution: Won't Fix
Fix Version/s: Not applicable

Ideally, this issue should be addressed in the Calcite repo. It will not be 
fixed in the Beam repo at this time.

> Codegen should cast(null) to a type to match exact function signature
> -
>
> Key: BEAM-7724
> URL: https://issues.apache.org/jira/browse/BEAM-7724
> Project: Beam
>  Issue Type: Improvement
>  Components: dsl-sql
>Reporter: Rui Wang
>Assignee: sridhar Reddy
>Priority: Major
> Fix For: Not applicable
>
>
> If there are two function signatures for the same function name, when the 
> input parameter is null, Janino will throw an exception due to ambiguity:
> A(String)
> A(Integer)
> Janino does not know how to match A(null).



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (BEAM-7724) Codegen should cast(null) to a type to match exact function signature

2019-08-02 Thread sridhar Reddy (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-7724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16899162#comment-16899162
 ] 

sridhar Reddy commented on BEAM-7724:
-

As per the above discussion, I will be closing this issue.

> Codegen should cast(null) to a type to match exact function signature
> -
>
> Key: BEAM-7724
> URL: https://issues.apache.org/jira/browse/BEAM-7724
> Project: Beam
>  Issue Type: Improvement
>  Components: dsl-sql
>Reporter: Rui Wang
>Assignee: sridhar Reddy
>Priority: Major
>
> If there are two function signatures for the same function name, when the 
> input parameter is null, Janino will throw an exception due to ambiguity:
> A(String)
> A(Integer)
> Janino does not know how to match A(null).



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7528) Save correctly Python Load Tests metrics according to its namespace

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7528?focusedWorklogId=288186&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288186
 ]

ASF GitHub Bot logged work on BEAM-7528:


Author: ASF GitHub Bot
Created on: 02/Aug/19 19:11
Start Date: 02/Aug/19 19:11
Worklog Time Spent: 10m 
  Work Description: pabloem commented on pull request #8941: [BEAM-7528] 
Save load test metrics according to distribution name
URL: https://github.com/apache/beam/pull/8941#discussion_r310259723
 
 

 ##
 File path: 
sdks/python/apache_beam/testing/load_tests/load_test_metrics_utils.py
 ##
 @@ -138,8 +143,25 @@ def as_dict(self):
 class CounterMetric(Metric):
   def __init__(self, counter_dict, submit_timestamp, metric_id):
     super(CounterMetric, self).__init__(submit_timestamp, metric_id)
-    self.value = counter_dict.committed
     self.label = str(counter_dict.key.metric.name)
+    self.value = counter_dict.committed
+
+
+class DistributionMetrics(Metric):
+  def __init__(self, dists, submit_timestamp, metric_id, key):
 
 Review comment:
   Distributions are unique by name, namespace, and step. That means that 
multiple steps can collect separate metrics with the same name. We could use 
the namespace to separate them (e.g. "BeamPerfNamespace" or whatever).
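A hypothetical sketch of the point above: since a distribution is unique by (namespace, name, step), a stable storage label should combine all three. The function name and label format here are illustrative, not Beam's actual API:

```python
# Hypothetical helper: combine the three components that make a
# distribution metric unique into one stable label.
def distribution_label(namespace, name, step):
  return '{}:{}:{}'.format(namespace, name, step)

# Two steps collecting a metric with the same name stay distinguishable.
print(distribution_label('BeamPerfNamespace', 'runtime', 'Step1'))
```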
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 288186)
Time Spent: 6.5h  (was: 6h 20m)

> Save correctly Python Load Tests metrics according to its namespace
> 
>
> Key: BEAM-7528
> URL: https://issues.apache.org/jira/browse/BEAM-7528
> Project: Beam
>  Issue Type: Bug
>  Components: testing
>Reporter: Kasia Kucharczyk
>Assignee: Kasia Kucharczyk
>Priority: Major
>  Time Spent: 6.5h
>  Remaining Estimate: 0h
>
> The load test framework considers every distribution metric defined in a 
> pipeline to be a `runtime` metric (which is defined by the load test 
> framework), while only the `runtime` distribution metric should be treated 
> as runtime.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7528) Save correctly Python Load Tests metrics according to its namespace

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7528?focusedWorklogId=288185&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288185
 ]

ASF GitHub Bot logged work on BEAM-7528:


Author: ASF GitHub Bot
Created on: 02/Aug/19 19:11
Start Date: 02/Aug/19 19:11
Worklog Time Spent: 10m 
  Work Description: pabloem commented on pull request #8941: [BEAM-7528] 
Save load test metrics according to distribution name
URL: https://github.com/apache/beam/pull/8941#discussion_r310263173
 
 

 ##
 File path: 
sdks/python/apache_beam/testing/load_tests/load_test_metrics_utils.py
 ##
 @@ -138,8 +143,25 @@ def as_dict(self):
 class CounterMetric(Metric):
   def __init__(self, counter_dict, submit_timestamp, metric_id):
     super(CounterMetric, self).__init__(submit_timestamp, metric_id)
-    self.value = counter_dict.committed
     self.label = str(counter_dict.key.metric.name)
+    self.value = counter_dict.committed
+
+
+class DistributionMetrics(Metric):
 
 Review comment:
   I am okay with either 1 or 2. Do we have full metric names 
(step + namespace + metric)?
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 288185)
Time Spent: 6h 20m  (was: 6h 10m)

> Save correctly Python Load Tests metrics according to its namespace
> 
>
> Key: BEAM-7528
> URL: https://issues.apache.org/jira/browse/BEAM-7528
> Project: Beam
>  Issue Type: Bug
>  Components: testing
>Reporter: Kasia Kucharczyk
>Assignee: Kasia Kucharczyk
>Priority: Major
>  Time Spent: 6h 20m
>  Remaining Estimate: 0h
>
> The load test framework considers every distribution metric defined in a 
> pipeline to be a `runtime` metric (which is defined by the load test 
> framework), while only the `runtime` distribution metric should be treated 
> as runtime.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Comment Edited] (BEAM-3645) Support multi-process execution on the FnApiRunner

2019-08-02 Thread Hannah Jiang (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-3645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16896575#comment-16896575
 ] 

Hannah Jiang edited comment on BEAM-3645 at 8/2/19 7:03 PM:


The direct runner can now process map tasks across multiple workers. Depending 
on the running environment, these workers run in multithreading or 
multiprocessing mode.

*How to run with multiprocessing mode*:
{code:java}
# using subprocess runner
p = beam.Pipeline(options=pipeline_options,
                  runner=fn_api_runner.FnApiRunner(
                      default_environment=beam_runner_api_pb2.Environment(
                          urn=python_urns.SUBPROCESS_SDK,
                          payload=b'%s -m apache_beam.runners.worker.sdk_worker_main' %
                          sys.executable.encode('ascii'))))
{code}
 

*How to run with multithreading mode:*
{code:java}
# using in memory embedded runner
p = beam.Pipeline(options=pipeline_options)
{code}
{code:java}
# using embedded grpc runner
p = beam.Pipeline(options=pipeline_options,
  runner=fn_api_runner.FnApiRunner(
    default_environment=beam_runner_api_pb2.Environment(
      urn=python_urns.EMBEDDED_PYTHON_GRPC,
      payload=b'1'))) # payload is # of threads of each worker.{code}
 

The *--direct_num_workers* option controls the number of workers. The default 
value is 1.
{code:java}
# an example to pass it from CLI.
python wordcount.py --input xx --output xx --direct_num_workers 2

# an example to set it with PipelineOptions.
from apache_beam.options.pipeline_options import PipelineOptions
pipeline_options = PipelineOptions(['--direct_num_workers', '2'])

# an example adding it to existing pipeline options.
from apache_beam.options.pipeline_options import DirectOptions
pipeline_options = xxx
pipeline_options.view_as(DirectOptions).direct_num_workers = 2{code}


was (Author: hannahjiang):
Direct runner can now process map tasks across multiple workers. Depending on 
running environment, these workers are running in multithreading or 
multiprocessing mode.

*How to run with multiprocessing mode*:
{code:java}
# using subprocess runner
p = beam.Pipeline(options=pipeline_options,
                  runner=fn_api_runner.FnApiRunner(
                      default_environment=beam_runner_api_pb2.Environment(
                          urn=python_urns.SUBPROCESS_SDK,
                          payload=b'%s -m apache_beam.runners.worker.sdk_worker_main' %
                          sys.executable.encode('ascii'))))
{code}
 

*How to run with multithreading mode:*
{code:java}
# using in memory embedded runner
p = beam.Pipeline(options=pipeline_options)
{code}
{code:java}
# using embedded grpc runner
p = beam.Pipeline(options=pipeline_options,
  runner=fn_api_runner.FnApiRunner(
    default_environment=beam_runner_api_pb2.Environment(
      urn=python_urns.EMBEDDED_PYTHON_GRPC,
      payload=b'1'))) # payload is # of threads of each worker.{code}
 

*--direct_num_workers* option is used to control number of workers. Default 
value is 1. 
{code:java}
# an example to pass it from CLI.
python wordcount.py --input xx --output xx --direct_num_workers 2

# an example to set it with PipelineOptions.
from apache_beam.options.pipeline_options import PipelineOptions
pipeline_options = PipelineOptions(['--direct_num_workers', '2'])

# an example add it to existing pipeline options.
pipeline_options = xxx
pipeline_options.view_as(DirectOptions).direct_num_workers = 2{code}

> Support multi-process execution on the FnApiRunner
> --
>
> Key: BEAM-3645
> URL: https://issues.apache.org/jira/browse/BEAM-3645
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py-core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Charles Chen
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: 2.15.0
>
>  Time Spent: 35h 20m
>  Remaining Estimate: 0h
>
> https://issues.apache.org/jira/browse/BEAM-3644 gave us a 15x performance 
> gain over the previous DirectRunner.  We can do even better in multi-core 
> environments by supporting multi-process execution in the FnApiRunner, to 
> scale past Python GIL limitations.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Comment Edited] (BEAM-3645) Support multi-process execution on the FnApiRunner

2019-08-02 Thread Hannah Jiang (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-3645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16896575#comment-16896575
 ] 

Hannah Jiang edited comment on BEAM-3645 at 8/2/19 7:02 PM:


The direct runner can now process map tasks across multiple workers. Depending 
on the running environment, these workers run in multithreading or 
multiprocessing mode.

*How to run with multiprocessing mode*:
{code:java}
# using subprocess runner
p = beam.Pipeline(options=pipeline_options,
                  runner=fn_api_runner.FnApiRunner(
                      default_environment=beam_runner_api_pb2.Environment(
                          urn=python_urns.SUBPROCESS_SDK,
                          payload=b'%s -m apache_beam.runners.worker.sdk_worker_main' %
                          sys.executable.encode('ascii'))))
{code}
 

*How to run with multithreading mode:*
{code:java}
# using in memory embedded runner
p = beam.Pipeline(options=pipeline_options)
{code}
{code:java}
# using embedded grpc runner
p = beam.Pipeline(options=pipeline_options,
  runner=fn_api_runner.FnApiRunner(
    default_environment=beam_runner_api_pb2.Environment(
      urn=python_urns.EMBEDDED_PYTHON_GRPC,
      payload=b'1'))) # payload is # of threads of each worker.{code}
 

The *--direct_num_workers* option controls the number of workers. The default 
value is 1.
{code:java}
# an example to pass it from CLI.
python wordcount.py --input xx --output xx --direct_num_workers 2

# an example to set it with PipelineOptions.
from apache_beam.options.pipeline_options import PipelineOptions
pipeline_options = PipelineOptions(['--direct_num_workers', '2'])

# an example adding it to existing pipeline options.
pipeline_options = xxx
pipeline_options.view_as(DirectOptions).direct_num_workers = 2{code}


was (Author: hannahjiang):
Direct runner can now process map tasks across multiple workers. Depending on 
running environment, these workers are running in multithreading or 
multiprocessing mode.

*How to run with multiprocessing mode*:
{code:java}
# using subprocess runner
p = beam.Pipeline(options=pipeline_options,
                  runner=fn_api_runner.FnApiRunner(
                      default_environment=beam_runner_api_pb2.Environment(
                          urn=python_urns.SUBPROCESS_SDK,
                          payload=b'%s -m apache_beam.runners.worker.sdk_worker_main' %
                          sys.executable.encode('ascii'))))
{code}
 

*How to run with multithreading mode:*
{code:java}
# using in memory embedded runner
p = beam.Pipeline(options=pipeline_options)
{code}
{code:java}
# using embedded grpc runner
p = beam.Pipeline(options=pipeline_options,
  runner=fn_api_runner.FnApiRunner(
    default_environment=beam_runner_api_pb2.Environment(
      urn=python_urns.EMBEDDED_PYTHON_GRPC,
      payload=b'1'))) # payload is # of threads of each worker.{code}
 

*--direct_num_workers* option is used to control number of workers. Default 
value is 1. 
{code:java}
# an example to pass it from CLI.
python wordcount.py --input xx --output xx --direct_num_workers 2

# an example to set it with pipeline options directly.
from apache_beam.options.pipeline_options import PipelineOptions
pipeline_options = PipelineOptions(['--direct_num_workers', '2'])

# an example add it to existing pipeline options.
pipeline_options = xxx
pipeline_options.view_as(DirectOptions).direct_num_workers = 2{code}

> Support multi-process execution on the FnApiRunner
> --
>
> Key: BEAM-3645
> URL: https://issues.apache.org/jira/browse/BEAM-3645
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py-core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Charles Chen
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: 2.15.0
>
>  Time Spent: 35h 20m
>  Remaining Estimate: 0h
>
> https://issues.apache.org/jira/browse/BEAM-3644 gave us a 15x performance 
> gain over the previous DirectRunner.  We can do even better in multi-core 
> environments by supporting multi-process execution in the FnApiRunner, to 
> scale past Python GIL limitations.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7060) Design Py3-compatible typehints annotation support in Beam 3.

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7060?focusedWorklogId=288173&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288173
 ]

ASF GitHub Bot logged work on BEAM-7060:


Author: ASF GitHub Bot
Created on: 02/Aug/19 18:55
Start Date: 02/Aug/19 18:55
Worklog Time Spent: 10m 
  Work Description: tvalentyn commented on issue #9223: [BEAM-7060] 
Introduce Python3-only test modules
URL: https://github.com/apache/beam/pull/9223#issuecomment-517809174
 
 
   A similar change was reverted because it broke postcommits:  
https://github.com/apache/beam/pull/8505#issuecomment-498441270
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 288173)
Time Spent: 7h 20m  (was: 7h 10m)

> Design Py3-compatible typehints annotation support in Beam 3.
> -
>
> Key: BEAM-7060
> URL: https://issues.apache.org/jira/browse/BEAM-7060
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Udi Meiri
>Priority: Major
>  Time Spent: 7h 20m
>  Remaining Estimate: 0h
>
> Existing [Typehints implementation in 
> Beam|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/typehints/]
> heavily relies on internal details of the CPython implementation, and some of 
> the assumptions of this implementation broke as of Python 3.6; see, for 
> example, https://issues.apache.org/jira/browse/BEAM-6877, which makes 
> typehints support unusable on Python 3.6 as of now. The [Python 3 Kanban 
> Board|https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=245&view=detail]
> lists several specific typehints-related breakages, prefixed with "TypeHints 
> Py3 Error".
> We need to decide whether to:
> - Deprecate the in-house typehints implementation.
> - Continue to support the in-house implementation, which at this point is 
> stale code and has other known issues.
> - Attempt to use some off-the-shelf libraries for supporting type 
> annotations, such as Pytype, Mypy, or PyAnnotate.
> With respect to this decision, we also need to plan immediate next steps to 
> unblock adoption of Beam for Python 3.6+ users. One potential option may be 
> to have the Beam SDK ignore any typehint annotations on Py 3.6+.
> cc: [~udim], [~altay], [~robertwb].



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7060) Design Py3-compatible typehints annotation support in Beam 3.

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7060?focusedWorklogId=288171&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288171
 ]

ASF GitHub Bot logged work on BEAM-7060:


Author: ASF GitHub Bot
Created on: 02/Aug/19 18:53
Start Date: 02/Aug/19 18:53
Worklog Time Spent: 10m 
  Work Description: udim commented on issue #9223: [BEAM-7060] Introduce 
Python3-only test modules
URL: https://github.com/apache/beam/pull/9223#issuecomment-517808582
 
 
   @tvalentyn I believe pre-commits cover all the changes in this PR.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 288171)
Time Spent: 7h 10m  (was: 7h)

> Design Py3-compatible typehints annotation support in Beam 3.
> -
>
> Key: BEAM-7060
> URL: https://issues.apache.org/jira/browse/BEAM-7060
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Udi Meiri
>Priority: Major
>  Time Spent: 7h 10m
>  Remaining Estimate: 0h
>
> Existing [Typehints implementation in 
> Beam|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/typehints/]
> heavily relies on internal details of the CPython implementation, and some of 
> the assumptions of this implementation broke as of Python 3.6; see, for 
> example, https://issues.apache.org/jira/browse/BEAM-6877, which makes 
> typehints support unusable on Python 3.6 as of now. The [Python 3 Kanban 
> Board|https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=245&view=detail]
> lists several specific typehints-related breakages, prefixed with "TypeHints 
> Py3 Error".
> We need to decide whether to:
> - Deprecate the in-house typehints implementation.
> - Continue to support the in-house implementation, which at this point is 
> stale code and has other known issues.
> - Attempt to use some off-the-shelf libraries for supporting type 
> annotations, such as Pytype, Mypy, or PyAnnotate.
> With respect to this decision, we also need to plan immediate next steps to 
> unblock adoption of Beam for Python 3.6+ users. One potential option may be 
> to have the Beam SDK ignore any typehint annotations on Py 3.6+.
> cc: [~udim], [~altay], [~robertwb].



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7476) Datastore write failures with "[Errno 32] Broken pipe"

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7476?focusedWorklogId=288170&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288170
 ]

ASF GitHub Bot logged work on BEAM-7476:


Author: ASF GitHub Bot
Created on: 02/Aug/19 18:51
Start Date: 02/Aug/19 18:51
Worklog Time Spent: 10m 
  Work Description: udim commented on pull request #8346: [BEAM-7476] Retry 
Datastore writes on [Errno 32] Broken pipe
URL: https://github.com/apache/beam/pull/8346
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 288170)
Time Spent: 1.5h  (was: 1h 20m)

> Datastore write failures with "[Errno 32] Broken pipe"
> --
>
> Key: BEAM-7476
> URL: https://issues.apache.org/jira/browse/BEAM-7476
> Project: Beam
>  Issue Type: Bug
>  Components: io-python-gcp
>Affects Versions: 2.11.0
> Environment: dataflow python 2.7
>Reporter: Dmytro Sadovnychyi
>Assignee: Dmytro Sadovnychyi
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> We are getting lots of Broken pipe errors, and it's only a matter of luck for 
> a write to succeed. This has been happening for months.
> Partial stack trace:
> File 
> "/usr/local/lib/python2.7/dist-packages/apache_beam/io/gcp/datastore/v1/helper.py",
>  line 225, in commit
> response = datastore.commit(request)
>   File 
> "/usr/local/lib/python2.7/dist-packages/googledatastore/connection.py", line 
> 140, in commit
> datastore_pb2.CommitResponse)
>   File 
> "/usr/local/lib/python2.7/dist-packages/googledatastore/connection.py", line 
> 199, in _call_method
> method='POST', body=payload, headers=headers)
>   File "/usr/local/lib/python2.7/dist-packages/oauth2client/transport.py", 
> line 169, in new_request
> redirections, connection_type)
>   File "/usr/local/lib/python2.7/dist-packages/httplib2/__init__.py", line 
> 1609, in request
> (response, content) = self._request(conn, authority, uri, request_uri, 
> method, body, headers, redirections, cachekey)
>   File "/usr/local/lib/python2.7/dist-packages/httplib2/__init__.py", line 
> 1351, in _request
> (response, content) = self._conn_request(conn, request_uri, method, body, 
> headers)
>   File "/usr/local/lib/python2.7/dist-packages/httplib2/__init__.py", line 
> 1273, in _conn_request
> conn.request(method, request_uri, body, headers)
>   File "/usr/lib/python2.7/httplib.py", line 1042, in request
> self._send_request(method, url, body, headers)
>   File "/usr/lib/python2.7/httplib.py", line 1082, in _send_request
> self.endheaders(body)
>   File "/usr/lib/python2.7/httplib.py", line 1038, in endheaders
> self._send_output(message_body)
>   File "/usr/lib/python2.7/httplib.py", line 882, in _send_output
> self.send(msg)
>   File "/usr/lib/python2.7/httplib.py", line 858, in send
> self.sock.sendall(data)
>   File "/usr/lib/python2.7/ssl.py", line 753, in sendall
> v = self.send(data[count:])
>   File "/usr/lib/python2.7/ssl.py", line 719, in send
> v = self._sslobj.write(data)
> RuntimeError: error: [Errno 32] Broken pipe [while running 'Groups to 
> datastore/Write Mutation to Datastore']
> Workaround: https://github.com/apache/beam/pull/8346
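Until the linked workaround lands, a pipeline-side mitigation is to retry the commit when the socket reports a broken pipe. A minimal sketch follows; `commit_with_retry`, `send_fn`, and the retry parameters are hypothetical stand-ins for the `datastore.commit` call in the stack trace above, not part of the Beam or googledatastore APIs:

```python
import errno
import time

def commit_with_retry(send_fn, request, attempts=3, backoff=0.5):
    # send_fn stands in for the datastore.commit call from the stack trace;
    # retry only when the failure looks like EPIPE / "Broken pipe".
    for attempt in range(attempts):
        try:
            return send_fn(request)
        except (OSError, RuntimeError) as e:
            broken = (getattr(e, "errno", None) == errno.EPIPE
                      or "Broken pipe" in str(e))
            if not broken or attempt == attempts - 1:
                raise  # non-pipe error, or retries exhausted
            time.sleep(backoff * (2 ** attempt))  # exponential backoff
```

Since the error is transient, a couple of retries with backoff is usually enough; anything else should still surface to the caller.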



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (BEAM-7884) Run pylint in Python 3

2019-08-02 Thread Udi Meiri (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-7884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16899137#comment-16899137
 ] 

Udi Meiri commented on BEAM-7884:
-

Optional step 0:
Since files ending with py3.py (introduced in 
https://github.com/apache/beam/pull/9223) are currently excluded from the 
run_pylint.sh checks (which only run on Python 2),
the highest priority would be to check those files.

> Run pylint in Python 3
> --
>
> Key: BEAM-7884
> URL: https://issues.apache.org/jira/browse/BEAM-7884
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core
>Reporter: Udi Meiri
>Priority: Major
>
> Currently, run_pylint.sh only runs in the tox environment py27-lint.
> py35-lint uses a separate script named run_mini_py3lint.sh.
> 1. Make run_pylint.sh run and pass for py35-lint. This might require adding 
> some ignores to .pylintrc (such as useless-object-inheritance) and renaming 
> some deprecated functions (such as 
> s/assertRaisesRegexp/assertRaisesRegex/).
> 2. Make sure all checks in run_mini_py3lint.sh are in run_pylint.sh, then 
> remove run_mini_py3lint.sh.
> 3. (optional) Remove py27-lint?
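The rename in step 1 can be seen in a self-contained snippet: assertRaisesRegexp is deprecated under Python 3 in favor of assertRaisesRegex. The test case below is illustrative only, not from the Beam test suite:

```python
import unittest

class RenameExample(unittest.TestCase):
    # Under a Python 3 lint/run, assertRaisesRegexp emits a deprecation
    # warning; assertRaisesRegex is the spelling referenced in step 1.
    def test_raises_regex(self):
        with self.assertRaisesRegex(ValueError, "bad"):
            raise ValueError("bad value")
```

The old alias still works in Python 3 but warns, which is why the rename can be done ahead of enabling py35-lint.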



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7577) Allow the use of ValueProviders in datastore.v1new.datastoreio.ReadFromDatastore query

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7577?focusedWorklogId=288165&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288165
 ]

ASF GitHub Bot logged work on BEAM-7577:


Author: ASF GitHub Bot
Created on: 02/Aug/19 18:45
Start Date: 02/Aug/19 18:45
Worklog Time Spent: 10m 
  Work Description: udim commented on pull request #8950: [BEAM-7577] Allow 
ValueProviders in Datastore Query filters
URL: https://github.com/apache/beam/pull/8950
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 288165)
Time Spent: 6h 40m  (was: 6.5h)

> Allow the use of ValueProviders in 
> datastore.v1new.datastoreio.ReadFromDatastore query
> --
>
> Key: BEAM-7577
> URL: https://issues.apache.org/jira/browse/BEAM-7577
> Project: Beam
>  Issue Type: New Feature
>  Components: io-python-gcp
>Affects Versions: 2.13.0
>Reporter: EDjur
>Assignee: EDjur
>Priority: Minor
>  Time Spent: 6h 40m
>  Remaining Estimate: 0h
>
> The current implementation of ReadFromDatastore does not support specifying 
> the query parameter at runtime. This could potentially be fixed by using a 
> ValueProvider to specify and build the Datastore query.
> Allowing the query to be specified at runtime makes it easier to use dynamic 
> queries in Dataflow templates. Currently, there is no way to have a Dataflow 
> template that includes a dynamic query (such as one filtering by a 
> timestamp).
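The deferred-evaluation pattern behind this request can be sketched without Beam. The class below is a simplified stand-in modeled on Beam's ValueProvider interface, and `build_query` with its filter shape is hypothetical; the point is that the filter value is resolved via .get() at runtime rather than frozen at template construction:

```python
class StaticValueProvider:
    """Simplified stand-in for Beam's ValueProvider interface:
    the wrapped value is only resolved when .get() is called."""
    def __init__(self, value):
        self._value = value

    def get(self):
        return self._value

def build_query(kind, min_timestamp_provider):
    # Deferring .get() to query-build time is what would let a Dataflow
    # template accept a dynamic timestamp filter at runtime.
    return {"kind": kind,
            "filters": [("created", ">=", min_timestamp_provider.get())]}
```

In a real template, a RuntimeValueProvider populated from a template parameter would take the place of the static value here.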



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7060) Design Py3-compatible typehints annotation support in Beam 3.

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7060?focusedWorklogId=288163&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288163
 ]

ASF GitHub Bot logged work on BEAM-7060:


Author: ASF GitHub Bot
Created on: 02/Aug/19 18:43
Start Date: 02/Aug/19 18:43
Worklog Time Spent: 10m 
  Work Description: udim commented on pull request #9223: [BEAM-7060] 
Introduce Python3-only test modules
URL: https://github.com/apache/beam/pull/9223#discussion_r310254106
 
 

 ##
 File path: sdks/python/scripts/run_pylint.sh
 ##
 @@ -60,23 +60,34 @@ EXCLUDED_GENERATED_FILES=(
 apache_beam/portability/api/*pb2*.py
 )
 
+PYTHON_MAJOR=$(python -c 'import sys; print(sys.version_info[0])')
+if [[ "${PYTHON_MAJOR}" == 2 ]]; then
+  EXCLUDED_PY3_FILES=$(find ${MODULE} | grep 'py3\.py$')
+  echo -e "Excluding Py3 files:\n${EXCLUDED_PY3_FILES}"
+else
+  EXCLUDED_PY3_FILES=""
+fi
+
 FILES_TO_IGNORE=""
-for file in "${EXCLUDED_GENERATED_FILES[@]}"; do
+for file in "${EXCLUDED_GENERATED_FILES[@]}" ${EXCLUDED_PY3_FILES}; do
 
 Review comment:
   Yes, see: https://scans.gradle.com/s/544uyirsdn6n4/console-log#L206-L209
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 288163)
Time Spent: 7h  (was: 6h 50m)

> Design Py3-compatible typehints annotation support in Beam 3.
> -
>
> Key: BEAM-7060
> URL: https://issues.apache.org/jira/browse/BEAM-7060
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Udi Meiri
>Priority: Major
>  Time Spent: 7h
>  Remaining Estimate: 0h
>
> Existing [Typehints implementation in 
> Beam|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/typehints/]
> heavily relies on internal details of the CPython implementation, and some of 
> the assumptions of this implementation broke as of Python 3.6; see, for 
> example, https://issues.apache.org/jira/browse/BEAM-6877, which makes 
> typehints support unusable on Python 3.6 as of now. The [Python 3 Kanban 
> Board|https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=245&view=detail]
> lists several specific typehints-related breakages, prefixed with "TypeHints 
> Py3 Error".
> We need to decide whether to:
> - Deprecate the in-house typehints implementation.
> - Continue to support the in-house implementation, which at this point is 
> stale code and has other known issues.
> - Attempt to use some off-the-shelf libraries for supporting 
> type-annotations, such as Pytype, Mypy, or PyAnnotate.
> With respect to this decision, we also need to plan immediate next steps to 
> unblock adoption of Beam for Python 3.6+ users. One potential option may be 
> to have the Beam SDK ignore any typehint annotations on Py 3.6+.
> cc: [~udim], [~altay], [~robertwb].
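One concrete way such CPython-internals assumptions break: modern Python 3 releases reject isinstance checks against subscripted typing generics, so inspection code that treats List[int] like an ordinary class raises instead of returning an answer. A minimal stdlib illustration, independent of Beam's typehints code:

```python
import typing

hint = typing.List[int]
try:
    # Subscripted generics cannot be used with isinstance in Python 3;
    # code written against older CPython/typing internals assumed this
    # kind of check would just work.
    isinstance([1, 2], hint)
    check_raised = False
except TypeError:
    check_raised = True
```

This is the sort of breakage a rewrite on top of the public typing API (or an external checker such as Mypy) would have to paper over.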



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7060) Design Py3-compatible typehints annotation support in Beam 3.

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7060?focusedWorklogId=288157&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288157
 ]

ASF GitHub Bot logged work on BEAM-7060:


Author: ASF GitHub Bot
Created on: 02/Aug/19 18:36
Start Date: 02/Aug/19 18:36
Worklog Time Spent: 10m 
  Work Description: udim commented on pull request #9223: [BEAM-7060] 
Introduce Python3-only test modules
URL: https://github.com/apache/beam/pull/9223#discussion_r310251805
 
 

 ##
 File path: sdks/python/scripts/run_pylint.sh
 ##
 @@ -60,23 +60,34 @@ EXCLUDED_GENERATED_FILES=(
 apache_beam/portability/api/*pb2*.py
 )
 
+PYTHON_MAJOR=$(python -c 'import sys; print(sys.version_info[0])')
 
 Review comment:
   done: https://issues.apache.org/jira/browse/BEAM-7884
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 288157)
Time Spent: 6h 50m  (was: 6h 40m)

> Design Py3-compatible typehints annotation support in Beam 3.
> -
>
> Key: BEAM-7060
> URL: https://issues.apache.org/jira/browse/BEAM-7060
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Udi Meiri
>Priority: Major
>  Time Spent: 6h 50m
>  Remaining Estimate: 0h
>
> Existing [Typehints implementation in 
> Beam|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/typehints/]
> heavily relies on internal details of the CPython implementation, and some of 
> the assumptions of this implementation broke as of Python 3.6; see, for 
> example, https://issues.apache.org/jira/browse/BEAM-6877, which makes 
> typehints support unusable on Python 3.6 as of now. The [Python 3 Kanban 
> Board|https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=245&view=detail]
> lists several specific typehints-related breakages, prefixed with "TypeHints 
> Py3 Error".
> We need to decide whether to:
> - Deprecate the in-house typehints implementation.
> - Continue to support the in-house implementation, which at this point is 
> stale code and has other known issues.
> - Attempt to use some off-the-shelf libraries for supporting 
> type-annotations, such as Pytype, Mypy, or PyAnnotate.
> With respect to this decision, we also need to plan immediate next steps to 
> unblock adoption of Beam for Python 3.6+ users. One potential option may be 
> to have the Beam SDK ignore any typehint annotations on Py 3.6+.
> cc: [~udim], [~altay], [~robertwb].



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (BEAM-7884) Run pylint in Python 3

2019-08-02 Thread Udi Meiri (JIRA)
Udi Meiri created BEAM-7884:
---

 Summary: Run pylint in Python 3
 Key: BEAM-7884
 URL: https://issues.apache.org/jira/browse/BEAM-7884
 Project: Beam
  Issue Type: Sub-task
  Components: sdk-py-core
Reporter: Udi Meiri


Currently, run_pylint.sh only runs in the tox environment py27-lint.
py35-lint uses a separate script named run_mini_py3lint.sh.

1. Make run_pylint.sh run and pass for py35-lint. This might require adding 
some ignores to .pylintrc (such as useless-object-inheritance) and renaming 
some deprecated functions (such as 
s/assertRaisesRegexp/assertRaisesRegex/).
2. Make sure all checks in run_mini_py3lint.sh are in run_pylint.sh, then 
remove run_mini_py3lint.sh.
3. (optional) Remove py27-lint?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7868) Hidden Flink Runner parameters are dropped in python pipelines

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7868?focusedWorklogId=288156&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288156
 ]

ASF GitHub Bot logged work on BEAM-7868:


Author: ASF GitHub Bot
Created on: 02/Aug/19 18:34
Start Date: 02/Aug/19 18:34
Worklog Time Spent: 10m 
  Work Description: angoenka commented on pull request #9211: [BEAM-7868] 
optionally skip hidden pipeline options
URL: https://github.com/apache/beam/pull/9211
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 288156)
Time Spent: 0.5h  (was: 20m)

> Hidden Flink Runner parameters are dropped in python pipelines
> --
>
> Key: BEAM-7868
> URL: https://issues.apache.org/jira/browse/BEAM-7868
> Project: Beam
>  Issue Type: Bug
>  Components: runner-flink
>Reporter: Ankur Goenka
>Assignee: Ankur Goenka
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Hidden pipeline options for the portable Flink runner are not interpreted by 
> the Python SDK.
> An example is 
> ManualDockerEnvironmentOptions.getRetainDockerContainers, which the Python 
> SDK does not interpret.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7060) Design Py3-compatible typehints annotation support in Beam 3.

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7060?focusedWorklogId=288155&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288155
 ]

ASF GitHub Bot logged work on BEAM-7060:


Author: ASF GitHub Bot
Created on: 02/Aug/19 18:29
Start Date: 02/Aug/19 18:29
Worklog Time Spent: 10m 
  Work Description: tvalentyn commented on issue #9223: [BEAM-7060] 
Introduce Python3-only test modules
URL: https://github.com/apache/beam/pull/9223#issuecomment-517801554
 
 
   Run Python 2 PostCommit
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 288155)
Time Spent: 6h 40m  (was: 6.5h)

> Design Py3-compatible typehints annotation support in Beam 3.
> -
>
> Key: BEAM-7060
> URL: https://issues.apache.org/jira/browse/BEAM-7060
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Udi Meiri
>Priority: Major
>  Time Spent: 6h 40m
>  Remaining Estimate: 0h
>
> Existing [Typehints implementation in 
> Beam|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/typehints/]
> heavily relies on internal details of the CPython implementation, and some of 
> the assumptions of this implementation broke as of Python 3.6; see, for 
> example, https://issues.apache.org/jira/browse/BEAM-6877, which makes 
> typehints support unusable on Python 3.6 as of now. The [Python 3 Kanban 
> Board|https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=245&view=detail]
> lists several specific typehints-related breakages, prefixed with "TypeHints 
> Py3 Error".
> We need to decide whether to:
> - Deprecate the in-house typehints implementation.
> - Continue to support the in-house implementation, which at this point is 
> stale code and has other known issues.
> - Attempt to use some off-the-shelf libraries for supporting 
> type-annotations, such as Pytype, Mypy, or PyAnnotate.
> With respect to this decision, we also need to plan immediate next steps to 
> unblock adoption of Beam for Python 3.6+ users. One potential option may be 
> to have the Beam SDK ignore any typehint annotations on Py 3.6+.
> cc: [~udim], [~altay], [~robertwb].



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7060) Design Py3-compatible typehints annotation support in Beam 3.

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7060?focusedWorklogId=288154&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288154
 ]

ASF GitHub Bot logged work on BEAM-7060:


Author: ASF GitHub Bot
Created on: 02/Aug/19 18:28
Start Date: 02/Aug/19 18:28
Worklog Time Spent: 10m 
  Work Description: tvalentyn commented on pull request #9223: [BEAM-7060] 
Introduce Python3-only test modules
URL: https://github.com/apache/beam/pull/9223#discussion_r310248802
 
 

 ##
 File path: sdks/python/scripts/run_pylint.sh
 ##
 @@ -60,23 +60,34 @@ EXCLUDED_GENERATED_FILES=(
 apache_beam/portability/api/*pb2*.py
 )
 
+PYTHON_MAJOR=$(python -c 'import sys; print(sys.version_info[0])')
+if [[ "${PYTHON_MAJOR}" == 2 ]]; then
+  EXCLUDED_PY3_FILES=$(find ${MODULE} | grep 'py3\.py$')
+  echo -e "Excluding Py3 files:\n${EXCLUDED_PY3_FILES}"
+else
+  EXCLUDED_PY3_FILES=""
+fi
+
 FILES_TO_IGNORE=""
-for file in "${EXCLUDED_GENERATED_FILES[@]}"; do
+for file in "${EXCLUDED_GENERATED_FILES[@]}" ${EXCLUDED_PY3_FILES}; do
 
 Review comment:
   Does this loop work as intended when $EXCLUDED_PY3_FILES has more than one 
file?
   
   Alternative solution: 
https://github.com/apache/beam/pull/8505/files#diff-d91ac373b8d2235683a85576912b5db5R66
   which requires: 
https://github.com/apache/beam/pull/8505/files#diff-d91ac373b8d2235683a85576912b5db5R87
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 288154)
Time Spent: 6.5h  (was: 6h 20m)

> Design Py3-compatible typehints annotation support in Beam 3.
> -
>
> Key: BEAM-7060
> URL: https://issues.apache.org/jira/browse/BEAM-7060
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Udi Meiri
>Priority: Major
>  Time Spent: 6.5h
>  Remaining Estimate: 0h
>
> Existing [Typehints implementation in 
> Beam|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/typehints/]
> heavily relies on internal details of the CPython implementation, and some of 
> the assumptions of this implementation broke as of Python 3.6; see, for 
> example, https://issues.apache.org/jira/browse/BEAM-6877, which makes 
> typehints support unusable on Python 3.6 as of now. The [Python 3 Kanban 
> Board|https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=245&view=detail]
> lists several specific typehints-related breakages, prefixed with "TypeHints 
> Py3 Error".
> We need to decide whether to:
> - Deprecate the in-house typehints implementation.
> - Continue to support the in-house implementation, which at this point is 
> stale code and has other known issues.
> - Attempt to use some off-the-shelf libraries for supporting 
> type-annotations, such as Pytype, Mypy, or PyAnnotate.
> With respect to this decision, we also need to plan immediate next steps to 
> unblock adoption of Beam for Python 3.6+ users. One potential option may be 
> to have the Beam SDK ignore any typehint annotations on Py 3.6+.
> cc: [~udim], [~altay], [~robertwb].



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Comment Edited] (BEAM-3645) Support multi-process execution on the FnApiRunner

2019-08-02 Thread Hannah Jiang (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-3645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16896575#comment-16896575
 ] 

Hannah Jiang edited comment on BEAM-3645 at 8/2/19 6:24 PM:


The direct runner can now process map tasks across multiple workers. 
Depending on the running environment, these workers run in multithreading or 
multiprocessing mode.

*How to run with multiprocessing mode*:
{code:java}
# using subprocess runner
p = beam.Pipeline(options=pipeline_options,
                  runner=fn_api_runner.FnApiRunner(
                      default_environment=beam_runner_api_pb2.Environment(
                          urn=python_urns.SUBPROCESS_SDK,
                          payload=b'%s -m apache_beam.runners.worker.sdk_worker_main' %
                          sys.executable.encode('ascii'))))
{code}
 

*How to run with multithreading mode:*
{code:java}
# using in memory embedded runner
p = beam.Pipeline(options=pipeline_options)
{code}
{code:java}
# using embedded grpc runner
p = beam.Pipeline(options=pipeline_options,
  runner=fn_api_runner.FnApiRunner(
    default_environment=beam_runner_api_pb2.Environment(
      urn=python_urns.EMBEDDED_PYTHON_GRPC,
      payload=b'1'))) # payload is # of threads of each worker.{code}
 

The *--direct_num_workers* option controls the number of workers; the default 
value is 1. 
{code:java}
# an example of passing it from the CLI.
python wordcount.py --input xx --output xx --direct_num_workers 2

# an example of setting it with pipeline options directly.
from apache_beam.options.pipeline_options import PipelineOptions
pipeline_options = PipelineOptions(['--direct_num_workers', '2'])

# an example of adding it to existing pipeline options.
pipeline_options = xxx
pipeline_options.view_as(DirectOptions).direct_num_workers = 2{code}


was (Author: hannahjiang):
Direct runner can now process map tasks across multiple workers. Depending on 
running environment, these workers are running in multithreading or 
multiprocessing mode.

*How to run with multiprocessing mode*:
{code:java}
# using subprocess runner
p = beam.Pipeline(options=pipeline_options,
                  runner=fn_api_runner.FnApiRunner(
                      default_environment=beam_runner_api_pb2.Environment(
                          urn=python_urns.SUBPROCESS_SDK,
                          payload=b'%s -m apache_beam.runners.worker.sdk_worker_main' %
                          sys.executable.encode('ascii'))))
{code}
 

*How to run with multithreading mode:*
{code:java}
# using in memory embedded runner
p = beam.Pipeline(options=pipeline_options)
{code}
{code:java}
# using embedded grpc runner
p = beam.Pipeline(options=pipeline_options,
  runner=fn_api_runner.FnApiRunner(
    default_environment=beam_runner_api_pb2.Environment(
      urn=python_urns.EMBEDDED_PYTHON_GRPC,
      payload=b'1'))) # payload is # of threads of each worker.{code}
 

*--direct_num_workers* option is used to control number of workers. Default 
value is 1. 
{code:java}
# an example to pass from CLI.
python wordcount.py --input xx --output xx --direct_num_workers 2

# an example to set it with pipeline options directly.
from apache_beam.options.pipeline_options import PipelineOptions
pipeline_options = PipelineOptions(['--direct_num_workers', '2'])

# an example add it to existing pipeline options.
pipeline_options = xxx
pipeline_options.view_as(DirectOptions).direct_num_workers = 2{code}

> Support multi-process execution on the FnApiRunner
> --
>
> Key: BEAM-3645
> URL: https://issues.apache.org/jira/browse/BEAM-3645
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py-core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Charles Chen
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: 2.15.0
>
>  Time Spent: 35h 20m
>  Remaining Estimate: 0h
>
> https://issues.apache.org/jira/browse/BEAM-3644 gave us a 15x performance 
> gain over the previous DirectRunner.  We can do even better in multi-core 
> environments by supporting multi-process execution in the FnApiRunner, to 
> scale past Python GIL limitations.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Comment Edited] (BEAM-3645) Support multi-process execution on the FnApiRunner

2019-08-02 Thread Hannah Jiang (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-3645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16896575#comment-16896575
 ] 

Hannah Jiang edited comment on BEAM-3645 at 8/2/19 6:24 PM:


The direct runner can now process map tasks across multiple workers. 
Depending on the running environment, these workers run in multithreading or 
multiprocessing mode.

*How to run with multiprocessing mode*:
{code:java}
# using subprocess runner
p = beam.Pipeline(options=pipeline_options,
                  runner=fn_api_runner.FnApiRunner(
                      default_environment=beam_runner_api_pb2.Environment(
                          urn=python_urns.SUBPROCESS_SDK,
                          payload=b'%s -m apache_beam.runners.worker.sdk_worker_main' %
                          sys.executable.encode('ascii'))))
{code}
 

*How to run with multithreading mode:*
{code:java}
# using in memory embedded runner
p = beam.Pipeline(options=pipeline_options)
{code}
{code:java}
# using embedded grpc runner
p = beam.Pipeline(options=pipeline_options,
  runner=fn_api_runner.FnApiRunner(
    default_environment=beam_runner_api_pb2.Environment(
      urn=python_urns.EMBEDDED_PYTHON_GRPC,
      payload=b'1'))) # payload is # of threads of each worker.{code}
 

The *--direct_num_workers* option controls the number of workers; the default 
value is 1. 
{code:java}
# an example of passing --direct_num_workers to a job.
python wordcount.py --input xx --output xx --direct_num_workers 2

# an example of setting it with pipeline options directly.
from apache_beam.options.pipeline_options import PipelineOptions
pipeline_options = PipelineOptions(['--direct_num_workers', '2'])

# an example of adding it to existing pipeline options.
pipeline_options = xxx
pipeline_options.view_as(DirectOptions).direct_num_workers = 2{code}


was (Author: hannahjiang):
Direct runner can now process map tasks across multiple workers. Depending on 
running environment, these workers are running in multithreading or 
multiprocessing mode.

*How to run with multiprocessing mode*:
{code:java}
# using subprocess runner
p = beam.Pipeline(options=pipeline_options,
                  runner=fn_api_runner.FnApiRunner(
                      default_environment=beam_runner_api_pb2.Environment(
                          urn=python_urns.SUBPROCESS_SDK,
                          payload=b'%s -m apache_beam.runners.worker.sdk_worker_main' %
                          sys.executable.encode('ascii'))))
{code}
 

*How to run with multithreading mode:*
{code:java}
# using in memory embedded runner
p = beam.Pipeline(options=pipeline_options)
{code}
{code:java}
# using embedded grpc runner
p = beam.Pipeline(options=pipeline_options,
  runner=fn_api_runner.FnApiRunner(
    default_environment=beam_runner_api_pb2.Environment(
      urn=python_urns.EMBEDDED_PYTHON_GRPC,
      payload=b'1'))) # payload is # of threads of each worker.{code}
 

*--direct_num_workers* option is used to control number of workers. Default 
value is 1. 
{code:java}
# an example to pass --direct_num_workers to a job.
python wordcount.py --input xx --output xx --direct_num_workers 2

# an example to set it with pipeline options directly.
from apache_beam.options.pipeline_options import PipelineOptions
pipeline_options = PipelineOptions(['--direct_num_workers', '2'])

# an example add it to existing pipeline options.
pipeline_options = xxx
pipeline_options.view_as(DirectOptions).direct_num_workers = 2{code}

> Support multi-process execution on the FnApiRunner
> --
>
> Key: BEAM-3645
> URL: https://issues.apache.org/jira/browse/BEAM-3645
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py-core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Charles Chen
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: 2.15.0
>
>  Time Spent: 35h 20m
>  Remaining Estimate: 0h
>
> https://issues.apache.org/jira/browse/BEAM-3644 gave us a 15x performance 
> gain over the previous DirectRunner.  We can do even better in multi-core 
> environments by supporting multi-process execution in the FnApiRunner, to 
> scale past Python GIL limitations.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Comment Edited] (BEAM-3645) Support multi-process execution on the FnApiRunner

2019-08-02 Thread Hannah Jiang (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-3645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16896575#comment-16896575
 ] 

Hannah Jiang edited comment on BEAM-3645 at 8/2/19 6:22 PM:


The direct runner can now process map tasks across multiple workers. 
Depending on the running environment, these workers run in multithreading or 
multiprocessing mode.

*How to run with multiprocessing mode*:
{code:java}
# using subprocess runner
p = beam.Pipeline(options=pipeline_options,
                  runner=fn_api_runner.FnApiRunner(
                      default_environment=beam_runner_api_pb2.Environment(
                          urn=python_urns.SUBPROCESS_SDK,
                          payload=b'%s -m apache_beam.runners.worker.sdk_worker_main' %
                          sys.executable.encode('ascii'))))
{code}
 

*How to run with multithreading mode:*
{code:java}
# using in memory embedded runner
p = beam.Pipeline(options=pipeline_options)
{code}
{code:java}
# using embedded grpc runner
p = beam.Pipeline(options=pipeline_options,
  runner=fn_api_runner.FnApiRunner(
    default_environment=beam_runner_api_pb2.Environment(
      urn=python_urns.EMBEDDED_PYTHON_GRPC,
      payload=b'1'))) # payload is # of threads of each worker.{code}
 

The *--direct_num_workers* option controls the number of workers; the default 
value is 1. 
{code:java}
# an example of passing --direct_num_workers to a job.
python wordcount.py --input xx --output xx --direct_num_workers 2

# an example of setting it with pipeline options directly.
from apache_beam.options.pipeline_options import PipelineOptions
pipeline_options = PipelineOptions(['--direct_num_workers', '2'])

# an example of adding it to existing pipeline options.
pipeline_options = xxx
pipeline_options.view_as(DirectOptions).direct_num_workers = 2{code}


was (Author: hannahjiang):
Direct runner can now process map tasks across multiple workers. Depending on 
running environment, these workers are running in multithreading or 
multiprocessing mode.

*How to run with multiprocessing mode*:
{code:java}
# using subprocess runner
p = beam.Pipeline(options=pipeline_options,
                  runner=fn_api_runner.FnApiRunner(
                      default_environment=beam_runner_api_pb2.Environment(
                          urn=python_urns.SUBPROCESS_SDK,
                          payload=b'%s -m apache_beam.runners.worker.sdk_worker_main' %
                          sys.executable.encode('ascii'))))
{code}
 

*How to run with multithreading mode:*
{code:java}
# using in memory embedded runner
p = beam.Pipeline(options=pipeline_options)
{code}
{code:java}
# using embedded grpc runner
p = beam.Pipeline(options=pipeline_options,
  runner=fn_api_runner.FnApiRunner(
    default_environment=beam_runner_api_pb2.Environment(
      urn=python_urns.EMBEDDED_PYTHON_GRPC,
      payload=b'1'))) # payload is # of threads of each worker.{code}
 

*--direct_num_workers* option is used to control number of workers. Default 
value is 1. 
{code:java}
# an example to pass --direct_num_workers to a job.
python wordcount.py --input xx --output xx --direct_num_workers 2

# an example to set it with pipeline options directly.
from apache_beam.options.pipeline_options import PipelineOptions
pipeline_options = PipelineOptions(['--direct_num_workers', '2'])

# an example add to existing pipeline options.
pipeline_options = xxx
pipeline_options.view_as(DirectOptions).direct_num_workers = 2{code}

> Support multi-process execution on the FnApiRunner
> --
>
> Key: BEAM-3645
> URL: https://issues.apache.org/jira/browse/BEAM-3645
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py-core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Charles Chen
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: 2.15.0
>
>  Time Spent: 35h 20m
>  Remaining Estimate: 0h
>
> https://issues.apache.org/jira/browse/BEAM-3644 gave us a 15x performance 
> gain over the previous DirectRunner.  We can do even better in multi-core 
> environments by supporting multi-process execution in the FnApiRunner, to 
> scale past Python GIL limitations.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Comment Edited] (BEAM-3645) Support multi-process execution on the FnApiRunner

2019-08-02 Thread Hannah Jiang (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-3645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16896575#comment-16896575
 ] 

Hannah Jiang edited comment on BEAM-3645 at 8/2/19 6:22 PM:


The direct runner can now process map tasks across multiple workers. 
Depending on the running environment, these workers run in multithreading or 
multiprocessing mode.

*How to run with multiprocessing mode*:
{code:java}
# using subprocess runner
p = beam.Pipeline(options=pipeline_options,
                  runner=fn_api_runner.FnApiRunner(
                      default_environment=beam_runner_api_pb2.Environment(
                          urn=python_urns.SUBPROCESS_SDK,
                          payload=b'%s -m apache_beam.runners.worker.sdk_worker_main' %
                          sys.executable.encode('ascii'))))
{code}
 

*How to run with multithreading mode:*
{code:java}
# using in memory embedded runner
p = beam.Pipeline(options=pipeline_options)
{code}
{code:java}
# using embedded grpc runner
p = beam.Pipeline(options=pipeline_options,
  runner=fn_api_runner.FnApiRunner(
    default_environment=beam_runner_api_pb2.Environment(
      urn=python_urns.EMBEDDED_PYTHON_GRPC,
      payload=b'1'))) # payload is # of threads of each worker.{code}
 

The *--direct_num_workers* option controls the number of workers; the default 
value is 1.
{code}
# an example of passing --direct_num_workers to a job.
python wordcount.py --input xx --output xx --direct_num_workers 2

# an example of setting it directly with pipeline options.
from apache_beam.options.pipeline_options import PipelineOptions
pipeline_options = PipelineOptions(['--direct_num_workers', '2'])

# an example of setting it on existing pipeline options.
from apache_beam.options.pipeline_options import DirectOptions
pipeline_options = xxx
pipeline_options.view_as(DirectOptions).direct_num_workers = 2{code}



> Support multi-process execution on the FnApiRunner
> --
>
> Key: BEAM-3645
> URL: https://issues.apache.org/jira/browse/BEAM-3645
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py-core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Charles Chen
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: 2.15.0
>
>  Time Spent: 35h 20m
>  Remaining Estimate: 0h
>
> https://issues.apache.org/jira/browse/BEAM-3644 gave us a 15x performance 
> gain over the previous DirectRunner.  We can do even better in multi-core 
> environments by supporting multi-process execution in the FnApiRunner, to 
> scale past Python GIL limitations.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=288150&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288150
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 02/Aug/19 18:17
Start Date: 02/Aug/19 18:17
Worklog Time Spent: 10m 
  Work Description: zfraa commented on pull request #9144: [BEAM-7013] 
Integrating ZetaSketch's HLL++ algorithm with Beam
URL: https://github.com/apache/beam/pull/9144#discussion_r310244911
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCount.java
 ##
 @@ -0,0 +1,350 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.beam.sdk.extensions.zetasketch;
+
+import com.google.zetasketch.HyperLogLogPlusPlus;
+import org.apache.beam.sdk.annotations.Experimental;
+import org.apache.beam.sdk.transforms.Combine;
+import org.apache.beam.sdk.transforms.DoFn;
+import org.apache.beam.sdk.transforms.PTransform;
+import org.apache.beam.sdk.transforms.ParDo;
+import org.apache.beam.sdk.values.KV;
+import org.apache.beam.sdk.values.PCollection;
+
+/**
+ * {@code PTransform}s to compute HyperLogLogPlusPlus (HLL++) sketches on data streams based on the
+ * <a href="https://github.com/google/zetasketch">ZetaSketch</a> implementation.
+ *
+ * <p>HLL++ is an algorithm implemented by Google that estimates the count of distinct elements in a
+ * data stream. HLL++ requires significantly less memory than the linear memory needed for exact
+ * computation, at the cost of a small error. Cardinalities of arbitrary breakdowns can be computed
+ * using the HLL++ sketch. See this <a
+ * href="http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/40671.pdf">published
+ * paper</a> for details about the algorithm.
+ *
+ * <p>HLL++ functions are also supported in <a
+ * href="https://cloud.google.com/bigquery/docs/reference/standard-sql/hll_functions">Google Cloud
+ * BigQuery</a>. Using the {@code HllCount PTransform}s makes the interoperation with BigQuery
+ * easier.
+ *
+ * <p>For detailed design of this class, see https://s.apache.org/hll-in-beam.
+ *
+ * <h3>Examples</h3>
+ *
+ * <h4>Example 1: Create long-type sketch for a {@code PCollection<Long>} and specify precision</h4>
+ *
+ * <pre>{@code
+ * PCollection<Long> input = ...;
+ * int p = ...;
+ * PCollection<byte[]> sketch = input.apply(HllCount.Init.longSketch().withPrecision(p).globally());
+ * }</pre>
+ *
+ * <h4>Example 2: Create bytes-type sketch for a {@code PCollection<KV<String, byte[]>>}</h4>
+ *
+ * <pre>{@code
+ * PCollection<KV<String, byte[]>> input = ...;
+ * PCollection<KV<String, byte[]>> sketch = input.apply(HllCount.Init.bytesSketch().perKey());
+ * }</pre>
+ *
+ * <h4>Example 3: Merge existing sketches in a {@code PCollection<byte[]>} into a new one</h4>
+ *
+ * <pre>{@code
+ * PCollection<byte[]> sketches = ...;
+ * PCollection<byte[]> mergedSketch = sketches.apply(HllCount.MergePartial.globally());
+ * }</pre>
+ *
+ * <h4>Example 4: Estimates the count of distinct elements in a {@code PCollection<String>}</h4>
+ *
+ * <pre>{@code
+ * PCollection<String> input = ...;
+ * PCollection<Long> countDistinct =
+ *     input.apply(HllCount.Init.stringSketch().globally()).apply(HllCount.Extract.globally());
+ * }</pre>
+ *
+ * <p>Note: Currently HllCount does not work on FnAPI workers. See <a
+ * href="https://issues.apache.org/jira/browse/BEAM-7879">Jira ticket [BEAM-7879]</a>.
+ */
+@Experimental
+public final class HllCount {
+
+  public static final int MINIMUM_PRECISION = 
HyperLogLogPlusPlus.MINIMUM_PRECISION;
 
 Review comment:
   Ah, got it! But the Javadoc does not expand the actual numbers, or does it? 
   Maybe leave a comment with "constants replicated here for documentation 
purposes / to make them easy to find for users"? 
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 288150)
Time Spent: 9h 40m  (was: 9.5h)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
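For intuition about what the sketches discussed above do, here is a minimal, illustrative HyperLogLog in plain Python. This is not ZetaSketch and omits the "++" refinements (sparse representation, bias correction); the class name, hash choice, and default precision are mine.

```python
import hashlib
import math

class HyperLogLog:
    """Toy HLL: 2**p registers, each keeping the max leading-zero rank seen."""

    def __init__(self, p=14):
        self.p = p
        self.m = 1 << p                  # number of registers
        self.registers = [0] * self.m

    def _hash64(self, value):
        # Deterministic 64-bit hash (sha1 truncated); fine for a demo.
        digest = hashlib.sha1(str(value).encode('utf-8')).digest()
        return int.from_bytes(digest[:8], 'big')

    def add(self, value):
        x = self._hash64(value)
        idx = x >> (64 - self.p)                  # first p bits pick a register
        rest = x & ((1 << (64 - self.p)) - 1)     # remaining 64-p bits
        rank = (64 - self.p) - rest.bit_length() + 1  # 1 + leading zeros
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)     # bias constant for large m
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:         # small-range correction
            return self.m * math.log(self.m / zeros)
        return raw
```

The memory cost is the fixed register array (here 2**14 small integers) regardless of how many elements are added, which is the property the Javadoc's "significantly less memory than the linear memory needed for exact computation" refers to.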

[jira] [Work logged] (BEAM-7060) Design Py3-compatible typehints annotation support in Beam 3.

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7060?focusedWorklogId=288145&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288145
 ]

ASF GitHub Bot logged work on BEAM-7060:


Author: ASF GitHub Bot
Created on: 02/Aug/19 18:15
Start Date: 02/Aug/19 18:15
Worklog Time Spent: 10m 
  Work Description: chamikaramj commented on pull request #9223: 
[BEAM-7060] Introduce Python3-only test modules
URL: https://github.com/apache/beam/pull/9223#discussion_r310242868
 
 

 ##
 File path: sdks/python/scripts/run_pylint.sh
 ##
 @@ -60,23 +60,34 @@ EXCLUDED_GENERATED_FILES=(
 apache_beam/portability/api/*pb2*.py
 )
 
+PYTHON_MAJOR=$(python -c 'import sys; print(sys.version_info[0])')
 
 Review comment:
   Please add a JIRA to do lint checks for *py3.py files.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 288145)
Time Spent: 6h 20m  (was: 6h 10m)

> Design Py3-compatible typehints annotation support in Beam 3.
> -
>
> Key: BEAM-7060
> URL: https://issues.apache.org/jira/browse/BEAM-7060
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Udi Meiri
>Priority: Major
>  Time Spent: 6h 20m
>  Remaining Estimate: 0h
>
> Existing [Typehints implementaiton in 
> Beam|[https://github.com/apache/beam/blob/master/sdks/python/apache_beam/typehints/
> ] heavily relies on internal details of CPython implementation, and some of 
> the assumptions of this implementation broke as of Python 3.6, see for 
> example: https://issues.apache.org/jira/browse/BEAM-6877, which makes  
> typehints support unusable on Python 3.6 as of now. [Python 3 Kanban 
> Board|https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=245=detail]
>  lists several specific typehints-related breakages, prefixed with "TypeHints 
> Py3 Error".
> We need to decide whether to:
> - Deprecate in-house typehints implementation.
> - Continue to support in-house implementation, which at this point is a stale 
> code and has other known issues.
> - Attempt to use some off-the-shelf libraries for supporting 
> type-annotations, like  Pytype, Mypy, PyAnnotate.
> WRT to this decision we also need to plan on immediate next steps to unblock 
> adoption of Beam for  Python 3.6+ users. One potential option may be to have 
> Beam SDK ignore any typehint annotations on Py 3.6+.
> cc: [~udim], [~altay], [~robertwb].



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7874) FnApi only supports up to 10 workers

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7874?focusedWorklogId=288141&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288141
 ]

ASF GitHub Bot logged work on BEAM-7874:


Author: ASF GitHub Bot
Created on: 02/Aug/19 18:13
Start Date: 02/Aug/19 18:13
Worklog Time Spent: 10m 
  Work Description: Hannah-Jiang commented on issue #9218: [BEAM-7874], 
[BEAM-7873] FnApi bugfixs
URL: https://github.com/apache/beam/pull/9218#issuecomment-517416968
 
 
   R: @robertwb 
   Can you please take a look and merge if it looks good? Thank you.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 288141)
Time Spent: 50m  (was: 40m)

> FnApi only supports up to 10 workers
> 
>
> Key: BEAM-7874
> URL: https://issues.apache.org/jira/browse/BEAM-7874
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Hannah Jiang
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: 2.15.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Because max_workers of the grpc servers is hardcoded to 10, only up to 10 
> workers are supported; if direct_num_workers is set to a value greater than 
> 10, the pipeline hangs because not all workers get connected to the runner.
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/fn_api_runner.py#L1141]
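The hang described above can be reproduced in miniature with any fixed-size thread pool: once max_workers long-lived tasks occupy the pool, additional tasks queue indefinitely. A Beam-free sketch (all names here are illustrative, not Beam or grpc APIs):

```python
from concurrent.futures import ThreadPoolExecutor
import threading

def count_started(num_tasks, max_workers):
    """Submit num_tasks long-lived tasks to a pool capped at max_workers
    and count how many actually begin running."""
    started = threading.Semaphore(0)
    release = threading.Event()

    def long_lived_task():
        started.release()   # signal "this worker connected"
        release.wait()      # stay busy, like an SDK worker holding its channel

    pool = ThreadPoolExecutor(max_workers=max_workers)
    for _ in range(num_tasks):
        pool.submit(long_lived_task)
    # Count how many tasks managed to start within a short grace period.
    running = sum(started.acquire(timeout=0.2) for _ in range(num_tasks))
    release.set()           # unblock everything so the pool can shut down
    pool.shutdown(wait=True)
    return running
```

With num_tasks=12 and max_workers=10, only 10 tasks ever start, which mirrors why a pipeline with direct_num_workers greater than the server's hardcoded cap stalls.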



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7874) FnApi only supports up to 10 workers

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7874?focusedWorklogId=288143&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288143
 ]

ASF GitHub Bot logged work on BEAM-7874:


Author: ASF GitHub Bot
Created on: 02/Aug/19 18:13
Start Date: 02/Aug/19 18:13
Worklog Time Spent: 10m 
  Work Description: Hannah-Jiang commented on pull request #9218: 
[BEAM-7874], [BEAM-7873] FnApi bugfixs
URL: https://github.com/apache/beam/pull/9218#discussion_r310240952
 
 

 ##
 File path: sdks/python/apache_beam/runners/portability/local_job_service.py
 ##
 @@ -200,10 +202,13 @@ def run(self):
 if self._worker_id:
   env_dict['WORKER_ID'] = self._worker_id
 
-p = subprocess.Popen(
-self._worker_command_line,
-shell=True,
-env=env_dict)
+# A deadlock can happen on py2, so serialize Popen() calls with a lock.
+# py3 is not affected; the lock is harmless there.
+with SubprocessSdkWorker._lock:
+  p = subprocess.Popen(
 
 Review comment:
   It turns out that the deadlock happens in Popen(), rather than in p.wait().
   Ran wordcount.py 1500 times with 3 workers and confirmed the deadlock no 
longer happens.
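The workaround in the diff above can be sketched as a standalone helper (the module-level lock and function name are hypothetical; the real change lives in local_job_service.py):

```python
import subprocess
import threading

# py2's subprocess.Popen is not fully thread-safe; serializing calls avoids
# a rare deadlock when many workers are spawned concurrently. On py3 the
# lock is redundant but harmless.
_popen_lock = threading.Lock()

def start_worker(command_line, env=None):
    # Hold the lock only for the Popen() call itself, not for the
    # lifetime of the child process.
    with _popen_lock:
        return subprocess.Popen(command_line, shell=True, env=env)
```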
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 288143)
Time Spent: 1h  (was: 50m)

> FnApi only supports up to 10 workers
> 
>
> Key: BEAM-7874
> URL: https://issues.apache.org/jira/browse/BEAM-7874
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Hannah Jiang
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: 2.15.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Because max_workers of the grpc servers is hardcoded to 10, only up to 10 
> workers are supported; if direct_num_workers is set to a value greater than 
> 10, the pipeline hangs because not all workers get connected to the runner.
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/fn_api_runner.py#L1141]



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7060) Design Py3-compatible typehints annotation support in Beam 3.

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7060?focusedWorklogId=288140&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288140
 ]

ASF GitHub Bot logged work on BEAM-7060:


Author: ASF GitHub Bot
Created on: 02/Aug/19 18:12
Start Date: 02/Aug/19 18:12
Worklog Time Spent: 10m 
  Work Description: chamikaramj commented on pull request #9223: 
[BEAM-7060] Introduce Python3-only test modules
URL: https://github.com/apache/beam/pull/9223#discussion_r310242868
 
 

 ##
 File path: sdks/python/scripts/run_pylint.sh
 ##
 @@ -60,23 +60,34 @@ EXCLUDED_GENERATED_FILES=(
 apache_beam/portability/api/*pb2*.py
 )
 
+PYTHON_MAJOR=$(python -c 'import sys; print(sys.version_info[0])')
 
 Review comment:
   Please add a JIRA to do lint checks for Python 3.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 288140)
Time Spent: 6h 10m  (was: 6h)

> Design Py3-compatible typehints annotation support in Beam 3.
> -
>
> Key: BEAM-7060
> URL: https://issues.apache.org/jira/browse/BEAM-7060
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Udi Meiri
>Priority: Major
>  Time Spent: 6h 10m
>  Remaining Estimate: 0h
>
> Existing [Typehints implementaiton in 
> Beam|[https://github.com/apache/beam/blob/master/sdks/python/apache_beam/typehints/
> ] heavily relies on internal details of CPython implementation, and some of 
> the assumptions of this implementation broke as of Python 3.6, see for 
> example: https://issues.apache.org/jira/browse/BEAM-6877, which makes  
> typehints support unusable on Python 3.6 as of now. [Python 3 Kanban 
> Board|https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=245=detail]
>  lists several specific typehints-related breakages, prefixed with "TypeHints 
> Py3 Error".
> We need to decide whether to:
> - Deprecate in-house typehints implementation.
> - Continue to support in-house implementation, which at this point is a stale 
> code and has other known issues.
> - Attempt to use some off-the-shelf libraries for supporting 
> type-annotations, like  Pytype, Mypy, PyAnnotate.
> WRT to this decision we also need to plan on immediate next steps to unblock 
> adoption of Beam for  Python 3.6+ users. One potential option may be to have 
> Beam SDK ignore any typehint annotations on Py 3.6+.
> cc: [~udim], [~altay], [~robertwb].



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7874) FnApi only supports up to 10 workers

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7874?focusedWorklogId=288137&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288137
 ]

ASF GitHub Bot logged work on BEAM-7874:


Author: ASF GitHub Bot
Created on: 02/Aug/19 18:06
Start Date: 02/Aug/19 18:06
Worklog Time Spent: 10m 
  Work Description: Hannah-Jiang commented on pull request #9218: 
[BEAM-7874], [BEAM-7873] FnApi bugfixs
URL: https://github.com/apache/beam/pull/9218#discussion_r310240952
 
 

 ##
 File path: sdks/python/apache_beam/runners/portability/local_job_service.py
 ##
 @@ -200,10 +202,13 @@ def run(self):
 if self._worker_id:
   env_dict['WORKER_ID'] = self._worker_id
 
-p = subprocess.Popen(
-self._worker_command_line,
-shell=True,
-env=env_dict)
+# A deadlock can happen on py2, so serialize Popen() calls with a lock.
+# py3 is not affected; the lock is harmless there.
+with SubprocessSdkWorker._lock:
+  p = subprocess.Popen(
 
 Review comment:
   It turns out that the deadlock happens in Popen(), rather than in p.wait().
   Ran wordcount.py 1500 times with 3 workers and confirmed the deadlock no 
longer happens.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 288137)
Time Spent: 40m  (was: 0.5h)

> FnApi only supports up to 10 workers
> 
>
> Key: BEAM-7874
> URL: https://issues.apache.org/jira/browse/BEAM-7874
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Hannah Jiang
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: 2.15.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Because max_workers of the grpc servers is hardcoded to 10, only up to 10 
> workers are supported; if direct_num_workers is set to a value greater than 
> 10, the pipeline hangs because not all workers get connected to the runner.
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/fn_api_runner.py#L1141]



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (BEAM-7873) FnApi with Subprocess runner hangs frequently when running with multi workers with py2

2019-08-02 Thread Hannah Jiang (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hannah Jiang updated BEAM-7873:
---
Description: Pipeline hangs at 
[subprocess.Popen()|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/local_job_service.py#L203]
 when shutting it down. I looked into the source code of the subprocess lib: 
[py27|https://github.com/enthought/Python-2.7.3/blob/master/Lib/subprocess.py#L1286]
 doesn't do any locking, while 
[py3|https://github.com/python/cpython/blob/3.7/Lib/subprocess.py#L1592] locks 
while waiting. Py3 added locks in other places in Popen() as well; any of the 
unlocked places in py2 may contribute to the problem. We can add a lock around 
the Popen() call to prevent the deadlock.

> FnApi with Subprocess runner hangs frequently when running with multi workers 
> with py2
> --
>
> Key: BEAM-7873
> URL: https://issues.apache.org/jira/browse/BEAM-7873
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Hannah Jiang
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: 2.15.0
>
>
> Pipeline hangs at 
> [subprocess.Popen()|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/local_job_service.py#L203]
>  when shutting it down. I looked into the source code of the subprocess lib: 
> [py27|https://github.com/enthought/Python-2.7.3/blob/master/Lib/subprocess.py#L1286]
>  doesn't do any locking, while 
> [py3|https://github.com/python/cpython/blob/3.7/Lib/subprocess.py#L1592] 
> locks while waiting. Py3 added locks in other places in Popen() as well; any 
> of the unlocked places in py2 may contribute to the problem. We can add a 
> lock around the Popen() call to prevent the deadlock.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (BEAM-7873) FnApi with Subprocess runner hangs frequently when running with multi workers with py2

2019-08-02 Thread Hannah Jiang (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hannah Jiang updated BEAM-7873:
---
Description: 
Pipeline hangs at 
[Popen()|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/local_job_service.py#L203]
 when shutting it down. I looked into the source code of the subprocess lib: 
[py27|https://github.com/enthought/Python-2.7.3/blob/master/Lib/subprocess.py#L1286]
 doesn't do any locking, while 
[py3|https://github.com/python/cpython/blob/3.7/Lib/subprocess.py#L1592] locks 
while waiting. Py3 added locks in other places in Popen() as well; any of the 
unlocked places in py2 may contribute to the problem. We can add a lock around 
the Popen() call to prevent the deadlock.


> FnApi with Subprocess runner hangs frequently when running with multi workers 
> with py2
> --
>
> Key: BEAM-7873
> URL: https://issues.apache.org/jira/browse/BEAM-7873
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Hannah Jiang
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: 2.15.0
>
>
> Pipeline hangs at 
> [Popen()|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/local_job_service.py#L203]
>  when shutting it down. I looked into the source code of the subprocess lib: 
> [py27|https://github.com/enthought/Python-2.7.3/blob/master/Lib/subprocess.py#L1286]
>  doesn't do any locking, while 
> [py3|https://github.com/python/cpython/blob/3.7/Lib/subprocess.py#L1592] 
> locks while waiting. Py3 added locks in other places in Popen() as well; any 
> of the unlocked places in py2 may contribute to the problem. We can add a 
> lock around the Popen() call to prevent the deadlock.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

