[jira] [Work logged] (BEAM-3489) Expose the message id of received messages within PubsubMessage

2019-08-18 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-3489?focusedWorklogId=296964&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-296964
 ]

ASF GitHub Bot logged work on BEAM-3489:


Author: ASF GitHub Bot
Created on: 18/Aug/19 22:07
Start Date: 18/Aug/19 22:07
Worklog Time Spent: 10m 
  Work Description: thinhha commented on issue #8370: [BEAM-3489] add 
PubSub messageId in PubsubMessage
URL: https://github.com/apache/beam/pull/8370#issuecomment-522360044
 
 
   Run Java PostCommit
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 296964)
Time Spent: 7h 50m  (was: 7h 40m)

> Expose the message id of received messages within PubsubMessage
> ---
>
> Key: BEAM-3489
> URL: https://issues.apache.org/jira/browse/BEAM-3489
> Project: Beam
>  Issue Type: New Feature
>  Components: io-java-gcp
>Reporter: Luke Cwik
>Assignee: Thinh Ha
>Priority: Minor
>  Labels: newbie, starter
>  Time Spent: 7h 50m
>  Remaining Estimate: 0h
>
> This task is about passing forward the message id from the pubsub proto to 
> the java PubsubMessage.
> Add a message id field to PubsubMessage.
> Update the coder for PubsubMessage to encode the message id.
> Update the translation from the Pubsub proto message to the Dataflow message:
> https://github.com/apache/beam/blob/2e275264b21db45787833502e5e42907b05e28b8/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubUnboundedSource.java#L976
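For illustration, here is a minimal sketch of what those three steps amount to: carry a message id on the message type and make the coder round-trip it. The sketch is Python for readability, although the change described above targets the Java SDK; every name in it is hypothetical rather than Beam API.

```python
import base64
import json


class PubsubMessageWithId(object):
  """Illustrative stand-in for a PubsubMessage that carries a message id."""

  def __init__(self, data, attributes, message_id=None):
    self.data = data              # bytes payload
    self.attributes = attributes  # dict of str -> str
    self.message_id = message_id  # server-assigned; None before publishing


def encode_message(msg):
  """What the updated coder must capture: data, attributes, and the id."""
  envelope = {
      'data': base64.b64encode(msg.data).decode('ascii'),
      'attributes': msg.attributes,
      'message_id': msg.message_id,
  }
  return json.dumps(envelope).encode('utf-8')


def decode_message(encoded):
  envelope = json.loads(encoded.decode('utf-8'))
  return PubsubMessageWithId(
      base64.b64decode(envelope['data']),
      envelope['attributes'],
      envelope['message_id'])
```

Encoding a message and decoding it again should reproduce the message id; the translation step in PubsubUnboundedSource then only has to copy the proto's message id into this field.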



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7980) External environment with containerized worker pool

2019-08-18 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7980?focusedWorklogId=296953&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-296953
 ]

ASF GitHub Bot logged work on BEAM-7980:


Author: ASF GitHub Bot
Created on: 18/Aug/19 19:48
Start Date: 18/Aug/19 19:48
Worklog Time Spent: 10m 
  Work Description: tweise commented on pull request #9371: [BEAM-7980] 
External environment with containerized worker pool
URL: https://github.com/apache/beam/pull/9371
 
 
   Add an option to use the Python SDK Docker image for execution as a worker 
pool, launching individual workers through boot.go as per the container spec.
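   For context, a sketch of how a pipeline might target such a worker pool from the Python SDK, assuming a portable job server at localhost:8099 and a pool listening on localhost:50000; the flag values are assumptions based on the PR description, not settled API.

```python
# Sketch only: assumes a running portable job server and worker pool.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    '--runner=PortableRunner',
    '--job_endpoint=localhost:8099',         # portable job server
    '--environment_type=EXTERNAL',           # reuse externally managed workers
    '--environment_config=localhost:50000',  # address of the worker pool
])

with beam.Pipeline(options=options) as p:
  _ = (p
       | beam.Create(['hello', 'worker', 'pool'])
       | beam.Map(print))
```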
   
   
   
   Thank you for your contribution! Follow this checklist to help us 
incorporate your contribution quickly and easily:
   
- [ ] [**Choose 
reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and 
mention them in a comment (`R: @username`).
- [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in 
ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA 
issue, if applicable. This will automatically link the pull request to the 
issue.
- [ ] If this contribution is large, please file an Apache [Individual 
Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
   
   Post-Commit Tests Status (on master branch)
   

   
   Lang | SDK | Apex | Dataflow | Flink | Gearpump | Samza | Spark
   --- | --- | --- | --- | --- | --- | --- | ---
   Go | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Go/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Go/lastCompletedBuild/) | --- | --- | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Go_VR_Flink/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Go_VR_Flink/lastCompletedBuild/) | --- | --- | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Go_VR_Spark/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Go_VR_Spark/lastCompletedBuild/)
   Java | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Java/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java/lastCompletedBuild/) | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Apex/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Apex/lastCompletedBuild/) | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/lastCompletedBuild/) | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink/lastCompletedBuild/)[![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Batch/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Batch/lastCompletedBuild/)[![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Streaming/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Streaming/lastCompletedBuild/) | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Gearpump/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Gearpump/lastCompletedBuild/) | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Samza/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Samza/lastCompletedBuild/) | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/lastCompletedBuild/)[![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Spark_Batch/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Spark_Batch/lastCompletedBuild/)
   Python | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Python2/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Python2/lastCompletedBuild/)[![Build Status](https://builds.apache.org/job/beam_PostCommit_Python35/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Python35/lastCompletedBuild/)[![Build Status](https://builds.apache.org/job/beam_PostCommit_Python36/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Python36/lastCompletedBuild/)
 

[jira] [Commented] (BEAM-2572) Implement an S3 filesystem for Python SDK

2019-08-18 Thread Pablo Estrada (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16910040#comment-16910040
 ] 

Pablo Estrada commented on BEAM-2572:
-

[~mattmorgis] any updates?

> Implement an S3 filesystem for Python SDK
> -
>
> Key: BEAM-2572
> URL: https://issues.apache.org/jira/browse/BEAM-2572
> Project: Beam
>  Issue Type: Task
>  Components: sdk-py-core
>Reporter: Dmitry Demeshchuk
>Assignee: Pablo Estrada
>Priority: Minor
>  Labels: GSoC2019, gsoc, gsoc2019, mentor
>
> There are two paths worth exploring, to my understanding:
> 1. Sticking to the HDFS-based approach (like it's done in Java).
> 2. Using boto/boto3 for accessing S3 through its common API endpoints.
> I personally prefer the second approach, for a few reasons:
> 1. In real life, HDFS and S3 have different consistency guarantees, so their 
> behaviors may diverge in some edge cases (say, we write something to S3, but 
> it's not immediately available for reading from the other end).
> 2. There are other AWS-based sources and sinks we may want to create in the 
> future: DynamoDB, Kinesis, SQS, etc.
> 3. boto3 already provides reasonably good logic for basic things like 
> retrying.
> Whatever path we choose, there's another problem related to this: we 
> currently cannot pass any global settings (say, pipeline options, or just an 
> arbitrary kwarg) to a filesystem. Because of that, we'd have to set up the 
> runner nodes with AWS keys in the environment, which is neither trivial to 
> achieve nor particularly clean (I'd rather see one single place for 
> configuring the runner options).
> Also, it's worth mentioning that I already have a janky S3 filesystem 
> implementation that only supports DirectRunner at the moment (because of the 
> previous paragraph). I'm perfectly fine finishing it myself, with some 
> guidance from the maintainers.
> Where should I move on from here, and whose input should I be looking for?
> Thanks!
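For reference, a minimal boto3 sketch of the core operations a Beam FileSystem for S3 would have to wrap under approach 2. Bucket and key names are placeholders, and credentials come implicitly from the environment, which is exactly the configuration-passing problem described above.

```python
import boto3

s3 = boto3.client('s3')  # reads AWS keys from env vars or an instance profile


def s3_write(bucket, key, data):
  # Beam's FileSystem.create() would return a writable channel over this.
  s3.put_object(Bucket=bucket, Key=key, Body=data)


def s3_read(bucket, key):
  # Beam's FileSystem.open() would stream this instead of reading eagerly.
  return s3.get_object(Bucket=bucket, Key=key)['Body'].read()


def s3_match(bucket, prefix):
  # Beam's FileSystem.match() maps naturally onto list_objects_v2.
  resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
  return [obj['Key'] for obj in resp.get('Contents', [])]
```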



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (BEAM-3763) Add per-transform documentation to the website

2019-08-18 Thread Pablo Estrada (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-3763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16910035#comment-16910035
 ] 

Pablo Estrada commented on BEAM-3763:
-

[~rtnguyen] you're working on something like this, right? :) 

> Add per-transform documentation to the website
> --
>
> Key: BEAM-3763
> URL: https://issues.apache.org/jira/browse/BEAM-3763
> Project: Beam
>  Issue Type: Task
>  Components: website
>Reporter: Rafael Fernandez
>Priority: Minor
>  Labels: easyfix, reference
>
> Add structure to the website to document per-transform definitions and 
> examples. The idea is to populate this section incrementally and clean up 
> stale javadoc entries that have unworkable or outdated examples.
>  
> This task tracks creating the right structure for the website. Each transform 
> cleanup/documentation will come with its own JIRA, so that other members of 
> the community can pick up outstanding work.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (BEAM-8000) Add Delete method to gRPC JobService

2019-08-18 Thread Chad Dombrova (JIRA)
Chad Dombrova created BEAM-8000:
---

 Summary: Add Delete method to gRPC JobService
 Key: BEAM-8000
 URL: https://issues.apache.org/jira/browse/BEAM-8000
 Project: Beam
  Issue Type: Improvement
  Components: beam-model
Reporter: Chad Dombrova
Assignee: Chad Dombrova


As a user of the InMemoryJobService, I want a method to purge jobs from memory 
when they are no longer needed, so that the service's memory usage does not 
balloon over time.

I was planning to name this method Delete; I am also considering the name Purge.
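To make the proposed semantics concrete, here is a small sketch of what a purge on an in-memory job store could look like. The real change would add a new RPC to the JobService proto and wire it through InMemoryJobService; all names below, including the terminal-state strings, are hypothetical.

```python
import threading

TERMINAL_STATES = {'DONE', 'FAILED', 'CANCELLED'}


class InMemoryJobStore(object):

  def __init__(self):
    self._jobs = {}  # job_id -> (state, job_details)
    self._lock = threading.Lock()

  def delete(self, job_id):
    """Drops a job from memory once it is in a terminal state.

    Returns True if the job was removed; False if the id is unknown or the
    job is still running (deleting a live job is presumably an error).
    """
    with self._lock:
      job = self._jobs.get(job_id)
      if job is None or job[0] not in TERMINAL_STATES:
        return False
      del self._jobs[job_id]
      return True
```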



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7742) BigQuery File Loads to work well with load job size limits

2019-08-18 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7742?focusedWorklogId=296945&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-296945
 ]

ASF GitHub Bot logged work on BEAM-7742:


Author: ASF GitHub Bot
Created on: 18/Aug/19 17:27
Start Date: 18/Aug/19 17:27
Worklog Time Spent: 10m 
  Work Description: ttanay commented on issue #9242: [BEAM-7742] Partition 
files in BQFL to cater to quotas & limits
URL: https://github.com/apache/beam/pull/9242#issuecomment-522339955
 
 
   R: @pabloem 
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 296945)
Time Spent: 2h 20m  (was: 2h 10m)

> BigQuery File Loads to work well with load job size limits
> --
>
> Key: BEAM-7742
> URL: https://issues.apache.org/jira/browse/BEAM-7742
> Project: Beam
>  Issue Type: Improvement
>  Components: io-py-gcp
>Reporter: Pablo Estrada
>Assignee: Tanay Tummalapalli
>Priority: Major
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> Load jobs into BigQuery have a number of limitations: 
> [https://cloud.google.com/bigquery/quotas#load_jobs]
>  
> Currently, the python BQ sink implemented in `bigquery_file_loads.py` does 
> not handle these limitations well. Improvements need to be made to the 
> implementation, to:
>  * Decide dynamically at pipeline execution whether to use temp_tables.
>  * Add code to determine when a load job to a single destination needs to be 
> partitioned into multiple jobs.
>  * When this happens, we definitely need to use temp_tables, in case one of 
> the load jobs fails and the pipeline is rerun.
> Tanay, would you be able to look at this?
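To illustrate the partitioning bullet, here is a greedy sketch that splits one destination's files into several load jobs when they would exceed the documented limits. The limit constants reflect the quotas page at the time of writing; treat them as placeholders.

```python
MAX_FILES_PER_JOB = 10000          # max source URIs per load job
MAX_BYTES_PER_JOB = 15 * 2 ** 40   # 15 TB max total size per load job


def partition_files(files_with_sizes):
  """Groups (path, size) pairs into partitions that each fit one load job."""
  partitions = []
  current, current_bytes = [], 0
  for path, size in files_with_sizes:
    if current and (len(current) >= MAX_FILES_PER_JOB
                    or current_bytes + size > MAX_BYTES_PER_JOB):
      partitions.append(current)
      current, current_bytes = [], 0
    current.append(path)
    current_bytes += size
  if current:
    partitions.append(current)
  return partitions
```

Whenever this yields more than one partition, temp_tables stop being optional: a rerun after a partial failure must not re-append the rows from a partition that already succeeded, which is exactly the third bullet.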



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7742) BigQuery File Loads to work well with load job size limits

2019-08-18 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7742?focusedWorklogId=296895&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-296895
 ]

ASF GitHub Bot logged work on BEAM-7742:


Author: ASF GitHub Bot
Created on: 18/Aug/19 08:24
Start Date: 18/Aug/19 08:24
Worklog Time Spent: 10m 
  Work Description: ttanay commented on issue #9242: [BEAM-7742] Partition 
files in BQFL to cater to quotas & limits
URL: https://github.com/apache/beam/pull/9242#issuecomment-522301896
 
 
   Run Python 3.5 PostCommit
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 296895)
Time Spent: 2h 10m  (was: 2h)

> BigQuery File Loads to work well with load job size limits
> --
>
> Key: BEAM-7742
> URL: https://issues.apache.org/jira/browse/BEAM-7742
> Project: Beam
>  Issue Type: Improvement
>  Components: io-py-gcp
>Reporter: Pablo Estrada
>Assignee: Tanay Tummalapalli
>Priority: Major
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Load jobs into BigQuery have a number of limitations: 
> [https://cloud.google.com/bigquery/quotas#load_jobs]
>  
> Currently, the python BQ sink implemented in `bigquery_file_loads.py` does 
> not handle these limitations well. Improvements need to be made to the 
> implementation, to:
>  * Decide dynamically at pipeline execution whether to use temp_tables.
>  * Add code to determine when a load job to a single destination needs to be 
> partitioned into multiple jobs.
>  * When this happens, we definitely need to use temp_tables, in case one of 
> the load jobs fails and the pipeline is rerun.
> Tanay, would you be able to look at this?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7819) PubsubMessage message parsing is lacking non-attribute fields

2019-08-18 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7819?focusedWorklogId=296891&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-296891
 ]

ASF GitHub Bot logged work on BEAM-7819:


Author: ASF GitHub Bot
Created on: 18/Aug/19 06:37
Start Date: 18/Aug/19 06:37
Worklog Time Spent: 10m 
  Work Description: matt-darwin commented on issue #9232: [BEAM-7819] 
Python - parse PubSub message_id into attributes property
URL: https://github.com/apache/beam/pull/9232#issuecomment-522295664
 
 
   Run Python Dataflow ValidatesRunner
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 296891)
Time Spent: 7h 40m  (was: 7.5h)

> PubsubMessage message parsing is lacking non-attribute fields
> -
>
> Key: BEAM-7819
> URL: https://issues.apache.org/jira/browse/BEAM-7819
> Project: Beam
>  Issue Type: Bug
>  Components: io-py-gcp
>Reporter: Ahmet Altay
>Assignee: Udi Meiri
>Priority: Major
>  Time Spent: 7h 40m
>  Remaining Estimate: 0h
>
> User reported issue: 
> https://lists.apache.org/thread.html/139b0c15abc6471a2e2202d76d915c645a529a23ecc32cd9cfecd315@%3Cuser.beam.apache.org%3E
> """
> Looking at the source code, with my untrained python eyes, I think if the 
> intention is to include the message id and the publish time in the attributes 
> attribute of the PubSubMessage type, then the protobuf mapping is missing 
> something:-
> @staticmethod
> def _from_proto_str(proto_msg):
>   """Construct from serialized form of ``PubsubMessage``.
> 
>   Args:
>     proto_msg: String containing a serialized protobuf of type
>       https://cloud.google.com/pubsub/docs/reference/rpc/google.pubsub.v1#google.pubsub.v1.PubsubMessage
> 
>   Returns:
>     A new PubsubMessage object.
>   """
>   msg = pubsub.types.pubsub_pb2.PubsubMessage()
>   msg.ParseFromString(proto_msg)
>   # Convert ScalarMapContainer to dict.
>   attributes = dict((key, msg.attributes[key]) for key in msg.attributes)
>   return PubsubMessage(msg.data, attributes)
> The protobuf definition is here:-
> https://cloud.google.com/pubsub/docs/reference/rpc/google.pubsub.v1#google.pubsub.v1.PubsubMessage
> and so it looks as if the message_id and publish_time are not being parsed, 
> as they are separate from the attributes. Perhaps the PubsubMessage class 
> needs expanding to include these as attributes, or they would need adding to 
> the attributes dictionary. This would only need doing for _from_proto_str, 
> as obviously they would not need to be populated when transmitting a message 
> to PubSub.
> My python is not great; I'm assuming the latter option would need to look 
> something like this?
> 
>   attributes = dict((key, msg.attributes[key]) for key in msg.attributes)
>   attributes.update({'message_id': msg.message_id,
>                      'publish_time': msg.publish_time})
>   return PubsubMessage(msg.data, attributes)
> """



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)