[jira] [Created] (BEAM-7942) Kafka backward compatibility

2019-08-11 Thread Reenu Saluja (JIRA)
Reenu Saluja created BEAM-7942:
--

 Summary: Kafka backward compatibility 
 Key: BEAM-7942
 URL: https://issues.apache.org/jira/browse/BEAM-7942
 Project: Beam
  Issue Type: Improvement
  Components: io-java-kafka
Affects Versions: 2.13.0
Reporter: Reenu Saluja
 Fix For: 2.8.0


Getting the error below when using Kafka 1.0 with Beam 2.13.0:

Uncaught throwable from user code: java.lang.IllegalStateException: No 
TransformEvaluator registered for BOUNDED transform Read(CreateSource)
at 
org.apache.beam.vendor.guava.v20_0.com.google.common.base.Preconditions.checkState(Preconditions.java:518)

So I had to downgrade the Beam version to 2.8.0 to make it work.
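
For context, below is a minimal sketch of the kind of KafkaIO read pipeline that can surface this error on a runner lacking a matching bounded-read evaluator. The broker address, topic, deserializers, and the bounded-read cap are assumptions for illustration, not details taken from the report.

{code}
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Values;
import org.apache.kafka.common.serialization.LongDeserializer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class KafkaReadExample {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    p.apply("ReadFromKafka",
            KafkaIO.<Long, String>read()
                .withBootstrapServers("broker:9092")          // assumed broker address
                .withTopic("events")                          // assumed topic name
                .withKeyDeserializer(LongDeserializer.class)
                .withValueDeserializer(StringDeserializer.class)
                .withMaxNumRecords(1000)                      // makes the read bounded
                .withoutMetadata())
        .apply(Values.<String>create());

    p.run().waitUntilFinish();
  }
}
{code}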



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (BEAM-7878) Refine Spark runner dependencies

2019-08-11 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/BEAM-7878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismaël Mejía updated BEAM-7878:
---
Description: 
The Spark runner has more dependencies than it needs:
 * The jackson_module_scala dependency is marked as compile but it is only 
needed at runtime, so it ends up leaking this module and breaking compatibility 
for users providing Spark on Scala 2.12.
 * The dropwizard, scala-library and commons-compress dependencies are declared 
but unused.
 * The Kafka dependency version is not aligned with the rest of Beam.
 * The Kafka and zookeeper dependencies are used only for the tests but are 
currently marked as provided.

  was:
The Spark runner has more dependencies than it needs:
 * The jackson_module_scala dependency is marked as compile but it is only 
needed at runtime so it ends up leaking this module.
 * The dropwizard, scala-library and commons-compress dependencies are declared 
but unused.
 * The Kafka dependency version is not aligned with the rest of Beam.
 * The Kafka and zookeeper dependencies are used only for the tests but 
currently are provided


> Refine Spark runner dependencies
> 
>
> Key: BEAM-7878
> URL: https://issues.apache.org/jira/browse/BEAM-7878
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-spark
>Reporter: Ismaël Mejía
>Assignee: Ismaël Mejía
>Priority: Minor
> Fix For: 2.15.0
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> The Spark runner has more dependencies than it needs:
>  * The jackson_module_scala dependency is marked as compile but it is only 
> needed at runtime so it ends up leaking this module and breaking 
> compatibility for users providing Spark on scala 2.12.
>  * The dropwizard, scala-library and commons-compress dependencies are 
> declared but unused.
>  * The Kafka dependency version is not aligned with the rest of Beam.
>  * The Kafka and zookeeper dependencies are used only for the tests but 
> currently are provided



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7878) Refine Spark runner dependencies

2019-08-11 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7878?focusedWorklogId=292778&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-292778
 ]

ASF GitHub Bot logged work on BEAM-7878:


Author: ASF GitHub Bot
Created on: 11/Aug/19 21:36
Start Date: 11/Aug/19 21:36
Worklog Time Spent: 10m 
  Work Description: iemejia commented on issue #9310: 
[release-2.15.0][BEAM-7878] Refine Spark runner dependencies
URL: https://github.com/apache/beam/pull/9310#issuecomment-520262881
 
 
   Run Java Spark PortableValidatesRunner Batch
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 292778)
Time Spent: 2h 40m  (was: 2.5h)

> Refine Spark runner dependencies
> 
>
> Key: BEAM-7878
> URL: https://issues.apache.org/jira/browse/BEAM-7878
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-spark
>Reporter: Ismaël Mejía
>Assignee: Ismaël Mejía
>Priority: Minor
> Fix For: 2.15.0
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> The Spark runner has more dependencies than it needs:
>  * The jackson_module_scala dependency is marked as compile but it is only 
> needed at runtime so it ends up leaking this module.
>  * The dropwizard, scala-library and commons-compress dependencies are 
> declared but unused.
>  * The Kafka dependency version is not aligned with the rest of Beam.
>  * The Kafka and zookeeper dependencies are used only for the tests but 
> currently are provided



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (BEAM-7878) Refine Spark runner dependencies

2019-08-11 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/BEAM-7878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismaël Mejía updated BEAM-7878:
---
Status: Open  (was: Triage Needed)

> Refine Spark runner dependencies
> 
>
> Key: BEAM-7878
> URL: https://issues.apache.org/jira/browse/BEAM-7878
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-spark
>Reporter: Ismaël Mejía
>Assignee: Ismaël Mejía
>Priority: Minor
> Fix For: 2.15.0
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> The Spark runner has more dependencies than it needs:
>  * The jackson_module_scala dependency is marked as compile but it is only 
> needed at runtime so it ends up leaking this module.
>  * The dropwizard, scala-library and commons-compress dependencies are 
> declared but unused.
>  * The Kafka dependency version is not aligned with the rest of Beam.
>  * The Kafka and zookeeper dependencies are used only for the tests but 
> currently are provided



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7878) Refine Spark runner dependencies

2019-08-11 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7878?focusedWorklogId=292777&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-292777
 ]

ASF GitHub Bot logged work on BEAM-7878:


Author: ASF GitHub Bot
Created on: 11/Aug/19 21:35
Start Date: 11/Aug/19 21:35
Worklog Time Spent: 10m 
  Work Description: iemejia commented on issue #9310: 
[release-2.15.0][BEAM-7878] Refine Spark runner dependencies
URL: https://github.com/apache/beam/pull/9310#issuecomment-520262859
 
 
   Run Spark ValidatesRunner
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 292777)
Time Spent: 2.5h  (was: 2h 20m)

> Refine Spark runner dependencies
> 
>
> Key: BEAM-7878
> URL: https://issues.apache.org/jira/browse/BEAM-7878
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-spark
>Reporter: Ismaël Mejía
>Assignee: Ismaël Mejía
>Priority: Minor
> Fix For: 2.15.0
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> The Spark runner has more dependencies than it needs:
>  * The jackson_module_scala dependency is marked as compile but it is only 
> needed at runtime so it ends up leaking this module.
>  * The dropwizard, scala-library and commons-compress dependencies are 
> declared but unused.
>  * The Kafka dependency version is not aligned with the rest of Beam.
>  * The Kafka and zookeeper dependencies are used only for the tests but 
> currently are provided



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7878) Refine Spark runner dependencies

2019-08-11 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7878?focusedWorklogId=292776&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-292776
 ]

ASF GitHub Bot logged work on BEAM-7878:


Author: ASF GitHub Bot
Created on: 11/Aug/19 21:35
Start Date: 11/Aug/19 21:35
Worklog Time Spent: 10m 
  Work Description: iemejia commented on pull request #9310: 
[release-2.15.0][BEAM-7878] Refine Spark runner dependencies
URL: https://github.com/apache/beam/pull/9310
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 292776)
Time Spent: 2h 20m  (was: 2h 10m)

> Refine Spark runner dependencies
> 
>
> Key: BEAM-7878
> URL: https://issues.apache.org/jira/browse/BEAM-7878
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-spark
>Reporter: Ismaël Mejía
>Assignee: Ismaël Mejía
>Priority: Minor
> Fix For: 2.15.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> The Spark runner has more dependencies than it needs:
>  * The jackson_module_scala dependency is marked as compile but it is only 
> needed at runtime so it ends up leaking this module.
>  * The dropwizard, scala-library and commons-compress dependencies are 
> declared but unused.
>  * The Kafka dependency version is not aligned with the rest of Beam.
>  * The Kafka and zookeeper dependencies are used only for the tests but 
> currently are provided



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Reopened] (BEAM-7878) Refine Spark runner dependencies

2019-08-11 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/BEAM-7878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismaël Mejía reopened BEAM-7878:


Reopening to cherry-pick it into release 2.15.0 because this addresses an issue 
with the Spark runner for users providing Spark on Scala 2.12.

> Refine Spark runner dependencies
> 
>
> Key: BEAM-7878
> URL: https://issues.apache.org/jira/browse/BEAM-7878
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-spark
>Reporter: Ismaël Mejía
>Assignee: Ismaël Mejía
>Priority: Minor
> Fix For: 2.15.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> The Spark runner has more dependencies than it needs:
>  * The jackson_module_scala dependency is marked as compile but it is only 
> needed at runtime so it ends up leaking this module.
>  * The dropwizard, scala-library and commons-compress dependencies are 
> declared but unused.
>  * The Kafka dependency version is not aligned with the rest of Beam.
>  * The Kafka and zookeeper dependencies are used only for the tests but 
> currently are provided



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (BEAM-7878) Refine Spark runner dependencies

2019-08-11 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/BEAM-7878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismaël Mejía updated BEAM-7878:
---
Fix Version/s: (was: 2.16.0)
   2.15.0

> Refine Spark runner dependencies
> 
>
> Key: BEAM-7878
> URL: https://issues.apache.org/jira/browse/BEAM-7878
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-spark
>Reporter: Ismaël Mejía
>Assignee: Ismaël Mejía
>Priority: Minor
> Fix For: 2.15.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> The Spark runner has more dependencies than it needs:
>  * The jackson_module_scala dependency is marked as compile but it is only 
> needed at runtime so it ends up leaking this module.
>  * The dropwizard, scala-library and commons-compress dependencies are 
> declared but unused.
>  * The Kafka dependency version is not aligned with the rest of Beam.
>  * The Kafka and zookeeper dependencies are used only for the tests but 
> currently are provided



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (BEAM-7833) warn user when --region flag is not explicitly set

2019-08-11 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/BEAM-7833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismaël Mejía resolved BEAM-7833.

Resolution: Fixed

> warn user when --region flag is not explicitly set
> --
>
> Key: BEAM-7833
> URL: https://issues.apache.org/jira/browse/BEAM-7833
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-dataflow
>Reporter: Kyle Weaver
>Assignee: Kyle Weaver
>Priority: Major
> Fix For: 2.15.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (BEAM-6451) Portability Pipeline eventually hangs on bundle registration

2019-08-11 Thread Oded Valtzer (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-6451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16904727#comment-16904727
 ] 

Oded Valtzer commented on BEAM-6451:


Did someone have a look at this issue? We are seeing these "processing stuck" 
operations and can't understand why they happen. It's on the 2.14 Python SDK on 
Dataflow, mainly happening when we have intense, long-running steps...

> Portability Pipeline eventually hangs on bundle registration
> 
>
> Key: BEAM-6451
> URL: https://issues.apache.org/jira/browse/BEAM-6451
> Project: Beam
>  Issue Type: Bug
>  Components: java-fn-execution, runner-dataflow, sdk-py-harness
>Reporter: Scott Wegner
>Assignee: Mikhail Gryzykhin
>Priority: Minor
>  Labels: portability, triaged
>
> We've seen jobs using portability start off in a healthy state, but then 
> eventually get stuck and hang on bundle registration. We see error logs from 
> the worker harness:
> {code}
> Processing stuck in step s01 for at least 06h30m00s without outputting or 
> completing in state finish at 
> sun.misc.Unsafe.park(Native Method) at 
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at 
> java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693)
>  at 
> java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323) at 
> java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729)
>  at 
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895) at 
> org.apache.beam.sdk.util.MoreFutures.get(MoreFutures.java:57) at 
> org.apache.beam.runners.dataflow.worker.fn.control.RegisterAndProcessBundleOperation.finish(RegisterAndProcessBundleOperation.java:277)
>  at 
> org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:85)
>  at 
> org.apache.beam.runners.dataflow.worker.fn.control.BeamFnMapTaskExecutor.execute(BeamFnMapTaskExecutor.java:119)
>  at 
> org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1226)
>  at 
> org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.access$1000(StreamingDataflowWorker.java:141)
>  at 
> org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker$6.run(StreamingDataflowWorker.java:965)
>  at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at 
> java.lang.Thread.run(Thread.java:745)
> {code}
> Looking at [the 
> code|https://github.com/apache/beam/blob/release-2.8.0/runners/google-cloud-dataflow-java/worker/src/main/java/org/apache/beam/runners/dataflow/worker/fn/control/RegisterAndProcessBundleOperation.java#L277],
>  it looks like there are no timeouts on the Bundle Registration calls over 
> the FnApi, which contributes to this hanging forever rather than giving a 
> better failure.
> This bug report came from a customer running a python streaming pipeline 
> using the new portability framework on Dataflow. Hopefully we can repro on 
> our own in order to link to the job / logs.
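
As an illustration of the missing-timeout point quoted above (this is not Beam's actual fix), the sketch below bounds the wait on a CompletableFuture so the worker fails fast instead of parking indefinitely; the timeout value and error handling are assumptions.

{code}
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedWaitExample {
  public static void main(String[] args) throws Exception {
    // Stand-in for the bundle-registration future; the real call goes over the Fn API.
    CompletableFuture<String> registration = CompletableFuture.supplyAsync(() -> "registered");

    try {
      // Wait a bounded amount of time instead of blocking indefinitely.
      String response = registration.get(30, TimeUnit.SECONDS);
      System.out.println("Bundle registration completed: " + response);
    } catch (TimeoutException e) {
      // Fail loudly rather than hanging in a "Processing stuck" state.
      throw new IllegalStateException("Bundle registration did not complete in time", e);
    }
  }
}
{code}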



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-3489) Expose the message id of received messages within PubsubMessage

2019-08-11 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-3489?focusedWorklogId=292730&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-292730
 ]

ASF GitHub Bot logged work on BEAM-3489:


Author: ASF GitHub Bot
Created on: 11/Aug/19 17:28
Start Date: 11/Aug/19 17:28
Worklog Time Spent: 10m 
  Work Description: thinhha commented on issue #8370: [BEAM-3489] add 
PubSub messageId in PubsubMessage
URL: https://github.com/apache/beam/pull/8370#issuecomment-520245762
 
 
   Thanks for the update @lukecwik. Good thing the IT caught this!
   
   I've added the changes we discussed above. We can rerun the postCommit tests 
once the dataflow backend change is complete.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 292730)
Time Spent: 7h 20m  (was: 7h 10m)

> Expose the message id of received messages within PubsubMessage
> ---
>
> Key: BEAM-3489
> URL: https://issues.apache.org/jira/browse/BEAM-3489
> Project: Beam
>  Issue Type: New Feature
>  Components: io-java-gcp
>Reporter: Luke Cwik
>Assignee: Thinh Ha
>Priority: Minor
>  Labels: newbie, starter
>  Time Spent: 7h 20m
>  Remaining Estimate: 0h
>
> This task is about passing forward the message id from the pubsub proto to 
> the java PubsubMessage.
> Add a message id field to PubsubMessage.
> Update the coder for PubsubMessage to encode the message id.
> Update the translation from the Pubsub proto message to the Dataflow message:
> https://github.com/apache/beam/blob/2e275264b21db45787833502e5e42907b05e28b8/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubUnboundedSource.java#L976
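
A rough sketch of a message value type that carries the message id alongside the payload and attributes, as the steps above describe. The class and accessor names here are illustrative assumptions, not necessarily the API that was merged.

{code}
import java.util.Collections;
import java.util.Map;

public class PubsubMessageWithId {
  private final byte[] payload;
  private final Map<String, String> attributes;
  private final String messageId;  // set by the source when reading; null when publishing

  public PubsubMessageWithId(byte[] payload, Map<String, String> attributes, String messageId) {
    this.payload = payload;
    this.attributes = attributes == null ? Collections.<String, String>emptyMap() : attributes;
    this.messageId = messageId;
  }

  public byte[] getPayload() {
    return payload;
  }

  public Map<String, String> getAttributeMap() {
    return attributes;
  }

  public String getMessageId() {
    return messageId;
  }
}
{code}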



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-3489) Expose the message id of received messages within PubsubMessage

2019-08-11 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-3489?focusedWorklogId=292729&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-292729
 ]

ASF GitHub Bot logged work on BEAM-3489:


Author: ASF GitHub Bot
Created on: 11/Aug/19 17:27
Start Date: 11/Aug/19 17:27
Worklog Time Spent: 10m 
  Work Description: thinhha commented on issue #8370: [BEAM-3489] add 
PubSub messageId in PubsubMessage
URL: https://github.com/apache/beam/pull/8370#issuecomment-520245762
 
 
   Thanks for the update @lukecwik! Good thing the IT caught this.
   
   I've added the changes we discussed above. We can rerun the postCommit tests 
once the dataflow backend change is complete.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 292729)
Time Spent: 7h 10m  (was: 7h)

> Expose the message id of received messages within PubsubMessage
> ---
>
> Key: BEAM-3489
> URL: https://issues.apache.org/jira/browse/BEAM-3489
> Project: Beam
>  Issue Type: New Feature
>  Components: io-java-gcp
>Reporter: Luke Cwik
>Assignee: Thinh Ha
>Priority: Minor
>  Labels: newbie, starter
>  Time Spent: 7h 10m
>  Remaining Estimate: 0h
>
> This task is about passing forward the message id from the pubsub proto to 
> the java PubsubMessage.
> Add a message id field to PubsubMessage.
> Update the coder for PubsubMessage to encode the message id.
> Update the translation from the Pubsub proto message to the Dataflow message:
> https://github.com/apache/beam/blob/2e275264b21db45787833502e5e42907b05e28b8/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubUnboundedSource.java#L976



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7819) PubsubMessage message parsing is lacking non-attribute fields

2019-08-11 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7819?focusedWorklogId=292711&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-292711
 ]

ASF GitHub Bot logged work on BEAM-7819:


Author: ASF GitHub Bot
Created on: 11/Aug/19 14:26
Start Date: 11/Aug/19 14:26
Worklog Time Spent: 10m 
  Work Description: matt-darwin commented on issue #9232: [BEAM-7819] 
Python - parse PubSub message_id into attributes property
URL: https://github.com/apache/beam/pull/9232#issuecomment-520232592
 
 
   Need to fix up some tests, and the PubSubMessage tests need amending to be 
in line with those. I haven't had a chance to test with the DataflowRunner yet.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 292711)
Time Spent: 4h  (was: 3h 50m)

> PubsubMessage message parsing is lacking non-attribute fields
> -
>
> Key: BEAM-7819
> URL: https://issues.apache.org/jira/browse/BEAM-7819
> Project: Beam
>  Issue Type: Bug
>  Components: io-py-gcp
>Reporter: Ahmet Altay
>Assignee: Udi Meiri
>Priority: Major
>  Time Spent: 4h
>  Remaining Estimate: 0h
>
> User reported issue: 
> https://lists.apache.org/thread.html/139b0c15abc6471a2e2202d76d915c645a529a23ecc32cd9cfecd315@%3Cuser.beam.apache.org%3E
> """
> Looking at the source code, with my untrained Python eyes, I think if the 
> intention is to include the message id and the publish time in the attributes 
> attribute of the PubsubMessage type, then the protobuf mapping is missing 
> something:
> @staticmethod
> def _from_proto_str(proto_msg):
>   """Construct from serialized form of ``PubsubMessage``.
>   Args:
>     proto_msg: String containing a serialized protobuf of type
>       https://cloud.google.com/pubsub/docs/reference/rpc/google.pubsub.v1#google.pubsub.v1.PubsubMessage
>   Returns:
>     A new PubsubMessage object.
>   """
>   msg = pubsub.types.pubsub_pb2.PubsubMessage()
>   msg.ParseFromString(proto_msg)
>   # Convert ScalarMapContainer to dict.
>   attributes = dict((key, msg.attributes[key]) for key in msg.attributes)
>   return PubsubMessage(msg.data, attributes)
> The protobuf definition is here:
> https://cloud.google.com/pubsub/docs/reference/rpc/google.pubsub.v1#google.pubsub.v1.PubsubMessage
> and so it looks as if the message_id and publish_time are not being parsed, as 
> they are separate from the attributes. Perhaps the PubsubMessage class needs 
> expanding to include these as attributes, or they would need adding to the 
> dictionary of attributes. This would only need doing for _from_proto_str, as 
> obviously they would not need to be populated when transmitting a message 
> to PubSub.
> My Python is not great; I'm assuming the latter option would need to look 
> something like this?
>   attributes = dict((key, msg.attributes[key]) for key in msg.attributes)
>   attributes.update({'message_id': msg.message_id, 'publish_time': 
>                      msg.publish_time})
>   return PubsubMessage(msg.data, attributes)
> """



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (BEAM-7808) Add method to convert avro field to beam field

2019-08-11 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/BEAM-7808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismaël Mejía updated BEAM-7808:
---
Fix Version/s: 2.15.0

> Add method to convert avro field to beam field
> --
>
> Key: BEAM-7808
> URL: https://issues.apache.org/jira/browse/BEAM-7808
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-java-core
>Reporter: Vishwas
>Assignee: Vishwas
>Priority: Minor
> Fix For: 2.15.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Currently we have static methods in the AvroUtils class, which exposes methods 
> like toBeamSchema() and toBeamRowStrict(). These methods work only on a 
> GenericRecord.
> A method is now added that will also support converting an Avro field to a Beam 
> field.
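
A hedged usage sketch of the conversions described above. The record schema is made up, and the field-level call follows this ticket's wording, so treat the method name as an assumption.

{code}
import org.apache.avro.SchemaBuilder;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.schemas.utils.AvroUtils;

public class AvroToBeamExample {
  public static void main(String[] args) {
    // Made-up Avro record schema for illustration.
    org.apache.avro.Schema avroSchema = SchemaBuilder.record("User").fields()
        .requiredString("name")
        .requiredInt("age")
        .endRecord();

    // Whole-schema conversion, available before this ticket.
    Schema beamSchema = AvroUtils.toBeamSchema(avroSchema);
    System.out.println(beamSchema);

    // Field-level conversion described by this ticket (method name assumed from the summary).
    Schema.Field beamField = AvroUtils.toBeamField(avroSchema.getField("age"));
    System.out.println(beamField);
  }
}
{code}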



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (BEAM-7808) Add method to convert avro field to beam field

2019-08-11 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/BEAM-7808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismaël Mejía resolved BEAM-7808.

Resolution: Fixed

> Add method to convert avro field to beam field
> --
>
> Key: BEAM-7808
> URL: https://issues.apache.org/jira/browse/BEAM-7808
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-java-core
>Reporter: Vishwas
>Assignee: Vishwas
>Priority: Minor
> Fix For: 2.15.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Currently we have static methods in the AvroUtils class, which exposes methods 
> like toBeamSchema() and toBeamRowStrict(). These methods work only on a 
> GenericRecord.
> A method is now added that will also support converting an Avro field to a Beam 
> field.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (BEAM-6777) SDK Harness Resilience

2019-08-11 Thread Oded Valtzer (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-6777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16904657#comment-16904657
 ] 

Oded Valtzer commented on BEAM-6777:


Hey guys,
Quick question on this: we are experiencing OOMs after more than a day of 
running the pipeline. We do intense CPU/memory computations in a single step of 
the pipeline, and at some point one or more workers reach OOM.
At this point the worker is killed by Windmill (we run Python 2.7 streaming on 
Dataflow, Beam 2.14).
Once the workers get into this state they never recover and reach what you 
describe in the description of this ticket. I failed to understand what the 
status of the ticket is; can you briefly explain?

Thanks for working on this,
Oded

> SDK Harness Resilience
> --
>
> Key: BEAM-6777
> URL: https://issues.apache.org/jira/browse/BEAM-6777
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-dataflow
>Reporter: Sam Rohde
>Assignee: Yueyang Qiu
>Priority: Major
>  Time Spent: 7h 20m
>  Remaining Estimate: 0h
>
> If the Python SDK Harness crashes in any way (user code exception, OOM, etc) 
> the job will hang and waste resources. The fix is to add a daemon in the SDK 
> Harness and Runner Harness to communicate with Dataflow to restart the VM 
> when stuckness is detected.
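
A generic sketch of the stuck-detection idea described above, not Dataflow's actual daemon; the progress-tracking mechanism, threshold, and restart hook are assumptions.

{code}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class HarnessWatchdog {
  private final AtomicLong lastProgressMillis = new AtomicLong(System.currentTimeMillis());
  private final long stuckThresholdMillis;
  private final Runnable restartHook;  // e.g. a callback that asks the service to recycle the VM

  public HarnessWatchdog(long stuckThresholdMillis, Runnable restartHook) {
    this.stuckThresholdMillis = stuckThresholdMillis;
    this.restartHook = restartHook;
  }

  /** Called by the harness whenever a bundle makes progress. */
  public void recordProgress() {
    lastProgressMillis.set(System.currentTimeMillis());
  }

  /** Starts a daemon thread that fires the restart hook when no progress is seen. */
  public void start() {
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
      Thread t = new Thread(r, "harness-watchdog");
      t.setDaemon(true);
      return t;
    });
    scheduler.scheduleAtFixedRate(() -> {
      if (System.currentTimeMillis() - lastProgressMillis.get() > stuckThresholdMillis) {
        restartHook.run();
      }
    }, 1, 1, TimeUnit.MINUTES);
  }
}
{code}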



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7696) If classpath directory contains a directory it causes exception in Spark runner staging

2019-08-11 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7696?focusedWorklogId=292685&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-292685
 ]

ASF GitHub Bot logged work on BEAM-7696:


Author: ASF GitHub Bot
Created on: 11/Aug/19 11:19
Start Date: 11/Aug/19 11:19
Worklog Time Spent: 10m 
  Work Description: yanlin-Lynn commented on issue #9019: [BEAM-7696] 
Prepare files to stage also in local master of spark runner.
URL: https://github.com/apache/beam/pull/9019#issuecomment-520220096
 
 
   > I suggest that we re-use the code in Dataflow and to automatically convert 
directory classpath entries into jars:
   > 
https://github.com/apache/beam/blob/3e97543d53cf30b7ac072e225d358e8436784220/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/util/PackageUtil.java#L256
   > 
   > It will likely require some refactoring to make it available to more 
runners.
   
   Currently, `detectClassPathResourcesToStage` and `prepareFilesForStaging` 
exist in 
[PipelineResources](https://github.com/apache/beam/blob/master/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/PipelineResources.java#L51).
 Do you mean to do the packaging of directories inside 
`detectClassPathResourcesToStage`, so there is no need to call 
`prepareFilesForStaging` manually?
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 292685)
Time Spent: 4h 20m  (was: 4h 10m)

> If classpath directory contains a directory it causes exception in Spark 
> runner staging
> ---
>
> Key: BEAM-7696
> URL: https://issues.apache.org/jira/browse/BEAM-7696
> Project: Beam
>  Issue Type: Bug
>  Components: runner-spark
>Reporter: Wang Yanlin
>Assignee: Wang Yanlin
>Priority: Minor
> Fix For: 2.15.0
>
> Attachments: addJar_exception.jpg, files_contains_dir.jpg
>
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> Running the unit test SparkPipelineStateTest.testBatchPipelineRunningState in 
> IntelliJ IDEA on my Mac, I get an IllegalArgumentException in the console 
> output. I checked the source code and found that the result of 
> _PipelineResources.detectClassPathResourcesToStage_ contains a directory, which 
> is the cause of the exception.
> See the attached file 'addJar_exception.jpg' for details; the result of 
> _PipelineResources.detectClassPathResourcesToStage_
> is shown in the attached file 'files_contains_dir.jpg' during debugging.
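
A generic sketch (not Beam's PackageUtil or PipelineResources code) of the directory-to-jar packaging discussed in the comment above: a classpath directory entry is zipped into a temporary jar so only plain files are handed to the Spark staging step. The paths are placeholders.

{code}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class DirectoryToJar {

  /** Packs every regular file under classpathDir into a temporary jar and returns its path. */
  public static Path zipDirectory(Path classpathDir) throws IOException {
    Path jar = Files.createTempFile("staged-classes", ".jar");
    try (ZipOutputStream zos = new ZipOutputStream(Files.newOutputStream(jar));
         Stream<Path> walk = Files.walk(classpathDir)) {
      List<Path> files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
      for (Path file : files) {
        // Zip entry names are relative to the directory root and use forward slashes.
        String entryName = classpathDir.relativize(file).toString().replace('\\', '/');
        zos.putNextEntry(new ZipEntry(entryName));
        Files.copy(file, zos);
        zos.closeEntry();
      }
    }
    return jar;
  }

  public static void main(String[] args) throws IOException {
    Path jar = zipDirectory(Paths.get("build/classes"));  // placeholder classpath directory
    System.out.println("Packaged directory into: " + jar);
  }
}
{code}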



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7742) BigQuery File Loads to work well with load job size limits

2019-08-11 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7742?focusedWorklogId=292653&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-292653
 ]

ASF GitHub Bot logged work on BEAM-7742:


Author: ASF GitHub Bot
Created on: 11/Aug/19 07:26
Start Date: 11/Aug/19 07:26
Worklog Time Spent: 10m 
  Work Description: ttanay commented on issue #9242: [BEAM-7742] Partition 
files in BQFL to cater to quotas & limits
URL: https://github.com/apache/beam/pull/9242#issuecomment-520206704
 
 
   Run Python 2 PostCommit
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 292653)
Time Spent: 1h 10m  (was: 1h)

> BigQuery File Loads to work well with load job size limits
> --
>
> Key: BEAM-7742
> URL: https://issues.apache.org/jira/browse/BEAM-7742
> Project: Beam
>  Issue Type: Improvement
>  Components: io-py-gcp
>Reporter: Pablo Estrada
>Assignee: Tanay Tummalapalli
>Priority: Major
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Load jobs into BigQuery have a number of limitations: 
> [https://cloud.google.com/bigquery/quotas#load_jobs]
>  
> Currently, the Python BQ sink implemented in `bigquery_file_loads.py` does 
> not handle these limitations well. Improvements need to be made to the 
> implementation, to:
>  * Decide to use temp_tables dynamically at pipeline execution
>  * Add code to determine when a load job to a single destination needs to be 
> partitioned into multiple jobs.
> When this happens, we definitely need to use temp_tables, in case one of the 
> two load jobs fails and the pipeline is rerun.
> Tanay, would you be able to look at this?
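
A generic sketch of the partitioning idea, written in Java for illustration rather than matching the Python implementation in bigquery_file_loads.py; the size and file-count limits are parameters, not BigQuery's exact quota values.

{code}
import java.util.ArrayList;
import java.util.List;

public class LoadJobPartitioner {

  /**
   * Greedily groups files into partitions so that each partition stays under
   * both a maximum total byte size and a maximum file count, one load job each.
   */
  public static List<List<String>> partition(
      List<String> files, List<Long> sizes, long maxBytes, int maxFiles) {
    List<List<String>> partitions = new ArrayList<>();
    List<String> current = new ArrayList<>();
    long currentBytes = 0;

    for (int i = 0; i < files.size(); i++) {
      long size = sizes.get(i);
      boolean overflows =
          !current.isEmpty() && (currentBytes + size > maxBytes || current.size() >= maxFiles);
      if (overflows) {
        // Close the current partition and start a new load job for the next files.
        partitions.add(current);
        current = new ArrayList<>();
        currentBytes = 0;
      }
      current.add(files.get(i));
      currentBytes += size;
    }
    if (!current.isEmpty()) {
      partitions.add(current);
    }
    return partitions;
  }
}
{code}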



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)