[jira] [Created] (BEAM-10861) Adds URNs and payloads to PubSub transforms

2020-09-08 Thread Chamikara Madhusanka Jayalath (Jira)
Chamikara Madhusanka Jayalath created BEAM-10861:


 Summary: Adds URNs and payloads to PubSub transforms
 Key: BEAM-10861
 URL: https://issues.apache.org/jira/browse/BEAM-10861
 Project: Beam
  Issue Type: Bug
  Components: cross-language, runner-dataflow, sdk-py-core
Reporter: Chamikara Madhusanka Jayalath


This is needed to allow runners to override the portable definitions of transforms.
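As a rough illustration of why a URN plus payload enables runner overrides, here is a minimal, hypothetical sketch (not Beam's actual API — the URN string, class, and function names are invented for this example): a transform identified by a stable URN can be swapped for a runner-native implementation.

```python
from dataclasses import dataclass

# Hypothetical sketch (not Beam's actual API): a transform that carries a
# stable URN plus a serialized payload, so a runner can recognize it and
# substitute a native implementation instead of executing the SDK's code.
@dataclass
class TransformSpec:
    urn: str        # stable identifier, e.g. "beam:transform:pubsub_read:v1" (illustrative)
    payload: bytes  # serialized configuration (topic, subscription, ...)

def maybe_override(spec: TransformSpec, native_impls: dict) -> str:
    """Return the runner's native implementation for a known URN,
    or fall back to the SDK-provided expansion."""
    return native_impls.get(spec.urn, "sdk-provided")

spec = TransformSpec("beam:transform:pubsub_read:v1", b"topic=projects/p/topics/t")
impls = {"beam:transform:pubsub_read:v1": "dataflow-native-pubsub-read"}
print(maybe_override(spec, impls))  # -> dataflow-native-pubsub-read
```

Without the URN, a runner has no stable key to match on and must run the transform exactly as the SDK defined it.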



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-8376) Add FirestoreIO connector to Java SDK

2020-09-09 Thread Chamikara Madhusanka Jayalath (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17193256#comment-17193256
 ] 

Chamikara Madhusanka Jayalath commented on BEAM-8376:
-

The Firestore team is currently looking into this.

> Add FirestoreIO connector to Java SDK
> -
>
> Key: BEAM-8376
> URL: https://issues.apache.org/jira/browse/BEAM-8376
> Project: Beam
>  Issue Type: New Feature
>  Components: io-java-gcp
>Reporter: Stefan Djelekar
>Priority: P3
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> Motivation:
> There is no Firestore connector for Java SDK at the moment.
> Having it will enhance the integrations with database options on the Google 
> Cloud Platform.





[jira] [Commented] (BEAM-2717) Infer coders in SDK prior to handing off pipeline to Runner

2020-10-05 Thread Chamikara Madhusanka Jayalath (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-2717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17208215#comment-17208215
 ] 

Chamikara Madhusanka Jayalath commented on BEAM-2717:
-

I'm not currently working on this. This sounds like a simplification we can 
make, but it is not a blocker for portable job submission.
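The idea of inferring coders once in the SDK can be sketched with a toy registry; this is a hypothetical illustration (the registry contents and function name are invented, not Beam's actual implementation):

```python
# Minimal, hypothetical sketch of coder inference: map an element type to a
# coder before the pipeline is handed to the runner, so each runner does not
# have to duplicate this lookup itself. Names are illustrative, not Beam's.
_CODER_REGISTRY = {
    int: "VarIntCoder",
    str: "StrUtf8Coder",
    bytes: "BytesCoder",
}

def infer_coder(element_type):
    """Return a registered coder name for a type, falling back to a
    general-purpose pickled coder when nothing more specific is known."""
    return _CODER_REGISTRY.get(element_type, "PickleCoder")

print(infer_coder(int))    # -> VarIntCoder
print(infer_coder(float))  # -> PickleCoder
```

Storing the resolved coder (rather than the element type) in the pipeline proto is what removes the per-runner duplication the issue describes.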

> Infer coders in SDK prior to handing off pipeline to Runner
> ---
>
> Key: BEAM-2717
> URL: https://issues.apache.org/jira/browse/BEAM-2717
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Robert Bradshaw
>Priority: P1
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> Currently all runners have to duplicate this work, and there's also a hack 
> storing the element type rather than the coder in the Runner protos.





[jira] [Created] (BEAM-11026) When determining expansion service jar, depend on Maven coords instead of Gradle target

2020-10-06 Thread Chamikara Madhusanka Jayalath (Jira)
Chamikara Madhusanka Jayalath created BEAM-11026:


 Summary: When determining expansion service jar, depend on Maven 
coords instead of Gradle target
 Key: BEAM-11026
 URL: https://issues.apache.org/jira/browse/BEAM-11026
 Project: Beam
  Issue Type: Bug
  Components: cross-language, sdk-py-core
Reporter: Chamikara Madhusanka Jayalath


For example here: 
https://github.com/apache/beam/blob/release-2.24.0/sdks/python/apache_beam/io/kafka.py#L107

Maven coordinates are more stable than the Gradle target name, so it's better to 
depend on the former. See [1] for details.

[1] https://issues.apache.org/jira/browse/BEAM-10986

 

cc: [~robertwb] [~bhulette] [~kenn]
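To make the point concrete, Maven coordinates deterministically encode a repository path, so a jar can be located without knowing any Gradle target name. A minimal sketch (the helper and the example artifact coordinates are illustrative assumptions, not Beam's actual lookup code):

```python
# Hypothetical sketch: resolve an expansion-service jar from Maven
# coordinates ("group:artifact:version") rather than a Gradle target name.
# The coordinates are stable across build-system refactorings, while Gradle
# target names can change whenever the build is reorganized.
def jar_path_from_maven_coords(coords: str, repo_root: str = "~/.m2/repository") -> str:
    group, artifact, version = coords.split(":")
    return "/".join([repo_root, group.replace(".", "/"), artifact, version,
                     f"{artifact}-{version}.jar"])

print(jar_path_from_maven_coords(
    "org.apache.beam:beam-sdks-java-io-expansion-service:2.24.0"))
```

The standard Maven repository layout (`group/with/slashes/artifact/version/artifact-version.jar`) is what makes this mapping stable.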





[jira] [Updated] (BEAM-11026) When determining expansion service jar, depend on Maven coords instead of Gradle target

2020-10-06 Thread Chamikara Madhusanka Jayalath (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-11026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chamikara Madhusanka Jayalath updated BEAM-11026:
-
Issue Type: Improvement  (was: Bug)

> When determining expansion service jar, depend on Maven coords instead of 
> Gradle target
> ---
>
> Key: BEAM-11026
> URL: https://issues.apache.org/jira/browse/BEAM-11026
> Project: Beam
>  Issue Type: Improvement
>  Components: cross-language, sdk-py-core
>Reporter: Chamikara Madhusanka Jayalath
>Priority: P2
>
> For example here: 
> https://github.com/apache/beam/blob/release-2.24.0/sdks/python/apache_beam/io/kafka.py#L107
> Maven coordinates are more stable than the Gradle target name, so it's better 
> to depend on the former. See [1] for details.
> [1] https://issues.apache.org/jira/browse/BEAM-10986
>  
> cc: [~robertwb] [~bhulette] [~kenn]





[jira] [Commented] (BEAM-10986) :build task doesn't build shaded jar with shadow >5.0.0

2020-10-06 Thread Chamikara Madhusanka Jayalath (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-10986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209129#comment-17209129
 ] 

Chamikara Madhusanka Jayalath commented on BEAM-10986:
--

Thanks for root-causing and fixing this, Brian.

Filed https://issues.apache.org/jira/browse/BEAM-11026 for depending on Maven 
coords instead of Gradle target.

> :build task doesn't build shaded jar with shadow >5.0.0
> ---
>
> Key: BEAM-10986
> URL: https://issues.apache.org/jira/browse/BEAM-10986
> Project: Beam
>  Issue Type: Bug
>  Components: build-system
>Reporter: Brian Hulette
>Assignee: Brian Hulette
>Priority: P0
> Fix For: 2.25.0
>
>  Time Spent: 5h 10m
>  Remaining Estimate: 0h
>
> Not 100% sure but I believe this will prevent us from releasing shaded jars.
> See 
> [https://lists.apache.org/thread.html/r12c785c2b01017329204ad816740e80b70c19c5f2bb1ea01535a987d%40%3Cdev.beam.apache.org%3E]
>  for details.





[jira] [Created] (BEAM-11033) Update Dataflow metrics processor to handle portable jobs

2020-10-06 Thread Chamikara Madhusanka Jayalath (Jira)
Chamikara Madhusanka Jayalath created BEAM-11033:


 Summary: Update Dataflow metrics processor to handle portable jobs
 Key: BEAM-11033
 URL: https://issues.apache.org/jira/browse/BEAM-11033
 Project: Beam
  Issue Type: Bug
  Components: sdk-py-core
Reporter: Chamikara Madhusanka Jayalath
Assignee: Chamikara Madhusanka Jayalath


Currently, the Dataflow metrics processor expects the Dataflow-internal step 
names generated for the v1beta3 job description in the metrics returned by the Dataflow service: 
[https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/dataflow/dataflow_metrics.py#L97]

 

But with portable job submission, Dataflow uses the PTransform ID (from the 
proto pipeline) as the internal step name, so the metrics processor should be 
updated to handle this.
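The required change can be sketched as a two-way lookup; this is an illustrative toy, not the actual `dataflow_metrics.py` logic, and all names in it are assumptions:

```python
# Illustrative sketch: resolve a metric's step name whether the Dataflow
# service reported a legacy v1beta3 internal step name ("s1", "s2", ...) or,
# under portable job submission, the PTransform ID from the proto pipeline.
def resolve_transform(step_name, legacy_step_to_transform, transform_ids):
    if step_name in legacy_step_to_transform:   # v1beta3-style internal name
        return legacy_step_to_transform[step_name]
    if step_name in transform_ids:              # portable: already a PTransform ID
        return step_name
    raise KeyError(f"Unknown step name: {step_name}")

legacy = {"s1": "ReadFromPubSub"}
ids = {"ReadFromPubSub", "WindowInto"}
print(resolve_transform("s1", legacy, ids))          # -> ReadFromPubSub
print(resolve_transform("WindowInto", legacy, ids))  # -> WindowInto
```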





[jira] [Updated] (BEAM-11033) Update Dataflow metrics processor to handle portable jobs

2020-10-10 Thread Chamikara Madhusanka Jayalath (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-11033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chamikara Madhusanka Jayalath updated BEAM-11033:
-
Status: Resolved  (was: Open)

> Update Dataflow metrics processor to handle portable jobs
> -
>
> Key: BEAM-11033
> URL: https://issues.apache.org/jira/browse/BEAM-11033
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Chamikara Madhusanka Jayalath
>Assignee: Chamikara Madhusanka Jayalath
>Priority: P2
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> Currently, the Dataflow metrics processor expects the Dataflow-internal step 
> names generated for the v1beta3 job description in the metrics returned by 
> the Dataflow service: 
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/dataflow/dataflow_metrics.py#L97]
>  
> But with portable job submission, Dataflow uses the PTransform ID (from the 
> proto pipeline) as the internal step name, so the metrics processor should 
> be updated to handle this.





[jira] [Commented] (BEAM-10904) Set sdk_harness_container_images property for all Dataflow Runner v2 jobs

2020-10-16 Thread Chamikara Madhusanka Jayalath (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-10904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17215536#comment-17215536
 ] 

Chamikara Madhusanka Jayalath commented on BEAM-10904:
--

This was done in https://github.com/apache/beam/pull/12853.

> Set sdk_harness_container_images property for all Dataflow Runner v2 jobs
> -
>
> Key: BEAM-10904
> URL: https://issues.apache.org/jira/browse/BEAM-10904
> Project: Beam
>  Issue Type: Bug
>  Components: runner-dataflow
>Reporter: Chamikara Madhusanka Jayalath
>Assignee: Chamikara Madhusanka Jayalath
>Priority: P2
>  Labels: stale-assigned
>
> Currently this is only set for x-lang jobs.
> We can set this for all jobs now that the corresponding Dataflow API is stable.
> This will make container setup for x-lang and non-x-lang jobs for runner V2 
> behave in a similar manner.





[jira] [Updated] (BEAM-10904) Set sdk_harness_container_images property for all Dataflow Runner v2 jobs

2020-10-16 Thread Chamikara Madhusanka Jayalath (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-10904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chamikara Madhusanka Jayalath updated BEAM-10904:
-
Status: Resolved  (was: Open)

> Set sdk_harness_container_images property for all Dataflow Runner v2 jobs
> -
>
> Key: BEAM-10904
> URL: https://issues.apache.org/jira/browse/BEAM-10904
> Project: Beam
>  Issue Type: Bug
>  Components: runner-dataflow
>Reporter: Chamikara Madhusanka Jayalath
>Assignee: Chamikara Madhusanka Jayalath
>Priority: P2
>  Labels: stale-assigned
>
> Currently this is only set for x-lang jobs.
> We can set this for all jobs now that the corresponding Dataflow API is stable.
> This will make container setup for x-lang and non-x-lang jobs for runner V2 
> behave in a similar manner.





[jira] [Assigned] (BEAM-11002) XmlIO buffer overflow exception

2020-10-20 Thread Chamikara Madhusanka Jayalath (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-11002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chamikara Madhusanka Jayalath reassigned BEAM-11002:


Assignee: Chamikara Madhusanka Jayalath

> XmlIO buffer overflow exception 
> 
>
> Key: BEAM-11002
> URL: https://issues.apache.org/jira/browse/BEAM-11002
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Affects Versions: 2.23.0, 2.24.0
>Reporter: Duncan Lew
>Assignee: Chamikara Madhusanka Jayalath
>Priority: P0
>  Labels: Clarified
>
> We're making use of Apache Beam in Google Dataflow.
> We're using XmlIO to read in an XML file with the following setup:
> {code:java}
> pipeline
> .apply("Read Storage Bucket",
> XmlIO.read()
> .from(sourcePath)
> .withRootElement(xmlProductRoot)
> .withRecordElement(xmlProductRecord)
> .withRecordClass(XmlProduct::class.java)
> )
> {code}
> However, from time to time, we're getting a buffer overflow exception when 
> reading random XML files:
> {code:java}
> "Error message from worker: java.io.IOException: Failed to start reading from 
> source: gs://path-to-xml-file.xml range [1722550, 2684411)
>   
> org.apache.beam.runners.dataflow.worker.WorkerCustomSources$BoundedReaderIterator.start(WorkerCustomSources.java:610)
>   
> org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation$SynchronizedReaderIterator.start(ReadOperation.java:359)
>   
> org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:194)
>   
> org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:159)
>   
> org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:77)
>   
> org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.executeWork(BatchDataflowWorker.java:417)
>   
> org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.doWork(BatchDataflowWorker.java:386)
>   
> org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.getAndPerformWork(BatchDataflowWorker.java:311)
>   
> org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.doWork(DataflowBatchWorkerHarness.java:140)
>   
> org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:120)
>   
> org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:107)
>   java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>   
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>   
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   java.base/java.lang.Thread.run(Thread.java:834)
> Caused by: java.nio.BufferOverflowException
>   java.base/java.nio.Buffer.nextPutIndex(Buffer.java:662)
>   java.base/java.nio.HeapByteBuffer.put(HeapByteBuffer.java:196)
>   
> org.apache.beam.sdk.io.xml.XmlSource$XMLReader.getFirstOccurenceOfRecordElement(XmlSource.java:285)
>   
> org.apache.beam.sdk.io.xml.XmlSource$XMLReader.startReading(XmlSource.java:192)
>   
> org.apache.beam.sdk.io.FileBasedSource$FileBasedReader.startImpl(FileBasedSource.java:476)
>   
> org.apache.beam.sdk.io.OffsetBasedSource$OffsetBasedReader.start(OffsetBasedSource.java:249)
>   
> org.apache.beam.runners.dataflow.worker.WorkerCustomSources$BoundedReaderIterator.start(WorkerCustomSources.java:607)
>   ... 14 more
> {code}
> We can't reproduce this buffer overflow exception locally with the 
> DirectRunner. If we rerun the Dataflow job in Google Cloud, it can run 
> correctly without any exceptions.





[jira] [Commented] (BEAM-11002) XmlIO buffer overflow exception

2020-10-20 Thread Chamikara Madhusanka Jayalath (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-11002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17217662#comment-17217662
 ] 

Chamikara Madhusanka Jayalath commented on BEAM-11002:
--

This should not be a P0 unless something was broken between releases.

Is there an input that can be used to reproduce this?
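For readers unfamiliar with `java.nio.BufferOverflowException` (seen in the quoted stack trace), it is thrown when a write exceeds a fixed-capacity buffer. A loosely analogous demonstration in Python, using `struct.pack_into` (which raises `struct.error` under the same condition) — this only illustrates the failure mode, not XmlIO's actual internals:

```python
# Loosely analogous demonstration of the failure mode in the quoted stack
# trace: writing past the capacity of a fixed-size buffer. Java's
# ByteBuffer.put throws BufferOverflowException; struct.pack_into raises
# struct.error when the target buffer is too small for the value.
import struct

buf = bytearray(4)
struct.pack_into(">I", buf, 0, 42)      # 4 bytes into a 4-byte buffer: fits
try:
    struct.pack_into(">Q", buf, 0, 42)  # 8 bytes into a 4-byte buffer: overflows
except struct.error as e:
    print("overflow:", e)
```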

> XmlIO buffer overflow exception 
> 
>
> Key: BEAM-11002
> URL: https://issues.apache.org/jira/browse/BEAM-11002
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Affects Versions: 2.23.0, 2.24.0
>Reporter: Duncan Lew
>Priority: P0
>  Labels: Clarified
>
> We're making use of Apache Beam in Google Dataflow.
> We're using XmlIO to read in an XML file with the following setup:
> {code:java}
> pipeline
> .apply("Read Storage Bucket",
> XmlIO.read()
> .from(sourcePath)
> .withRootElement(xmlProductRoot)
> .withRecordElement(xmlProductRecord)
> .withRecordClass(XmlProduct::class.java)
> )
> {code}
> However, from time to time, we're getting a buffer overflow exception when 
> reading random XML files:
> {code:java}
> "Error message from worker: java.io.IOException: Failed to start reading from 
> source: gs://path-to-xml-file.xml range [1722550, 2684411)
>   
> org.apache.beam.runners.dataflow.worker.WorkerCustomSources$BoundedReaderIterator.start(WorkerCustomSources.java:610)
>   
> org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation$SynchronizedReaderIterator.start(ReadOperation.java:359)
>   
> org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:194)
>   
> org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:159)
>   
> org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:77)
>   
> org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.executeWork(BatchDataflowWorker.java:417)
>   
> org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.doWork(BatchDataflowWorker.java:386)
>   
> org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.getAndPerformWork(BatchDataflowWorker.java:311)
>   
> org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.doWork(DataflowBatchWorkerHarness.java:140)
>   
> org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:120)
>   
> org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:107)
>   java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>   
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>   
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   java.base/java.lang.Thread.run(Thread.java:834)
> Caused by: java.nio.BufferOverflowException
>   java.base/java.nio.Buffer.nextPutIndex(Buffer.java:662)
>   java.base/java.nio.HeapByteBuffer.put(HeapByteBuffer.java:196)
>   
> org.apache.beam.sdk.io.xml.XmlSource$XMLReader.getFirstOccurenceOfRecordElement(XmlSource.java:285)
>   
> org.apache.beam.sdk.io.xml.XmlSource$XMLReader.startReading(XmlSource.java:192)
>   
> org.apache.beam.sdk.io.FileBasedSource$FileBasedReader.startImpl(FileBasedSource.java:476)
>   
> org.apache.beam.sdk.io.OffsetBasedSource$OffsetBasedReader.start(OffsetBasedSource.java:249)
>   
> org.apache.beam.runners.dataflow.worker.WorkerCustomSources$BoundedReaderIterator.start(WorkerCustomSources.java:607)
>   ... 14 more
> {code}
> We can't reproduce this buffer overflow exception locally with the 
> DirectRunner. If we rerun the Dataflow job in Google Cloud, it can run 
> correctly without any exceptions.





[jira] [Updated] (BEAM-11002) XmlIO buffer overflow exception

2020-10-20 Thread Chamikara Madhusanka Jayalath (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-11002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chamikara Madhusanka Jayalath updated BEAM-11002:
-
Priority: P1  (was: P0)

> XmlIO buffer overflow exception 
> 
>
> Key: BEAM-11002
> URL: https://issues.apache.org/jira/browse/BEAM-11002
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Affects Versions: 2.23.0, 2.24.0
>Reporter: Duncan Lew
>Assignee: Chamikara Madhusanka Jayalath
>Priority: P1
>  Labels: Clarified
>
> We're making use of Apache Beam in Google Dataflow.
> We're using XmlIO to read in an XML file with the following setup:
> {code:java}
> pipeline
> .apply("Read Storage Bucket",
> XmlIO.read()
> .from(sourcePath)
> .withRootElement(xmlProductRoot)
> .withRecordElement(xmlProductRecord)
> .withRecordClass(XmlProduct::class.java)
> )
> {code}
> However, from time to time, we're getting a buffer overflow exception when 
> reading random XML files:
> {code:java}
> "Error message from worker: java.io.IOException: Failed to start reading from 
> source: gs://path-to-xml-file.xml range [1722550, 2684411)
>   
> org.apache.beam.runners.dataflow.worker.WorkerCustomSources$BoundedReaderIterator.start(WorkerCustomSources.java:610)
>   
> org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation$SynchronizedReaderIterator.start(ReadOperation.java:359)
>   
> org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:194)
>   
> org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:159)
>   
> org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:77)
>   
> org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.executeWork(BatchDataflowWorker.java:417)
>   
> org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.doWork(BatchDataflowWorker.java:386)
>   
> org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.getAndPerformWork(BatchDataflowWorker.java:311)
>   
> org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.doWork(DataflowBatchWorkerHarness.java:140)
>   
> org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:120)
>   
> org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:107)
>   java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>   
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>   
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   java.base/java.lang.Thread.run(Thread.java:834)
> Caused by: java.nio.BufferOverflowException
>   java.base/java.nio.Buffer.nextPutIndex(Buffer.java:662)
>   java.base/java.nio.HeapByteBuffer.put(HeapByteBuffer.java:196)
>   
> org.apache.beam.sdk.io.xml.XmlSource$XMLReader.getFirstOccurenceOfRecordElement(XmlSource.java:285)
>   
> org.apache.beam.sdk.io.xml.XmlSource$XMLReader.startReading(XmlSource.java:192)
>   
> org.apache.beam.sdk.io.FileBasedSource$FileBasedReader.startImpl(FileBasedSource.java:476)
>   
> org.apache.beam.sdk.io.OffsetBasedSource$OffsetBasedReader.start(OffsetBasedSource.java:249)
>   
> org.apache.beam.runners.dataflow.worker.WorkerCustomSources$BoundedReaderIterator.start(WorkerCustomSources.java:607)
>   ... 14 more
> {code}
> We can't reproduce this buffer overflow exception locally with the 
> DirectRunner. If we rerun the Dataflow job in Google Cloud, it can run 
> correctly without any exceptions.





[jira] [Updated] (BEAM-11002) XmlIO buffer overflow exception

2020-10-20 Thread Chamikara Madhusanka Jayalath (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-11002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chamikara Madhusanka Jayalath updated BEAM-11002:
-
Component/s: (was: sdk-java-core)
 io-java-xml

> XmlIO buffer overflow exception 
> 
>
> Key: BEAM-11002
> URL: https://issues.apache.org/jira/browse/BEAM-11002
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-xml
>Affects Versions: 2.23.0, 2.24.0
>Reporter: Duncan Lew
>Assignee: Chamikara Madhusanka Jayalath
>Priority: P1
>  Labels: Clarified
>
> We're making use of Apache Beam in Google Dataflow.
> We're using XmlIO to read in an XML file with the following setup:
> {code:java}
> pipeline
> .apply("Read Storage Bucket",
> XmlIO.read()
> .from(sourcePath)
> .withRootElement(xmlProductRoot)
> .withRecordElement(xmlProductRecord)
> .withRecordClass(XmlProduct::class.java)
> )
> {code}
> However, from time to time, we're getting a buffer overflow exception when 
> reading random XML files:
> {code:java}
> "Error message from worker: java.io.IOException: Failed to start reading from 
> source: gs://path-to-xml-file.xml range [1722550, 2684411)
>   
> org.apache.beam.runners.dataflow.worker.WorkerCustomSources$BoundedReaderIterator.start(WorkerCustomSources.java:610)
>   
> org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation$SynchronizedReaderIterator.start(ReadOperation.java:359)
>   
> org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:194)
>   
> org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:159)
>   
> org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:77)
>   
> org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.executeWork(BatchDataflowWorker.java:417)
>   
> org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.doWork(BatchDataflowWorker.java:386)
>   
> org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.getAndPerformWork(BatchDataflowWorker.java:311)
>   
> org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.doWork(DataflowBatchWorkerHarness.java:140)
>   
> org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:120)
>   
> org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:107)
>   java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>   
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>   
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   java.base/java.lang.Thread.run(Thread.java:834)
> Caused by: java.nio.BufferOverflowException
>   java.base/java.nio.Buffer.nextPutIndex(Buffer.java:662)
>   java.base/java.nio.HeapByteBuffer.put(HeapByteBuffer.java:196)
>   
> org.apache.beam.sdk.io.xml.XmlSource$XMLReader.getFirstOccurenceOfRecordElement(XmlSource.java:285)
>   
> org.apache.beam.sdk.io.xml.XmlSource$XMLReader.startReading(XmlSource.java:192)
>   
> org.apache.beam.sdk.io.FileBasedSource$FileBasedReader.startImpl(FileBasedSource.java:476)
>   
> org.apache.beam.sdk.io.OffsetBasedSource$OffsetBasedReader.start(OffsetBasedSource.java:249)
>   
> org.apache.beam.runners.dataflow.worker.WorkerCustomSources$BoundedReaderIterator.start(WorkerCustomSources.java:607)
>   ... 14 more
> {code}
> We can't reproduce this buffer overflow exception locally with the 
> DirectRunner. If we rerun the Dataflow job in Google Cloud, it can run 
> correctly without any exceptions.





[jira] [Commented] (BEAM-11002) XmlIO buffer overflow exception

2020-10-20 Thread Chamikara Madhusanka Jayalath (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-11002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17217693#comment-17217693
 ] 

Chamikara Madhusanka Jayalath commented on BEAM-11002:
--

Based on the commit history, it doesn't look like anything released in XmlIO 
changed recently: 
https://github.com/apache/beam/commits/master/sdks/java/io/xml/src/main/java/org/apache/beam/sdk/io/xml/XmlIO.java

> XmlIO buffer overflow exception 
> 
>
> Key: BEAM-11002
> URL: https://issues.apache.org/jira/browse/BEAM-11002
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-xml
>Affects Versions: 2.23.0, 2.24.0
>Reporter: Duncan Lew
>Assignee: Chamikara Madhusanka Jayalath
>Priority: P1
>  Labels: Clarified
>
> We're making use of Apache Beam in Google Dataflow.
> We're using XmlIO to read in an XML file with the following setup:
> {code:java}
> pipeline
> .apply("Read Storage Bucket",
> XmlIO.read()
> .from(sourcePath)
> .withRootElement(xmlProductRoot)
> .withRecordElement(xmlProductRecord)
> .withRecordClass(XmlProduct::class.java)
> )
> {code}
> However, from time to time, we're getting a buffer overflow exception when 
> reading random XML files:
> {code:java}
> "Error message from worker: java.io.IOException: Failed to start reading from 
> source: gs://path-to-xml-file.xml range [1722550, 2684411)
>   
> org.apache.beam.runners.dataflow.worker.WorkerCustomSources$BoundedReaderIterator.start(WorkerCustomSources.java:610)
>   
> org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation$SynchronizedReaderIterator.start(ReadOperation.java:359)
>   
> org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:194)
>   
> org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:159)
>   
> org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:77)
>   
> org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.executeWork(BatchDataflowWorker.java:417)
>   
> org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.doWork(BatchDataflowWorker.java:386)
>   
> org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.getAndPerformWork(BatchDataflowWorker.java:311)
>   
> org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.doWork(DataflowBatchWorkerHarness.java:140)
>   
> org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:120)
>   
> org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:107)
>   java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>   
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>   
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   java.base/java.lang.Thread.run(Thread.java:834)
> Caused by: java.nio.BufferOverflowException
>   java.base/java.nio.Buffer.nextPutIndex(Buffer.java:662)
>   java.base/java.nio.HeapByteBuffer.put(HeapByteBuffer.java:196)
>   
> org.apache.beam.sdk.io.xml.XmlSource$XMLReader.getFirstOccurenceOfRecordElement(XmlSource.java:285)
>   
> org.apache.beam.sdk.io.xml.XmlSource$XMLReader.startReading(XmlSource.java:192)
>   
> org.apache.beam.sdk.io.FileBasedSource$FileBasedReader.startImpl(FileBasedSource.java:476)
>   
> org.apache.beam.sdk.io.OffsetBasedSource$OffsetBasedReader.start(OffsetBasedSource.java:249)
>   
> org.apache.beam.runners.dataflow.worker.WorkerCustomSources$BoundedReaderIterator.start(WorkerCustomSources.java:607)
>   ... 14 more
> {code}
> We can't reproduce this buffer overflow exception locally with the 
> DirectRunner. If we rerun the Dataflow job in Google Cloud, it can run 
> correctly without any exceptions.





[jira] [Commented] (BEAM-10967) adding validate runner for Dataflow runner v2 to Java SDK

2020-10-20 Thread Chamikara Madhusanka Jayalath (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-10967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17217858#comment-17217858
 ] 

Chamikara Madhusanka Jayalath commented on BEAM-10967:
--

What's the next step here? The test suite does not seem to be running: 
https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_V2/

> adding validate runner for Dataflow runner v2 to Java SDK
> -
>
> Key: BEAM-10967
> URL: https://issues.apache.org/jira/browse/BEAM-10967
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-dataflow, sdk-java-harness
>Reporter: Heejong Lee
>Assignee: Heejong Lee
>Priority: P2
>  Labels: Clarified
>  Time Spent: 4h 40m
>  Remaining Estimate: 0h
>
> adding validate runner for Dataflow runner v2 to Java SDK (Java harness)





[jira] [Commented] (BEAM-10967) adding validate runner for Dataflow runner v2 to Java SDK

2020-10-20 Thread Chamikara Madhusanka Jayalath (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-10967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17217891#comment-17217891
 ] 

Chamikara Madhusanka Jayalath commented on BEAM-10967:
--

Thanks. cc: [~kenn] [~tysonjh]

> adding validate runner for Dataflow runner v2 to Java SDK
> -
>
> Key: BEAM-10967
> URL: https://issues.apache.org/jira/browse/BEAM-10967
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-dataflow, sdk-java-harness
>Reporter: Heejong Lee
>Assignee: Heejong Lee
>Priority: P2
>  Labels: Clarified
>  Time Spent: 4h 40m
>  Remaining Estimate: 0h
>
> adding validate runner for Dataflow runner v2 to Java SDK (Java harness)





[jira] [Commented] (BEAM-10967) adding validate runner for Dataflow runner v2 to Java SDK

2020-10-20 Thread Chamikara Madhusanka Jayalath (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-10967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17217951#comment-17217951
 ] 

Chamikara Madhusanka Jayalath commented on BEAM-10967:
--

Assigning to Tyson to disable the new/failing tests here and resolve this when 
appropriate.

> adding validate runner for Dataflow runner v2 to Java SDK
> -
>
> Key: BEAM-10967
> URL: https://issues.apache.org/jira/browse/BEAM-10967
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-dataflow, sdk-java-harness
>Reporter: Heejong Lee
>Assignee: Heejong Lee
>Priority: P2
>  Labels: Clarified
>  Time Spent: 4h 40m
>  Remaining Estimate: 0h
>
> adding validate runner for Dataflow runner v2 to Java SDK (Java harness)





[jira] [Assigned] (BEAM-10967) adding validate runner for Dataflow runner v2 to Java SDK

2020-10-20 Thread Chamikara Madhusanka Jayalath (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-10967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chamikara Madhusanka Jayalath reassigned BEAM-10967:


Assignee: Tyson Hamilton  (was: Heejong Lee)

> adding validate runner for Dataflow runner v2 to Java SDK
> -
>
> Key: BEAM-10967
> URL: https://issues.apache.org/jira/browse/BEAM-10967
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-dataflow, sdk-java-harness
>Reporter: Heejong Lee
>Assignee: Tyson Hamilton
>Priority: P2
>  Labels: Clarified
>  Time Spent: 4h 40m
>  Remaining Estimate: 0h
>
> adding validate runner for Dataflow runner v2 to Java SDK (Java harness)





[jira] [Created] (BEAM-10776) Unwanted JDK jars staged when running cross-language pipelines

2020-08-20 Thread Chamikara Madhusanka Jayalath (Jira)
Chamikara Madhusanka Jayalath created BEAM-10776:


 Summary: Unwanted JDK jars staged when running cross-language 
pipelines
 Key: BEAM-10776
 URL: https://issues.apache.org/jira/browse/BEAM-10776
 Project: Beam
  Issue Type: Bug
  Components: cross-language
Reporter: Chamikara Madhusanka Jayalath


When running cross-language Kafka on Dataflow I see following jars being staged.

INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to 
gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/nashorn-BJZNQ7N8Lsfq-WSM0IMsRCwFMC3RIxBOEjrlB1YwKOw.jar...

INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to 
gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/nashorn-BJZNQ7N8Lsfq-WSM0IMsRCwFMC3RIxBOEjrlB1YwKOw.jar
 in 40 seconds.

INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to 
gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/cldrdata-aZ6XIS6LfPilqVFbS_bWm1wMWGm3jxtjh0vjlRuqp5M.jar...

INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to 
gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/cldrdata-aZ6XIS6LfPilqVFbS_bWm1wMWGm3jxtjh0vjlRuqp5M.jar
 in 177 seconds.

INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to 
gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/jfxrt-B2UJQqvuEI-15FPV1mcdw80YRUIDMg1Kr82FxWK_DZ8.jar...

INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to 
gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/jfxrt-B2UJQqvuEI-15FPV1mcdw80YRUIDMg1Kr82FxWK_DZ8.jar
 in 285 seconds.

INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to 
gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/dnsns-zNxWyUaaHIkUFJRt-aNZudjc3eroySNUeRkxdxidGbY.jar...

INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to 
gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/dnsns-zNxWyUaaHIkUFJRt-aNZudjc3eroySNUeRkxdxidGbY.jar
 in 0 seconds.

INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to 
gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/localedata-Wt0bN9j6XmIH4BaRLouHZX6p6iIoQsbZ2AkomxZTOYM.jar...

INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to 
gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/localedata-Wt0bN9j6XmIH4BaRLouHZX6p6iIoQsbZ2AkomxZTOYM.jar
 in 16 seconds.

INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to 
gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/jaccess-5wlKULhaKWM_gmKVtH_QBwVqH4awlxxRdNNfz0z0Imw.jar...

INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to 
gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/jaccess-5wlKULhaKWM_gmKVtH_QBwVqH4awlxxRdNNfz0z0Imw.jar
 in 0 seconds.

INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to 
gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/MRJToolkit-jU5qhDBc0cNjn7g3yrGHYO78BRC09T-sE8Syqo9mRjg.jar...

INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to 
gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/MRJToolkit-jU5qhDBc0cNjn7g3yrGHYO78BRC09T-sE8Syqo9mRjg.jar
 in 0 seconds.

INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to 
gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/beam-sdks-java-io-expansion-service-2.24.0-SNAPSHOT-A94br32q87Prj7b_mG4_kPEdz9NSJ-0NwgHWEwwU4Qc.jar...

 

Out of these we just need 
'beam-sdks-java-io-expansion-service-2.24.0-SNAPSHOT-A94br32q87Prj7b_mG4_kPEdz9NSJ-0NwgHWEwwU4Qc.jar'.
 The rest seem to be there because we include all jars from the classpath in the 
expansion service response.

 

[https://github.com/apache/beam/blob/master/sdks/java/expansion-service/src/main/java/org/apache/beam/sdk/expansion/service/ExpansionService.java#L407]

 

We should figure out a way to filter out these additional jars.
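One possible direction, as a minimal sketch only (this is not the actual Beam fix, and `JarFilter`, `isJdkJar`, and `filterClasspath` are hypothetical names): before returning dependency artifacts from the expansion service, drop classpath entries that live under the JVM's `java.home`, since those ship with the JDK installation rather than with the pipeline.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: filter out jars that ship with the JDK/JRE before
// listing classpath dependencies in an expansion service response.
public class JarFilter {

  // A jar under java.home (e.g. .../jre/lib/ext/nashorn.jar) is part of the
  // JDK installation, not a pipeline dependency.
  static boolean isJdkJar(String path, String javaHome) {
    return path.startsWith(javaHome);
  }

  // Keep only .jar entries that do not come from the JDK installation.
  static List<String> filterClasspath(List<String> entries, String javaHome) {
    return entries.stream()
        .filter(p -> p.endsWith(".jar"))
        .filter(p -> !isJdkJar(p, javaHome))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    String javaHome = "/Library/Java/Home/jre"; // example value on a Mac
    List<String> classpath = Arrays.asList(
        javaHome + "/lib/ext/nashorn.jar",
        javaHome + "/lib/ext/cldrdata.jar",
        "/workspace/beam-sdks-java-io-expansion-service-2.24.0-SNAPSHOT.jar");
    // Only the Beam expansion service jar survives the filter.
    System.out.println(filterClasspath(classpath, javaHome));
  }
}
```

A real fix would likely need more care (symlinked JRE paths, or listing only the artifacts explicitly handed to the expansion service), but a path-prefix check along these lines would already exclude jars such as nashorn and cldrdata on Mac.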





[jira] [Commented] (BEAM-10776) Unwanted JDK jars staged when running cross-language pipelines

2020-08-20 Thread Chamikara Madhusanka Jayalath (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-10776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181446#comment-17181446
 ] 

Chamikara Madhusanka Jayalath commented on BEAM-10776:
--

cc: [~heejong] [~robertwb] [~lcwik] any ideas on a way to determine the correct 
set of jars to stage without pulling in additional jars from the classpath?

> Unwanted JDK jars staged when running cross-language pipelines
> --
>
> Key: BEAM-10776
> URL: https://issues.apache.org/jira/browse/BEAM-10776
> Project: Beam
>  Issue Type: Bug
>  Components: cross-language
>Reporter: Chamikara Madhusanka Jayalath
>Priority: P2
>
> When running cross-language Kafka on Dataflow I see following jars being 
> staged.
> INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/nashorn-BJZNQ7N8Lsfq-WSM0IMsRCwFMC3RIxBOEjrlB1YwKOw.jar...
> INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/nashorn-BJZNQ7N8Lsfq-WSM0IMsRCwFMC3RIxBOEjrlB1YwKOw.jar
>  in 40 seconds.
> INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/cldrdata-aZ6XIS6LfPilqVFbS_bWm1wMWGm3jxtjh0vjlRuqp5M.jar...
> INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/cldrdata-aZ6XIS6LfPilqVFbS_bWm1wMWGm3jxtjh0vjlRuqp5M.jar
>  in 177 seconds.
> INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/jfxrt-B2UJQqvuEI-15FPV1mcdw80YRUIDMg1Kr82FxWK_DZ8.jar...
> INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/jfxrt-B2UJQqvuEI-15FPV1mcdw80YRUIDMg1Kr82FxWK_DZ8.jar
>  in 285 seconds.
> INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/dnsns-zNxWyUaaHIkUFJRt-aNZudjc3eroySNUeRkxdxidGbY.jar...
> INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/dnsns-zNxWyUaaHIkUFJRt-aNZudjc3eroySNUeRkxdxidGbY.jar
>  in 0 seconds.
> INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/localedata-Wt0bN9j6XmIH4BaRLouHZX6p6iIoQsbZ2AkomxZTOYM.jar...
> INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/localedata-Wt0bN9j6XmIH4BaRLouHZX6p6iIoQsbZ2AkomxZTOYM.jar
>  in 16 seconds.
> INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/jaccess-5wlKULhaKWM_gmKVtH_QBwVqH4awlxxRdNNfz0z0Imw.jar...
> INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/jaccess-5wlKULhaKWM_gmKVtH_QBwVqH4awlxxRdNNfz0z0Imw.jar
>  in 0 seconds.
> INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/MRJToolkit-jU5qhDBc0cNjn7g3yrGHYO78BRC09T-sE8Syqo9mRjg.jar...
> INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/MRJToolkit-jU5qhDBc0cNjn7g3yrGHYO78BRC09T-sE8Syqo9mRjg.jar
>  in 0 seconds.
> INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/beam-sdks-java-io-expansion-service-2.24.0-SNAPSHOT-A94br32q87Prj7b_mG4_kPEdz9NSJ-0NwgHWEwwU4Qc.jar...
>  
> Out of these we just need 
> 'beam-sdks-java-io-expansion-service-2.24.0-SNAPSHOT-A94br32q87Prj7b_mG4_kPEdz9NSJ-0NwgHWEwwU4Qc.jar'.
>  Rest seems to be due to us including all jars from classpath in the 
> expansion service response.
>  
> [https://github.com/apache/beam/blob/master/sdks/java/expansion-service/src/main/java/org/apache/beam/sdk/expansion/service/ExpansionService.java#L407]
>  
> We should figure out a way to filter out these additional jars.





[jira] [Commented] (BEAM-10698) SDFs broken for Dataflow runner v2 due to timestamps being out of bound

2020-08-21 Thread Chamikara Madhusanka Jayalath (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-10698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181966#comment-17181966
 ] 

Chamikara Madhusanka Jayalath commented on BEAM-10698:
--

Yes, this was resolved.

> SDFs broken for Dataflow runner v2 due to timestamps being out of bound
> ---
>
> Key: BEAM-10698
> URL: https://issues.apache.org/jira/browse/BEAM-10698
> Project: Beam
>  Issue Type: Bug
>  Components: cross-language, io-py-kafka, runner-dataflow
>Reporter: Chamikara Madhusanka Jayalath
>Assignee: Boyuan Zhang
>Priority: P0
> Fix For: 2.24.0
>
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> Seems like this is a regression introduced by 
> [https://github.com/apache/beam/pull/11749]
> I have a short term fix here: [https://github.com/apache/beam/pull/12557]
> We should either.
> (1) Include [https://github.com/apache/beam/pull/12557] in the release branch
> (2) Find the underlying issue that produces wrong timestamps and do that fix
> (3) Revert [https://github.com/apache/beam/pull/11749] from the 2.24.0 
> release branch.





[jira] [Updated] (BEAM-10698) SDFs broken for Dataflow runner v2 due to timestamps being out of bound

2020-08-21 Thread Chamikara Madhusanka Jayalath (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-10698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chamikara Madhusanka Jayalath updated BEAM-10698:
-
Status: Resolved  (was: Triage Needed)

> SDFs broken for Dataflow runner v2 due to timestamps being out of bound
> ---
>
> Key: BEAM-10698
> URL: https://issues.apache.org/jira/browse/BEAM-10698
> Project: Beam
>  Issue Type: Bug
>  Components: cross-language, io-py-kafka, runner-dataflow
>Reporter: Chamikara Madhusanka Jayalath
>Assignee: Boyuan Zhang
>Priority: P0
> Fix For: 2.24.0
>
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> Seems like this is a regression introduced by 
> [https://github.com/apache/beam/pull/11749]
> I have a short term fix here: [https://github.com/apache/beam/pull/12557]
> We should either.
> (1) Include [https://github.com/apache/beam/pull/12557] in the release branch
> (2) Find the underlying issue that produces wrong timestamps and do that fix
> (3) Revert [https://github.com/apache/beam/pull/11749] from the 2.24.0 
> release branch.





[jira] [Commented] (BEAM-10776) Unwanted JDK jars staged when running cross-language pipelines

2020-08-21 Thread Chamikara Madhusanka Jayalath (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-10776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17182064#comment-17182064
 ] 

Chamikara Madhusanka Jayalath commented on BEAM-10776:
--

Thanks Luke. The issue might be Mac-specific. On Linux I only see the 
'beam-sdks-java-io-expansion-service-2.24.0-SNAPSHOT-' jar being staged, as 
expected.

 

I'm using JDK 8. The expansion service is started as a separate process.

> Unwanted JDK jars staged when running cross-language pipelines
> --
>
> Key: BEAM-10776
> URL: https://issues.apache.org/jira/browse/BEAM-10776
> Project: Beam
>  Issue Type: Bug
>  Components: cross-language
>Reporter: Chamikara Madhusanka Jayalath
>Priority: P2
>
> When running cross-language Kafka on Dataflow I see following jars being 
> staged.
> INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/nashorn-BJZNQ7N8Lsfq-WSM0IMsRCwFMC3RIxBOEjrlB1YwKOw.jar...
> INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/nashorn-BJZNQ7N8Lsfq-WSM0IMsRCwFMC3RIxBOEjrlB1YwKOw.jar
>  in 40 seconds.
> INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/cldrdata-aZ6XIS6LfPilqVFbS_bWm1wMWGm3jxtjh0vjlRuqp5M.jar...
> INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/cldrdata-aZ6XIS6LfPilqVFbS_bWm1wMWGm3jxtjh0vjlRuqp5M.jar
>  in 177 seconds.
> INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/jfxrt-B2UJQqvuEI-15FPV1mcdw80YRUIDMg1Kr82FxWK_DZ8.jar...
> INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/jfxrt-B2UJQqvuEI-15FPV1mcdw80YRUIDMg1Kr82FxWK_DZ8.jar
>  in 285 seconds.
> INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/dnsns-zNxWyUaaHIkUFJRt-aNZudjc3eroySNUeRkxdxidGbY.jar...
> INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/dnsns-zNxWyUaaHIkUFJRt-aNZudjc3eroySNUeRkxdxidGbY.jar
>  in 0 seconds.
> INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/localedata-Wt0bN9j6XmIH4BaRLouHZX6p6iIoQsbZ2AkomxZTOYM.jar...
> INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/localedata-Wt0bN9j6XmIH4BaRLouHZX6p6iIoQsbZ2AkomxZTOYM.jar
>  in 16 seconds.
> INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/jaccess-5wlKULhaKWM_gmKVtH_QBwVqH4awlxxRdNNfz0z0Imw.jar...
> INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/jaccess-5wlKULhaKWM_gmKVtH_QBwVqH4awlxxRdNNfz0z0Imw.jar
>  in 0 seconds.
> INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/MRJToolkit-jU5qhDBc0cNjn7g3yrGHYO78BRC09T-sE8Syqo9mRjg.jar...
> INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/MRJToolkit-jU5qhDBc0cNjn7g3yrGHYO78BRC09T-sE8Syqo9mRjg.jar
>  in 0 seconds.
> INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/beam-sdks-java-io-expansion-service-2.24.0-SNAPSHOT-A94br32q87Prj7b_mG4_kPEdz9NSJ-0NwgHWEwwU4Qc.jar...
>  
> Out of these we just need 
> 'beam-sdks-java-io-expansion-service-2.24.0-SNAPSHOT-A94br32q87Prj7b_mG4_kPEdz9NSJ-0NwgHWEwwU4Qc.jar'.
>  Rest seems to be due to us including all jars from classpath in the 
> expansion service response.
>  
> [https://github.com/apache/beam/blob/master/sdks/java/expansion-service/src/main/java/org/apache/beam/sdk/expansion/service/ExpansionService.java#L407]
>  
> We should figure out a way to filter out these additional jars.





[jira] [Commented] (BEAM-10776) Unwanted JDK jars staged when running cross-language pipelines

2020-08-21 Thread Chamikara Madhusanka Jayalath (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-10776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17182198#comment-17182198
 ] 

Chamikara Madhusanka Jayalath commented on BEAM-10776:
--

I didn't run into any runtime issues, but it takes a long time to stage all the 
jars, so the user experience is pretty bad for Mac users, it seems.

> Unwanted JDK jars staged when running cross-language pipelines
> --
>
> Key: BEAM-10776
> URL: https://issues.apache.org/jira/browse/BEAM-10776
> Project: Beam
>  Issue Type: Bug
>  Components: cross-language
>Reporter: Chamikara Madhusanka Jayalath
>Priority: P2
>
> When running cross-language Kafka on Dataflow I see following jars being 
> staged.
> INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/nashorn-BJZNQ7N8Lsfq-WSM0IMsRCwFMC3RIxBOEjrlB1YwKOw.jar...
> INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/nashorn-BJZNQ7N8Lsfq-WSM0IMsRCwFMC3RIxBOEjrlB1YwKOw.jar
>  in 40 seconds.
> INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/cldrdata-aZ6XIS6LfPilqVFbS_bWm1wMWGm3jxtjh0vjlRuqp5M.jar...
> INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/cldrdata-aZ6XIS6LfPilqVFbS_bWm1wMWGm3jxtjh0vjlRuqp5M.jar
>  in 177 seconds.
> INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/jfxrt-B2UJQqvuEI-15FPV1mcdw80YRUIDMg1Kr82FxWK_DZ8.jar...
> INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/jfxrt-B2UJQqvuEI-15FPV1mcdw80YRUIDMg1Kr82FxWK_DZ8.jar
>  in 285 seconds.
> INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/dnsns-zNxWyUaaHIkUFJRt-aNZudjc3eroySNUeRkxdxidGbY.jar...
> INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/dnsns-zNxWyUaaHIkUFJRt-aNZudjc3eroySNUeRkxdxidGbY.jar
>  in 0 seconds.
> INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/localedata-Wt0bN9j6XmIH4BaRLouHZX6p6iIoQsbZ2AkomxZTOYM.jar...
> INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/localedata-Wt0bN9j6XmIH4BaRLouHZX6p6iIoQsbZ2AkomxZTOYM.jar
>  in 16 seconds.
> INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/jaccess-5wlKULhaKWM_gmKVtH_QBwVqH4awlxxRdNNfz0z0Imw.jar...
> INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/jaccess-5wlKULhaKWM_gmKVtH_QBwVqH4awlxxRdNNfz0z0Imw.jar
>  in 0 seconds.
> INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/MRJToolkit-jU5qhDBc0cNjn7g3yrGHYO78BRC09T-sE8Syqo9mRjg.jar...
> INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/MRJToolkit-jU5qhDBc0cNjn7g3yrGHYO78BRC09T-sE8Syqo9mRjg.jar
>  in 0 seconds.
> INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/beam-sdks-java-io-expansion-service-2.24.0-SNAPSHOT-A94br32q87Prj7b_mG4_kPEdz9NSJ-0NwgHWEwwU4Qc.jar...
>  
> Out of these we just need 
> 'beam-sdks-java-io-expansion-service-2.24.0-SNAPSHOT-A94br32q87Prj7b_mG4_kPEdz9NSJ-0NwgHWEwwU4Qc.jar'.
>  Rest seems to be due to us including all jars from classpath in the 
> expansion service response.
>  
> [https://github.com/apache/beam/blob/master/sdks/java/expansion-service/src/main/java/org/apache/beam/sdk/expansion/service/ExpansionService.java#L407]
>  
> We should figure out a way to filter out these additional jars.





[jira] [Commented] (BEAM-10776) Unwanted JDK jars staged when running cross-language pipelines

2020-08-21 Thread Chamikara Madhusanka Jayalath (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-10776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17182204#comment-17182204
 ] 

Chamikara Madhusanka Jayalath commented on BEAM-10776:
--

Does this mean that we always stage these additional jars for production 
Dataflow Java users on Mac today, or is that code path different somehow?

> Unwanted JDK jars staged when running cross-language pipelines
> --
>
> Key: BEAM-10776
> URL: https://issues.apache.org/jira/browse/BEAM-10776
> Project: Beam
>  Issue Type: Bug
>  Components: cross-language
>Reporter: Chamikara Madhusanka Jayalath
>Priority: P2
>
> When running cross-language Kafka on Dataflow I see following jars being 
> staged.
> INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/nashorn-BJZNQ7N8Lsfq-WSM0IMsRCwFMC3RIxBOEjrlB1YwKOw.jar...
> INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/nashorn-BJZNQ7N8Lsfq-WSM0IMsRCwFMC3RIxBOEjrlB1YwKOw.jar
>  in 40 seconds.
> INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/cldrdata-aZ6XIS6LfPilqVFbS_bWm1wMWGm3jxtjh0vjlRuqp5M.jar...
> INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/cldrdata-aZ6XIS6LfPilqVFbS_bWm1wMWGm3jxtjh0vjlRuqp5M.jar
>  in 177 seconds.
> INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/jfxrt-B2UJQqvuEI-15FPV1mcdw80YRUIDMg1Kr82FxWK_DZ8.jar...
> INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/jfxrt-B2UJQqvuEI-15FPV1mcdw80YRUIDMg1Kr82FxWK_DZ8.jar
>  in 285 seconds.
> INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/dnsns-zNxWyUaaHIkUFJRt-aNZudjc3eroySNUeRkxdxidGbY.jar...
> INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/dnsns-zNxWyUaaHIkUFJRt-aNZudjc3eroySNUeRkxdxidGbY.jar
>  in 0 seconds.
> INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/localedata-Wt0bN9j6XmIH4BaRLouHZX6p6iIoQsbZ2AkomxZTOYM.jar...
> INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/localedata-Wt0bN9j6XmIH4BaRLouHZX6p6iIoQsbZ2AkomxZTOYM.jar
>  in 16 seconds.
> INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/jaccess-5wlKULhaKWM_gmKVtH_QBwVqH4awlxxRdNNfz0z0Imw.jar...
> INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/jaccess-5wlKULhaKWM_gmKVtH_QBwVqH4awlxxRdNNfz0z0Imw.jar
>  in 0 seconds.
> INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/MRJToolkit-jU5qhDBc0cNjn7g3yrGHYO78BRC09T-sE8Syqo9mRjg.jar...
> INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/MRJToolkit-jU5qhDBc0cNjn7g3yrGHYO78BRC09T-sE8Syqo9mRjg.jar
>  in 0 seconds.
> INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/beam-sdks-java-io-expansion-service-2.24.0-SNAPSHOT-A94br32q87Prj7b_mG4_kPEdz9NSJ-0NwgHWEwwU4Qc.jar...
>  
> Out of these we just need 
> 'beam-sdks-java-io-expansion-service-2.24.0-SNAPSHOT-A94br32q87Prj7b_mG4_kPEdz9NSJ-0NwgHWEwwU4Qc.jar'.
>  Rest seems to be due to us including all jars from classpath in the 
> expansion service response.
>  
> [https://github.com/apache/beam/blob/master/sdks/java/expansion-service/src/main/java/org/apache/beam/sdk/expansion/service/ExpansionService.java#L407]
>  
> We should figure out a way to filter out these additional jars.





[jira] [Commented] (BEAM-10776) Unwanted JDK jars staged when running cross-language pipelines

2020-08-21 Thread Chamikara Madhusanka Jayalath (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-10776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17182205#comment-17182205
 ] 

Chamikara Madhusanka Jayalath commented on BEAM-10776:
--

(Yes, the additional jars are not required; they are just picked up from the 
CLASSPATH and staged unnecessarily.)

> Unwanted JDK jars staged when running cross-language pipelines
> --
>
> Key: BEAM-10776
> URL: https://issues.apache.org/jira/browse/BEAM-10776
> Project: Beam
>  Issue Type: Bug
>  Components: cross-language
>Reporter: Chamikara Madhusanka Jayalath
>Priority: P2
>
> When running cross-language Kafka on Dataflow I see following jars being 
> staged.
> INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/nashorn-BJZNQ7N8Lsfq-WSM0IMsRCwFMC3RIxBOEjrlB1YwKOw.jar...
> INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/nashorn-BJZNQ7N8Lsfq-WSM0IMsRCwFMC3RIxBOEjrlB1YwKOw.jar
>  in 40 seconds.
> INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/cldrdata-aZ6XIS6LfPilqVFbS_bWm1wMWGm3jxtjh0vjlRuqp5M.jar...
> INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/cldrdata-aZ6XIS6LfPilqVFbS_bWm1wMWGm3jxtjh0vjlRuqp5M.jar
>  in 177 seconds.
> INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/jfxrt-B2UJQqvuEI-15FPV1mcdw80YRUIDMg1Kr82FxWK_DZ8.jar...
> INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/jfxrt-B2UJQqvuEI-15FPV1mcdw80YRUIDMg1Kr82FxWK_DZ8.jar
>  in 285 seconds.
> INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/dnsns-zNxWyUaaHIkUFJRt-aNZudjc3eroySNUeRkxdxidGbY.jar...
> INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/dnsns-zNxWyUaaHIkUFJRt-aNZudjc3eroySNUeRkxdxidGbY.jar
>  in 0 seconds.
> INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/localedata-Wt0bN9j6XmIH4BaRLouHZX6p6iIoQsbZ2AkomxZTOYM.jar...
> INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/localedata-Wt0bN9j6XmIH4BaRLouHZX6p6iIoQsbZ2AkomxZTOYM.jar
>  in 16 seconds.
> INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/jaccess-5wlKULhaKWM_gmKVtH_QBwVqH4awlxxRdNNfz0z0Imw.jar...
> INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/jaccess-5wlKULhaKWM_gmKVtH_QBwVqH4awlxxRdNNfz0z0Imw.jar
>  in 0 seconds.
> INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/MRJToolkit-jU5qhDBc0cNjn7g3yrGHYO78BRC09T-sE8Syqo9mRjg.jar...
> INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/MRJToolkit-jU5qhDBc0cNjn7g3yrGHYO78BRC09T-sE8Syqo9mRjg.jar
>  in 0 seconds.
> INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to 
> gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/beam-sdks-java-io-expansion-service-2.24.0-SNAPSHOT-A94br32q87Prj7b_mG4_kPEdz9NSJ-0NwgHWEwwU4Qc.jar...
>  
> Out of these we just need 
> 'beam-sdks-java-io-expansion-service-2.24.0-SNAPSHOT-A94br32q87Prj7b_mG4_kPEdz9NSJ-0NwgHWEwwU4Qc.jar'.
>  Rest seems to be due to us including all jars from classpath in the 
> expansion service response.
>  
> [https://github.com/apache/beam/blob/master/sdks/java/expansion-service/src/main/java/org/apache/beam/sdk/expansion/service/ExpansionService.java#L407]
>  
> We should figure out a way to filter out these additional jars.
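The filtering would amount to keeping only the artifacts the expanded transform actually needs rather than the whole classpath. A minimal sketch of that idea in Python (the shortened jar names and the `wanted_prefixes` parameter are illustrative assumptions, not the actual ExpansionService logic):

```python
def filter_artifacts(jar_names, wanted_prefixes):
    """Keep only jars whose file name starts with one of the wanted prefixes.

    jar_names: jar file names as they appear in the expansion response.
    wanted_prefixes: name prefixes of artifacts the transform depends on.
    """
    return [
        name for name in jar_names
        if any(name.startswith(prefix) for prefix in wanted_prefixes)
    ]

jars = [
    "localedata-Wt0bN9j6.jar",
    "jaccess-5wlKULha.jar",
    "beam-sdks-java-io-expansion-service-2.24.0-SNAPSHOT-A94br32q.jar",
]
needed = filter_artifacts(jars, ["beam-sdks-java-io-expansion-service"])
print(needed)  # only the expansion-service jar survives
```

The real fix would live in ExpansionService.java, where the dependency list is assembled; the sketch only shows the filtering step.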



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-10411) Add an example for cross-language Kafka

2020-08-23 Thread Chamikara Madhusanka Jayalath (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-10411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chamikara Madhusanka Jayalath updated BEAM-10411:
-
Status: Resolved  (was: Open)

> Add an example for cross-language Kafka
> ---
>
> Key: BEAM-10411
> URL: https://issues.apache.org/jira/browse/BEAM-10411
> Project: Beam
>  Issue Type: Improvement
>  Components: io-py-kafka
>Reporter: Chamikara Madhusanka Jayalath
>Assignee: Chamikara Madhusanka Jayalath
>Priority: P2
>  Labels: stale-assigned
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>






[jira] [Commented] (BEAM-10507) support xlang custom window fn

2020-08-25 Thread Chamikara Madhusanka Jayalath (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-10507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17184844#comment-17184844
 ] 

Chamikara Madhusanka Jayalath commented on BEAM-10507:
--

We should also add tests that use external standard and custom window fns to 
x-lang test suites. Seems like we don't have such tests currently: 
[https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/validate_runner_xlang_test.py]

> support xlang custom window fn
> --
>
> Key: BEAM-10507
> URL: https://issues.apache.org/jira/browse/BEAM-10507
> Project: Beam
>  Issue Type: Bug
>  Components: cross-language
>Reporter: Heejong Lee
>Assignee: Heejong Lee
>Priority: P2
>
> support xlang custom window fn





[jira] [Commented] (BEAM-10663) Python postcommit fails after BEAM-9977 #11749 merge

2020-08-27 Thread Chamikara Madhusanka Jayalath (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-10663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186119#comment-17186119
 ] 

Chamikara Madhusanka Jayalath commented on BEAM-10663:
--

I think the new SDF Kafka implementation introduced multiple regressions for 
Dataflow. One was fixed by [https://github.com/apache/beam/pull/12585] but 
there's another data loss error that has not been fixed yet.

Given that 2.23.0 used the SDF BoundedSource wrapper, I don't think using the 
same for 2.24.0 can be considered a regression. IMO we should try to fix the 
issues with the new SDF Kafka implementation to enable it for 2.25.0.

 

> Python postcommit fails after BEAM-9977 #11749 merge
> 
>
> Key: BEAM-10663
> URL: https://issues.apache.org/jira/browse/BEAM-10663
> Project: Beam
>  Issue Type: Bug
>  Components: cross-language, io-py-kafka, test-failures
>Affects Versions: 2.24.0
>Reporter: Piotr Szuberski
>Assignee: Boyuan Zhang
>Priority: P0
> Fix For: 2.24.0
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> Python postcommits fail on CrossLanguageKafkaIO python tests after #11749 
> (BEAM-9977) merge.
>  
> Fragment of stack trace:
> ```
> {{Caused by: java.util.concurrent.ExecutionException: 
> java.lang.RuntimeException: Error received from SDK harness for instruction 
> 2: java.util.concurrent.ExecutionException: java.lang.RuntimeException: Could 
> not find a way to create AutoValue class class 
> org.apache.beam.sdk.io.kafka.KafkaSourceDescriptor
>   at 
> java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
>   at 
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
>   at 
> org.apache.beam.sdk.fn.data.CompletableFutureInboundDataClient.awaitCompletion(CompletableFutureInboundDataClient.java:48)
>   at 
> org.apache.beam.sdk.fn.data.BeamFnDataInboundObserver.awaitCompletion(BeamFnDataInboundObserver.java:91)
>   at 
> org.apache.beam.fn.harness.BeamFnDataReadRunner.blockTillReadFinishes(BeamFnDataReadRunner.java:342)
>   at 
> org.apache.beam.fn.harness.data.PTransformFunctionRegistry.lambda$register$0(PTransformFunctionRegistry.java:108)
>   at 
> org.apache.beam.fn.harness.control.ProcessBundleHandler.processBundle(ProcessBundleHandler.java:302)
>   at 
> org.apache.beam.fn.harness.control.BeamFnControlClient.delegateOnInstructionRequestType(BeamFnControlClient.java:173)
>   at 
> org.apache.beam.fn.harness.control.BeamFnControlClient.lambda$processInstructionRequests$0(BeamFnControlClient.java:157)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)}}
> ```





[jira] [Comment Edited] (BEAM-10663) Python postcommit fails after BEAM-9977 #11749 merge

2020-08-27 Thread Chamikara Madhusanka Jayalath (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-10663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186119#comment-17186119
 ] 

Chamikara Madhusanka Jayalath edited comment on BEAM-10663 at 8/27/20, 9:37 PM:


I think the new SDF Kafka implementation introduced multiple regressions for 
Dataflow. One was fixed by [https://github.com/apache/beam/pull/12585] but 
there's another data loss error that has not been fixed yet.

Given that 2.23.0 used the SDF BoundedSource wrapper, I don't think using the 
same for 2.24.0 can be considered a regression. IMO we should try to fix the 
issues with the new SDF Kafka implementation to enable it for 2.25.0.

 


was (Author: chamikara):
I think new SDF Kafka implementation introduced multiple regression for 
Dataflow. One was fixed by [https://github.com/apache/beam/pull/12585] but 
there's another dataloss error that has not been fixed yet. 

Given that 2.23.0 used SDF BoundedSource-wrapper I don't think using the same 
for 2.24.0 can not be considered a regression. IMO we should try to fix issues 
with new SDF Kafka implementation to enable it for 2.25.0.

 

> Python postcommit fails after BEAM-9977 #11749 merge
> 
>
> Key: BEAM-10663
> URL: https://issues.apache.org/jira/browse/BEAM-10663
> Project: Beam
>  Issue Type: Bug
>  Components: cross-language, io-py-kafka, test-failures
>Affects Versions: 2.24.0
>Reporter: Piotr Szuberski
>Assignee: Boyuan Zhang
>Priority: P0
> Fix For: 2.24.0
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> Python postcommits fail on CrossLanguageKafkaIO python tests after #11749 
> (BEAM-9977) merge.
>  
> Fragment of stack trace:
> ```
> {{Caused by: java.util.concurrent.ExecutionException: 
> java.lang.RuntimeException: Error received from SDK harness for instruction 
> 2: java.util.concurrent.ExecutionException: java.lang.RuntimeException: Could 
> not find a way to create AutoValue class class 
> org.apache.beam.sdk.io.kafka.KafkaSourceDescriptor
>   at 
> java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
>   at 
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
>   at 
> org.apache.beam.sdk.fn.data.CompletableFutureInboundDataClient.awaitCompletion(CompletableFutureInboundDataClient.java:48)
>   at 
> org.apache.beam.sdk.fn.data.BeamFnDataInboundObserver.awaitCompletion(BeamFnDataInboundObserver.java:91)
>   at 
> org.apache.beam.fn.harness.BeamFnDataReadRunner.blockTillReadFinishes(BeamFnDataReadRunner.java:342)
>   at 
> org.apache.beam.fn.harness.data.PTransformFunctionRegistry.lambda$register$0(PTransformFunctionRegistry.java:108)
>   at 
> org.apache.beam.fn.harness.control.ProcessBundleHandler.processBundle(ProcessBundleHandler.java:302)
>   at 
> org.apache.beam.fn.harness.control.BeamFnControlClient.delegateOnInstructionRequestType(BeamFnControlClient.java:173)
>   at 
> org.apache.beam.fn.harness.control.BeamFnControlClient.lambda$processInstructionRequests$0(BeamFnControlClient.java:157)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)}}
> ```





[jira] [Updated] (BEAM-10869) Pubsub native write should be supported over fnapi

2020-11-02 Thread Chamikara Madhusanka Jayalath (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-10869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chamikara Madhusanka Jayalath updated BEAM-10869:
-
Priority: P0  (was: P2)

> Pubsub native write should be supported over fnapi
> --
>
> Key: BEAM-10869
> URL: https://issues.apache.org/jira/browse/BEAM-10869
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-dataflow
>Reporter: Boyuan Zhang
>Priority: P0
>  Time Spent: 15h 20m
>  Remaining Estimate: 0h
>






[jira] [Updated] (BEAM-10869) Pubsub native write should be supported over fnapi

2020-11-02 Thread Chamikara Madhusanka Jayalath (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-10869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chamikara Madhusanka Jayalath updated BEAM-10869:
-
Fix Version/s: 2.26.0

> Pubsub native write should be supported over fnapi
> --
>
> Key: BEAM-10869
> URL: https://issues.apache.org/jira/browse/BEAM-10869
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-dataflow
>Reporter: Boyuan Zhang
>Priority: P0
> Fix For: 2.26.0
>
>  Time Spent: 15h 20m
>  Remaining Estimate: 0h
>






[jira] [Commented] (BEAM-10869) Pubsub native write should be supported over fnapi

2020-11-02 Thread Chamikara Madhusanka Jayalath (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-10869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17225023#comment-17225023
 ] 

Chamikara Madhusanka Jayalath commented on BEAM-10869:
--

I believe the remaining work here is updating the runner API proto. This should 
be a blocker for 2.26.0 since we should try to finalize the proto before it's 
released.

I'm looking into this but currently running into an issue due to internal tests 
failing for Dataflow Runner v2.

> Pubsub native write should be supported over fnapi
> --
>
> Key: BEAM-10869
> URL: https://issues.apache.org/jira/browse/BEAM-10869
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-dataflow
>Reporter: Boyuan Zhang
>Priority: P0
> Fix For: 2.26.0
>
>  Time Spent: 15h 20m
>  Remaining Estimate: 0h
>






[jira] [Assigned] (BEAM-10869) Pubsub native write should be supported over fnapi

2020-11-02 Thread Chamikara Madhusanka Jayalath (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-10869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chamikara Madhusanka Jayalath reassigned BEAM-10869:


Assignee: Chamikara Madhusanka Jayalath

> Pubsub native write should be supported over fnapi
> --
>
> Key: BEAM-10869
> URL: https://issues.apache.org/jira/browse/BEAM-10869
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-dataflow
>Reporter: Boyuan Zhang
>Assignee: Chamikara Madhusanka Jayalath
>Priority: P0
> Fix For: 2.26.0
>
>  Time Spent: 15h 20m
>  Remaining Estimate: 0h
>






[jira] [Updated] (BEAM-11129) Populate display data in Beam protos.

2020-11-02 Thread Chamikara Madhusanka Jayalath (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-11129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chamikara Madhusanka Jayalath updated BEAM-11129:
-
Priority: P1  (was: P0)

> Populate display data in Beam protos.
> -
>
> Key: BEAM-11129
> URL: https://issues.apache.org/jira/browse/BEAM-11129
> Project: Beam
>  Issue Type: Bug
>  Components: beam-model, sdk-java-core, sdk-py-core
>Reporter: Robert Bradshaw
>Priority: P1
> Fix For: 2.26.0
>
>






[jira] [Updated] (BEAM-11129) Populate display data in Beam protos.

2020-11-02 Thread Chamikara Madhusanka Jayalath (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-11129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chamikara Madhusanka Jayalath updated BEAM-11129:
-
Fix Version/s: (was: 2.26.0)

> Populate display data in Beam protos.
> -
>
> Key: BEAM-11129
> URL: https://issues.apache.org/jira/browse/BEAM-11129
> Project: Beam
>  Issue Type: Bug
>  Components: beam-model, sdk-java-core, sdk-py-core
>Reporter: Robert Bradshaw
>Priority: P1
>






[jira] [Commented] (BEAM-11129) Populate display data in Beam protos.

2020-11-02 Thread Chamikara Madhusanka Jayalath (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-11129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17225035#comment-17225035
 ] 

Chamikara Madhusanka Jayalath commented on BEAM-11129:
--

I don't believe this is a blocker for 2.26.0.

> Populate display data in Beam protos.
> -
>
> Key: BEAM-11129
> URL: https://issues.apache.org/jira/browse/BEAM-11129
> Project: Beam
>  Issue Type: Bug
>  Components: beam-model, sdk-java-core, sdk-py-core
>Reporter: Robert Bradshaw
>Priority: P0
> Fix For: 2.26.0
>
>






[jira] [Updated] (BEAM-10869) Pubsub native write should be supported over fnapi

2020-11-03 Thread Chamikara Madhusanka Jayalath (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-10869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chamikara Madhusanka Jayalath updated BEAM-10869:
-
Fix Version/s: (was: 2.26.0)

> Pubsub native write should be supported over fnapi
> --
>
> Key: BEAM-10869
> URL: https://issues.apache.org/jira/browse/BEAM-10869
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-dataflow
>Reporter: Boyuan Zhang
>Assignee: Chamikara Madhusanka Jayalath
>Priority: P0
>  Time Spent: 15h 50m
>  Remaining Estimate: 0h
>






[jira] [Updated] (BEAM-10869) Pubsub native write should be supported over fnapi

2020-11-03 Thread Chamikara Madhusanka Jayalath (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-10869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chamikara Madhusanka Jayalath updated BEAM-10869:
-
Priority: P1  (was: P0)

> Pubsub native write should be supported over fnapi
> --
>
> Key: BEAM-10869
> URL: https://issues.apache.org/jira/browse/BEAM-10869
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-dataflow
>Reporter: Boyuan Zhang
>Assignee: Chamikara Madhusanka Jayalath
>Priority: P1
>  Time Spent: 15h 50m
>  Remaining Estimate: 0h
>






[jira] [Commented] (BEAM-10869) Pubsub native write should be supported over fnapi

2020-11-03 Thread Chamikara Madhusanka Jayalath (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-10869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17225865#comment-17225865
 ] 

Chamikara Madhusanka Jayalath commented on BEAM-10869:
--

I believe P0 work is done here. Assigning to Boyuan to close if this is done.

> Pubsub native write should be supported over fnapi
> --
>
> Key: BEAM-10869
> URL: https://issues.apache.org/jira/browse/BEAM-10869
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-dataflow
>Reporter: Boyuan Zhang
>Assignee: Chamikara Madhusanka Jayalath
>Priority: P1
>  Time Spent: 15h 50m
>  Remaining Estimate: 0h
>






[jira] [Assigned] (BEAM-10869) Pubsub native write should be supported over fnapi

2020-11-03 Thread Chamikara Madhusanka Jayalath (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-10869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chamikara Madhusanka Jayalath reassigned BEAM-10869:


Assignee: Boyuan Zhang  (was: Chamikara Madhusanka Jayalath)

> Pubsub native write should be supported over fnapi
> --
>
> Key: BEAM-10869
> URL: https://issues.apache.org/jira/browse/BEAM-10869
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-dataflow
>Reporter: Boyuan Zhang
>Assignee: Boyuan Zhang
>Priority: P1
>  Time Spent: 15h 50m
>  Remaining Estimate: 0h
>






[jira] [Updated] (BEAM-11033) Update Dataflow metrics processor to handle portable jobs

2020-11-10 Thread Chamikara Madhusanka Jayalath (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-11033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chamikara Madhusanka Jayalath updated BEAM-11033:
-
Priority: P1  (was: P2)

> Update Dataflow metrics processor to handle portable jobs
> -
>
> Key: BEAM-11033
> URL: https://issues.apache.org/jira/browse/BEAM-11033
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Chamikara Madhusanka Jayalath
>Assignee: Chamikara Madhusanka Jayalath
>Priority: P1
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> Currently, Dataflow metrics processor expects Dataflow internal step names 
> generated for v1beta3 job description in metrics returned by Dataflow 
> service: 
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/dataflow/dataflow_metrics.py#L97]
>  
> But with portable job submission, Dataflow uses PTransform ID (in proto 
> pipeline) as the internal step name. Hence metrics processor should be 
> updated to handle this.
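The change described above amounts to resolving a reported step name against both naming schemes. A hedged sketch of the lookup logic (pure Python; the data structures and names are illustrative, not the actual dataflow_metrics.py code):

```python
def resolve_user_step_name(metric_step, internal_to_user, proto_transform_ids):
    """Map a step name reported by the Dataflow service to the user step name.

    internal_to_user: mapping from v1beta3 internal step names (e.g. 's2')
        to user step names, as used by legacy (non-portable) jobs.
    proto_transform_ids: PTransform IDs from the proto pipeline; with
        portable job submission the service reports these IDs directly.
    """
    if metric_step in internal_to_user:      # legacy v1beta3 job description
        return internal_to_user[metric_step]
    if metric_step in proto_transform_ids:   # portable job submission
        return metric_step                   # the PTransform ID is the name
    raise KeyError("Unknown step name: %s" % metric_step)

# Legacy job: internal name 's2' resolves through the mapping.
print(resolve_user_step_name("s2", {"s2": "Read/Impulse"}, set()))
# Portable job: the PTransform ID itself is the step name.
print(resolve_user_step_name("ref_AppliedPTransform_Read_2",
                             {}, {"ref_AppliedPTransform_Read_2"}))
```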





[jira] [Reopened] (BEAM-11033) Update Dataflow metrics processor to handle portable jobs

2020-11-10 Thread Chamikara Madhusanka Jayalath (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-11033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chamikara Madhusanka Jayalath reopened BEAM-11033:
--

> Update Dataflow metrics processor to handle portable jobs
> -
>
> Key: BEAM-11033
> URL: https://issues.apache.org/jira/browse/BEAM-11033
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Chamikara Madhusanka Jayalath
>Assignee: Chamikara Madhusanka Jayalath
>Priority: P2
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> Currently, Dataflow metrics processor expects Dataflow internal step names 
> generated for v1beta3 job description in metrics returned by Dataflow 
> service: 
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/dataflow/dataflow_metrics.py#L97]
>  
> But with portable job submission, Dataflow uses PTransform ID (in proto 
> pipeline) as the internal step name. Hence metrics processor should be 
> updated to handle this.





[jira] [Updated] (BEAM-11033) Update Dataflow metrics processor to handle portable jobs

2020-11-10 Thread Chamikara Madhusanka Jayalath (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-11033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chamikara Madhusanka Jayalath updated BEAM-11033:
-
Fix Version/s: 2.26.0

> Update Dataflow metrics processor to handle portable jobs
> -
>
> Key: BEAM-11033
> URL: https://issues.apache.org/jira/browse/BEAM-11033
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Chamikara Madhusanka Jayalath
>Assignee: Chamikara Madhusanka Jayalath
>Priority: P1
> Fix For: 2.26.0
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> Currently, Dataflow metrics processor expects Dataflow internal step names 
> generated for v1beta3 job description in metrics returned by Dataflow 
> service: 
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/dataflow/dataflow_metrics.py#L97]
>  
> But with portable job submission, Dataflow uses PTransform ID (in proto 
> pipeline) as the internal step name. Hence metrics processor should be 
> updated to handle this.





[jira] [Updated] (BEAM-11254) Add documentation for authoring and using multi-language pipelines

2020-11-12 Thread Chamikara Madhusanka Jayalath (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-11254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chamikara Madhusanka Jayalath updated BEAM-11254:
-
Summary: Add documentation for authoring and using multi-language pipelines 
 (was: Add documentation for authoring and using cross-language pipelines)

> Add documentation for authoring and using multi-language pipelines
> --
>
> Key: BEAM-11254
> URL: https://issues.apache.org/jira/browse/BEAM-11254
> Project: Beam
>  Issue Type: New Feature
>  Components: cross-language
>Reporter: Chamikara Madhusanka Jayalath
>Priority: P2
>






[jira] [Created] (BEAM-11254) Add documentation for authoring and using cross-language pipelines

2020-11-12 Thread Chamikara Madhusanka Jayalath (Jira)
Chamikara Madhusanka Jayalath created BEAM-11254:


 Summary: Add documentation for authoring and using cross-language 
pipelines
 Key: BEAM-11254
 URL: https://issues.apache.org/jira/browse/BEAM-11254
 Project: Beam
  Issue Type: New Feature
  Components: cross-language
Reporter: Chamikara Madhusanka Jayalath








[jira] [Updated] (BEAM-11033) Update Dataflow metrics processor to handle portable jobs

2020-11-12 Thread Chamikara Madhusanka Jayalath (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-11033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chamikara Madhusanka Jayalath updated BEAM-11033:
-
Status: Resolved  (was: Open)

> Update Dataflow metrics processor to handle portable jobs
> -
>
> Key: BEAM-11033
> URL: https://issues.apache.org/jira/browse/BEAM-11033
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Chamikara Madhusanka Jayalath
>Assignee: Chamikara Madhusanka Jayalath
>Priority: P1
> Fix For: 2.26.0
>
>  Time Spent: 4h 50m
>  Remaining Estimate: 0h
>
> Currently, Dataflow metrics processor expects Dataflow internal step names 
> generated for v1beta3 job description in metrics returned by Dataflow 
> service: 
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/dataflow/dataflow_metrics.py#L97]
>  
> But with portable job submission, Dataflow uses PTransform ID (in proto 
> pipeline) as the internal step name. Hence metrics processor should be 
> updated to handle this.





[jira] [Commented] (BEAM-6807) Implement an Azure blobstore filesystem for Python SDK

2020-11-13 Thread Chamikara Madhusanka Jayalath (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-6807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17231821#comment-17231821
 ] 

Chamikara Madhusanka Jayalath commented on BEAM-6807:
-

Can we resolve this since [https://github.com/apache/beam/pull/12492] was 
merged?

> Implement an Azure blobstore filesystem for Python SDK
> --
>
> Key: BEAM-6807
> URL: https://issues.apache.org/jira/browse/BEAM-6807
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-py-core
>Reporter: Pablo Estrada
>Assignee: Aldair Coronel
>Priority: P2
>  Labels: GSoC2019, azure, azureblob, gsoc, gsoc2019, gsoc2020, 
> mentor
>  Time Spent: 6h 10m
>  Remaining Estimate: 0h
>
> This is similar to BEAM-2572, but for Azure's blobstore.





[jira] [Commented] (BEAM-7925) ParquetIO supports neither column projection nor filter predicate

2020-11-13 Thread Chamikara Madhusanka Jayalath (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-7925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17231858#comment-17231858
 ] 

Chamikara Madhusanka Jayalath commented on BEAM-7925:
-

Can this Jira be resolved since [https://github.com/apache/beam/pull/12786] 
went in?

 

cc: [~heejong] [~danielxjd]

> ParquetIO supports neither column projection nor filter predicate
> -
>
> Key: BEAM-7925
> URL: https://issues.apache.org/jira/browse/BEAM-7925
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-parquet
>Affects Versions: 2.14.0
>Reporter: Neville Li
>Priority: P3
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> Current {{ParquetIO}} supports neither column projection nor filter predicate 
> which defeats the performance motivation of using Parquet in the first place. 
> That's why we have our own implementation of 
> [ParquetIO|https://github.com/spotify/scio/tree/master/scio-parquet/src] in 
> Scio.
> Reading Parquet as Avro with column projection has some complications, 
> namely, the resulting Avro records may be incomplete and will not survive 
> ser/de. A workaround may be to provide a {{TypedRead}} interface that takes 
> a {{Function}} that maps invalid Avro {{A}} into a user-defined type {{B}}.
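For context, column projection and predicate pushdown are about decoding less data. The intent can be sketched over in-memory row dicts in plain Python (illustrative only; a real Parquet reader pushes both down so that skipped columns and filtered row groups are never decoded at all):

```python
def read_with_projection(records, columns, predicate=None):
    """Simulate a projected, filtered read over rows represented as dicts."""
    for row in records:
        if predicate is not None and not predicate(row):
            continue  # a columnar reader can skip whole row groups here
        yield {c: row[c] for c in columns}  # decode only requested columns

rows = [
    {"user": "a", "score": 10, "payload": "x" * 100},
    {"user": "b", "score": 3,  "payload": "y" * 100},
]
out = list(read_with_projection(rows, ["user"], lambda r: r["score"] > 5))
print(out)  # [{'user': 'a'}] -- the large 'payload' column is never touched
```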





[jira] [Updated] (BEAM-10397) add missing environment in windowing strategy for Dataflow

2020-07-15 Thread Chamikara Madhusanka Jayalath (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-10397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chamikara Madhusanka Jayalath updated BEAM-10397:
-
Fix Version/s: 2.23.0

> add missing environment in windowing strategy for Dataflow
> --
>
> Key: BEAM-10397
> URL: https://issues.apache.org/jira/browse/BEAM-10397
> Project: Beam
>  Issue Type: Bug
>  Components: cross-language, runner-dataflow
>Reporter: Heejong Lee
>Assignee: Heejong Lee
>Priority: P2
> Fix For: 2.23.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> add missing environment in windowing strategy for Dataflow





[jira] [Updated] (BEAM-10397) add missing environment in windowing strategy for Dataflow

2020-07-15 Thread Chamikara Madhusanka Jayalath (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-10397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chamikara Madhusanka Jayalath updated BEAM-10397:
-
Priority: P1  (was: P2)

> add missing environment in windowing strategy for Dataflow
> --
>
> Key: BEAM-10397
> URL: https://issues.apache.org/jira/browse/BEAM-10397
> Project: Beam
>  Issue Type: Bug
>  Components: cross-language, runner-dataflow
>Reporter: Heejong Lee
>Assignee: Heejong Lee
>Priority: P1
> Fix For: 2.23.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> add missing environment in windowing strategy for Dataflow



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-10507) support xlang custom window fn

2020-07-15 Thread Chamikara Madhusanka Jayalath (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-10507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chamikara Madhusanka Jayalath updated BEAM-10507:
-
Priority: P1  (was: P2)

> support xlang custom window fn
> --
>
> Key: BEAM-10507
> URL: https://issues.apache.org/jira/browse/BEAM-10507
> Project: Beam
>  Issue Type: Bug
>  Components: cross-language
>Reporter: Heejong Lee
>Assignee: Heejong Lee
>Priority: P1
>
> support xlang custom window fn





[jira] [Updated] (BEAM-10507) support xlang custom window fn

2020-07-15 Thread Chamikara Madhusanka Jayalath (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-10507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chamikara Madhusanka Jayalath updated BEAM-10507:
-
Priority: P2  (was: P1)

> support xlang custom window fn
> --
>
> Key: BEAM-10507
> URL: https://issues.apache.org/jira/browse/BEAM-10507
> Project: Beam
>  Issue Type: Bug
>  Components: cross-language
>Reporter: Heejong Lee
>Assignee: Heejong Lee
>Priority: P2
>
> support xlang custom window fn





[jira] [Created] (BEAM-10511) Allow users to specify the temporary dataset for BQ

2020-07-16 Thread Chamikara Madhusanka Jayalath (Jira)
Chamikara Madhusanka Jayalath created BEAM-10511:


 Summary: Allow users to specify the temporary dataset for BQ
 Key: BEAM-10511
 URL: https://issues.apache.org/jira/browse/BEAM-10511
 Project: Beam
  Issue Type: Improvement
  Components: io-py-gcp
Reporter: Chamikara Madhusanka Jayalath
Assignee: Pablo Estrada


We recently added this to Java: https://issues.apache.org/jira/browse/BEAM-8458

 

We should add this to Python custom source as well.

 

Pablo, assigning to you for now. I suspect this will be a relatively small 
change for the custom source.
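Roughly, the change for the Python custom source would be to fall back to a generated temporary dataset only when the user has not supplied one. A hedged sketch (the helper name and the ownership convention are hypothetical):

```python
import uuid

def choose_temp_dataset(user_temp_dataset=None):
    """Return (dataset_name, pipeline_owns_it) for BigQuery export reads.

    If the user specified a dataset, use it and do not delete it afterwards;
    otherwise generate a unique one that the source must clean up.
    """
    if user_temp_dataset:
        return user_temp_dataset, False  # caller owns it; don't delete
    generated = "beam_temp_dataset_" + uuid.uuid4().hex
    return generated, True               # pipeline owns it; delete after read

name, owned = choose_temp_dataset("analytics_scratch")
print(name, owned)  # analytics_scratch False
```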





[jira] [Commented] (BEAM-6868) Flink runner supports Bundle Finalization

2020-07-20 Thread Chamikara Madhusanka Jayalath (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-6868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17161563#comment-17161563
 ] 

Chamikara Madhusanka Jayalath commented on BEAM-6868:
-

[~boyuanz] is this a regression? X-lang KafkaIO used to work for portable 
Flink/Spark AFAIK.

(also, this seems higher priority than P3)

> Flink runner supports Bundle Finalization
> -
>
> Key: BEAM-6868
> URL: https://issues.apache.org/jira/browse/BEAM-6868
> Project: Beam
>  Issue Type: New Feature
>  Components: runner-flink
>Reporter: Boyuan Zhang
>Priority: P3
>






[jira] [Updated] (BEAM-6868) Flink runner supports Bundle Finalization

2020-07-20 Thread Chamikara Madhusanka Jayalath (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-6868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chamikara Madhusanka Jayalath updated BEAM-6868:

Priority: P1  (was: P3)

> Flink runner supports Bundle Finalization
> -
>
> Key: BEAM-6868
> URL: https://issues.apache.org/jira/browse/BEAM-6868
> Project: Beam
>  Issue Type: New Feature
>  Components: runner-flink
>Reporter: Boyuan Zhang
>Priority: P1
>






[jira] [Comment Edited] (BEAM-6868) Flink runner supports Bundle Finalization

2020-07-20 Thread Chamikara Madhusanka Jayalath (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-6868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17161563#comment-17161563
 ] 

Chamikara Madhusanka Jayalath edited comment on BEAM-6868 at 7/20/20, 9:46 PM:
---

[~boyuanz] and [~mxm] is this a regression? X-lang KafkaIO used to work for 
portable Flink/Spark AFAIK.

(also, this seems higher priority than P3)


was (Author: chamikara):
[~boyuanz] is this a regression ? X-lang KafkaIO used to work for portable 
Flink/Spark AFAIK.

(also seems like this is high priority than P3)

> Flink runner supports Bundle Finalization
> -
>
> Key: BEAM-6868
> URL: https://issues.apache.org/jira/browse/BEAM-6868
> Project: Beam
>  Issue Type: New Feature
>  Components: runner-flink
>Reporter: Boyuan Zhang
>Priority: P1
>






[jira] [Updated] (BEAM-6868) Flink runner supports Bundle Finalization

2020-07-22 Thread Chamikara Madhusanka Jayalath (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-6868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chamikara Madhusanka Jayalath updated BEAM-6868:

Component/s: cross-language

> Flink runner supports Bundle Finalization
> -
>
> Key: BEAM-6868
> URL: https://issues.apache.org/jira/browse/BEAM-6868
> Project: Beam
>  Issue Type: New Feature
>  Components: cross-language, runner-flink
>Reporter: Boyuan Zhang
>Priority: P1
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-6868) Flink runner supports Bundle Finalization

2020-07-23 Thread Chamikara Madhusanka Jayalath (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-6868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17164050#comment-17164050
 ] 

Chamikara Madhusanka Jayalath commented on BEAM-6868:
-

I think users are running into this. For example, see 
[https://lists.apache.org/thread.html/re53d28276def9593dc5f28cbd9710d141aa0b68e2e0486d401724398%40%3Cdev.beam.apache.org%3E]

> Flink runner supports Bundle Finalization
> -
>
> Key: BEAM-6868
> URL: https://issues.apache.org/jira/browse/BEAM-6868
> Project: Beam
>  Issue Type: New Feature
>  Components: cross-language, runner-flink
>Reporter: Boyuan Zhang
>Priority: P1
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-10593) Publish SNAPSHOT SDK harness containers nightly

2020-07-28 Thread Chamikara Madhusanka Jayalath (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-10593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chamikara Madhusanka Jayalath updated BEAM-10593:
-
Component/s: cross-language

> Publish SNAPSHOT SDK harness containers nightly
> ---
>
> Key: BEAM-10593
> URL: https://issues.apache.org/jira/browse/BEAM-10593
> Project: Beam
>  Issue Type: Task
>  Components: cross-language, sdk-java-harness, sdk-py-harness
>Reporter: Brian Hulette
>Priority: P2
>
> Facilitates testing distributed runners at HEAD. 
> Discussed on 
> [dev@|https://lists.apache.org/thread.html/rd26405613c9b7b53c860080a9372e54aa5f3410e5be0db8ca414ec07%40%3Cdev.beam.apache.org%3E]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-10011) Test Python SqlTransform on Dataflow

2020-07-28 Thread Chamikara Madhusanka Jayalath (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-10011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chamikara Madhusanka Jayalath updated BEAM-10011:
-
Component/s: cross-language

> Test Python SqlTransform on Dataflow
> 
>
> Key: BEAM-10011
> URL: https://issues.apache.org/jira/browse/BEAM-10011
> Project: Beam
>  Issue Type: Improvement
>  Components: cross-language, sdk-py-core
>Reporter: Brian Hulette
>Assignee: Brian Hulette
>Priority: P2
>
> We should run sql_test.py on Dataflow. Also need to make sure Java VR suite 
> is run on Dataflow runner v2 (to verify the expanded SQL transforms will work 
> there).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (BEAM-10248) Beam does not set correct region for BigQuery when requesting load job status

2020-07-29 Thread Chamikara Madhusanka Jayalath (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-10248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chamikara Madhusanka Jayalath reassigned BEAM-10248:


Assignee: Chamikara Madhusanka Jayalath

> Beam does not set correct region for BigQuery when requesting load job status
> -
>
> Key: BEAM-10248
> URL: https://issues.apache.org/jira/browse/BEAM-10248
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Affects Versions: 2.22.0
> Environment: -Beam 2.22.0
> -DirectRunner & DataflowRunner
> -Java 11
>Reporter: Graham Polley
>Assignee: Chamikara Madhusanka Jayalath
>Priority: P1
> Attachments: Untitled document.pdf
>
>
> I am using `FILE_LOADS` (micro-batching) from Pub/Sub and writing to 
> BigQuery. My BigQuery dataset is in region `australia-southeast1`.
> If the load job into BigQuery fails for some reason (e.g. table does not 
> exist, or schema has changed), then an error from BigQuery is returned for 
> the load.
> Beam then enters into a loop of retrying the load job. However, instead of 
> using a new job id (where the suffix is incremented by 1), it wrongly tries 
> to reinsert the job using the same job id. This is because when it tries to 
> look up if the job id already exists, it does not take into consideration the 
> region where the dataset is. Instead, it defaults to the US but the job was 
> created in `australia-southeast1`, so it returns `null`.
> Bug is on LN308 of `BigQueryHelpers.java`. It should not return `null`. It 
> returns `null` because it's looking in the wrong region.
> If I *test* using a dataset in BigQuery that is in the US region, then it 
> correctly finds the job id and begins to retry the job with a new job id 
> (suffixed with the retry count). We have seen this bug in other areas of Beam 
> before and in other tools/services on GCP e.g. Cloud Composer.
> However, that leads me to my next problem/bug. Even if that is fixed, the 
> number of retries is set to `Integer.MAX_VALUE`, so Beam will keep retrying 
> the job 2,147,483,647 times. This is not good.
> The exception is swallowed up by the Beam SDK and never propagated back up 
> the stack for users to catch and handle. So, if a load job fails there is no 
> way to handle it and react for users.
> I attempted to set `withFailedInsertRetryPolicy()` to transient errors only, 
> but it is not supported with `FILE_LOADS`. I also tried, using the 
> `WriteResult` object returned from the bigQuery sink/write, to get a handle 
> on the error but it does not work. Users need a way to catch and handle failed 
> load jobs when using `FILE_LOADS`.
>  
> {code:java}
> public class TemplatePipeline {
>  private static final String TOPIC = 
> "projects/etl-demo-269105/topics/test-micro-batching";
>  private static final String BIGQUERY_DESTINATION_FILTERED = 
> "etl-demo-269105:etl_demo_fun.micro_batch_test_xxx";
>  public static void main(String[] args) throws Exception {
>  try {
>  PipelineOptionsFactory.register(DataflowPipelineOptions.class);
>  DataflowPipelineOptions options = PipelineOptionsFactory
>  .fromArgs(args)
>  .withoutStrictParsing()
>  .as(DataflowPipelineOptions.class);
>  Pipeline pipeline = Pipeline.create(options);
>  PCollection messages = pipeline
>  .apply(PubsubIO.readMessagesWithAttributes().fromTopic(String.format(TOPIC, 
> options.getProject())))
>  .apply(Window.into(FixedWindows.of(Duration.standardSeconds(1))));
>  WriteResult result = messages.apply(ParDo.of(new RowToBQRow()))
>  .apply(BigQueryIO.writeTableRows()
>  .to(String.format(BIGQUERY_DESTINATION_FILTERED, options.getProject()))
>  .withCreateDisposition(CREATE_NEVER)
>  .withWriteDisposition(WRITE_APPEND)
>  .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
>  .withTriggeringFrequency(Duration.standardSeconds(5))
>  .withNumFileShards(1)
>  .withExtendedErrorInfo()
>  .withSchema(getTableSchema()));
> // result.getFailedInsertsWithErr().apply(ParDo.of(new 
> DoFn() {
> // @ProcessElement
> // public void processElement(ProcessContext c) {
> // for(ErrorProto err : c.element().getError().getErrors()){
> // throw new RuntimeException(err.getMessage());
> // }
> //
> // }
> // }));
>  result.getFailedInserts().apply(ParDo.of(new DoFn() {
>  @ProcessElement
>  public void processElement(ProcessContext c) {
>  System.out.println(c.element());
>  c.output("foo");
>  throw new RuntimeException("Failed to load");
>  }
>  @FinishBundle
>  public void finishUp(FinishBundleContext finishBundleContextc){
>  System.out.println("Got here");
>  }
>  }));
>  pipeline.run();
>  } catch (Exception e) {
>  e.printStackTrace();
>  throw new Exception(e);
>  }
>  }
>  private static TableSchema getTableSchema() {
>  List fields = new ArrayList<>();
>  fields.add(new TableField

[jira] [Comment Edited] (BEAM-10248) Beam does not set correct region for BigQuery when requesting load job status

2020-07-29 Thread Chamikara Madhusanka Jayalath (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-10248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17167389#comment-17167389
 ] 

Chamikara Madhusanka Jayalath edited comment on BEAM-10248 at 7/29/20, 5:39 PM:


There were three separate concerns here.

(1) Not setting the correct region when checking the status of the job does 
seem like a bug. 

 

(2) "Even if that is fixed, the number of retries is set to `Integer.MAX_VALUE`"

I'm not sure if this analysis is correct. Number of status poll attempts is set 
to Integer.MAX_VALUE but the number of retries of the job seems to be coming 
from below which is set to 3.

[https://github.com/apache/beam/blob/6fdde4f4eab72b49b10a8bb1cb3be263c5c416b5/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteTables.java#L449]

[https://github.com/apache/beam/blob/6fdde4f4eab72b49b10a8bb1cb3be263c5c416b5/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java#L116]

 

Note though that for streaming jobs, Dataflow will keep trying failed workitems 
forever. So essentially this BQ job will be attempted forever without the job 
failing. This is just the way Dataflow streaming runner operates.

 

(3)

"So, if a load job fails there is no way to handle it and react for users."

This is a known limitation and seems to be separate from the intermediate issue 
here. Currently withFailedInsertRetryPolicy() and WriteResult is only supported 
for streaming inserts. But we should be logging job failures in worker logs.

 

 

 


was (Author: chamikara):
Not setting the correct region when checking the status of the job does seem 
like a bug. 

 

However, that leads me to my next problem/bug. Even if that is fixed, the 
number of retries is set to `Integer.MAX_VALUE`: I'm not sure if this is 
correct. Number of status poll attempts is set to Integer.MAX_VALUE but the 
number of retries of the job seems to be coming from below which is set to 3.

[https://github.com/apache/beam/blob/6fdde4f4eab72b49b10a8bb1cb3be263c5c416b5/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteTables.java#L449]

[https://github.com/apache/beam/blob/6fdde4f4eab72b49b10a8bb1cb3be263c5c416b5/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java#L116]

 

Note though that for streaming jobs, Dataflow will keep trying failed workitems 
forever. So essentially this BQ job will be attempted forever without the job 
failing. This is just the way Dataflow streaming runner operates.

 

"So, if a load job fails there is no way to handle it and react for users.": 
This is a known limitation and seems to be separate from the intermediate issue 
here. Currently withFailedInsertRetryPolicy() and WriteResult is only supported 
for streaming inserts. But we should be logging job failures in worker logs.

 

 

 

> Beam does not set correct region for BigQuery when requesting load job status
> -
>
> Key: BEAM-10248
> URL: https://issues.apache.org/jira/browse/BEAM-10248
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Affects Versions: 2.22.0
> Environment: -Beam 2.22.0
> -DirectRunner & DataflowRunner
> -Java 11
>Reporter: Graham Polley
>Assignee: Chamikara Madhusanka Jayalath
>Priority: P1
> Attachments: Untitled document.pdf
>
>
> I am using `FILE_LOADS` (micro-batching) from Pub/Sub and writing to 
> BigQuery. My BigQuery dataset is in region `australia-southeast1`.
> If the load job into BigQuery fails for some reason (e.g. table does not 
> exist, or schema has changed), then an error from BigQuery is returned for 
> the load.
> Beam then enters into a loop of retrying the load job. However, instead of 
> using a new job id (where the suffix is incremented by 1), it wrongly tries 
> to reinsert the job using the same job id. This is because when it tries to 
> look up if the job id already exists, it does not take into consideration the 
> region where the dataset is. Instead, it defaults to the US but the job was 
> created in `australia-southeast1`, so it returns `null`.
> Bug is on LN308 of `BigQueryHelpers.java`. It should not return `null`. It 
> returns `null` because it's looking in the wrong region.
> If I *test* using a dataset in BigQuery that is in the US region, then it 
> correctly finds the job id and begins to retry the job with a new job id 
> (suffixed with the retry count). We have seen this bug in other areas of Beam 
> before and in other tools/services on GCP e.g. Cloud Composer.
> However, that leads me to my next problem/bug. Even if that is fixed, the 
> number of retries is set to `Integer.MAX_VALUE`

[jira] [Commented] (BEAM-10248) Beam does not set correct region for BigQuery when requesting load job status

2020-07-29 Thread Chamikara Madhusanka Jayalath (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-10248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17167389#comment-17167389
 ] 

Chamikara Madhusanka Jayalath commented on BEAM-10248:
--

Not setting the correct region when checking the status of the job does seem 
like a bug. 
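The reported failure mode can be modeled with a toy lookup keyed by (project, location, jobId): if the client omits the location, it falls back to the default region ("US"), so a job created in `australia-southeast1` is simply not found and the lookup returns null. This is only a sketch of the mechanism under that assumption; the class and method names below are hypothetical and are not the actual BigQuery client API.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the reported bug: BigQuery job status is effectively keyed by
// (project, location, jobId). A lookup that omits the location defaults to
// "US", so a job created in "australia-southeast1" appears not to exist.
// Hypothetical names; not the actual BigQuery client.
public class JobLookupSketch {
    private final Map<String, String> jobs = new HashMap<>();

    void createJob(String project, String location, String jobId, String status) {
        jobs.put(project + "/" + location + "/" + jobId, status);
    }

    // Lookup without an explicit location falls back to the default region.
    String getJobStatus(String project, String jobId) {
        return getJobStatus(project, "US", jobId);
    }

    String getJobStatus(String project, String location, String jobId) {
        return jobs.get(project + "/" + location + "/" + jobId);
    }

    public static void main(String[] args) {
        JobLookupSketch bq = new JobLookupSketch();
        bq.createJob("my-project", "australia-southeast1", "load_1", "FAILED");
        System.out.println(bq.getJobStatus("my-project", "load_1"));                         // null
        System.out.println(bq.getJobStatus("my-project", "australia-southeast1", "load_1")); // FAILED
    }
}
```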

 

However, that leads me to my next problem/bug. Even if that is fixed, the 
number of retries is set to `Integer.MAX_VALUE`: I'm not sure if this is 
correct. Number of status poll attempts is set to Integer.MAX_VALUE but the 
number of retries of the job seems to be coming from below which is set to 3.

[https://github.com/apache/beam/blob/6fdde4f4eab72b49b10a8bb1cb3be263c5c416b5/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteTables.java#L449]

[https://github.com/apache/beam/blob/6fdde4f4eab72b49b10a8bb1cb3be263c5c416b5/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java#L116]

 

Note though that for streaming jobs, Dataflow will keep trying failed workitems 
forever. So essentially this BQ job will be attempted forever without the job 
failing. This is just the way Dataflow streaming runner operates.
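For reference, the bounded job-level retry described above (a fresh job id per attempt, derived from a base id plus the retry index) can be sketched as a plain loop. This is a simplified illustration, not Beam's actual API: `buildJobId` and `runWithRetries` are made-up names, and the limit of 3 mirrors the constant in the linked BatchLoads code.

```java
import java.util.function.Predicate;

// Simplified illustration of retrying a load job under fresh job ids, so a
// failed attempt is re-submitted under a new id rather than the same one.
// Hypothetical names; not Beam's actual BigQueryIO internals.
public class LoadJobRetrySketch {
    static final int MAX_RETRIES = 3; // the linked BatchLoads code sets this to 3

    // Each attempt gets a new id: base id plus the attempt index as a suffix.
    static String buildJobId(String baseJobId, int attempt) {
        return baseJobId + "_" + attempt;
    }

    // Returns the id of the attempt that succeeded, or null if all attempts failed.
    static String runWithRetries(String baseJobId, Predicate<String> submitJob) {
        for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
            String jobId = buildJobId(baseJobId, attempt);
            if (submitJob.test(jobId)) {
                return jobId;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Simulate a job that fails twice, then succeeds on the third attempt.
        java.util.concurrent.atomic.AtomicInteger calls =
            new java.util.concurrent.atomic.AtomicInteger();
        String winner = runWithRetries("beam_load_abc", jobId -> calls.incrementAndGet() >= 3);
        System.out.println(winner); // beam_load_abc_2
    }
}
```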

 

"So, if a load job fails there is no way to handle it and react for users.": 
This is a known limitation and seems to be separate from the intermediate issue 
here. Currently withFailedInsertRetryPolicy() and WriteResult is only supported 
for streaming inserts. But we should be logging job failures in worker logs.

 

 

 

> Beam does not set correct region for BigQuery when requesting load job status
> -
>
> Key: BEAM-10248
> URL: https://issues.apache.org/jira/browse/BEAM-10248
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Affects Versions: 2.22.0
> Environment: -Beam 2.22.0
> -DirectRunner & DataflowRunner
> -Java 11
>Reporter: Graham Polley
>Assignee: Chamikara Madhusanka Jayalath
>Priority: P1
> Attachments: Untitled document.pdf
>
>
> I am using `FILE_LOADS` (micro-batching) from Pub/Sub and writing to 
> BigQuery. My BigQuery dataset is in region `australia-southeast1`.
> If the load job into BigQuery fails for some reason (e.g. table does not 
> exist, or schema has changed), then an error from BigQuery is returned for 
> the load.
> Beam then enters into a loop of retrying the load job. However, instead of 
> using a new job id (where the suffix is incremented by 1), it wrongly tries 
> to reinsert the job using the same job id. This is because when it tries to 
> look up if the job id already exists, it does not take into consideration the 
> region where the dataset is. Instead, it defaults to the US but the job was 
> created in `australia-southeast1`, so it returns `null`.
> Bug is on LN308 of `BigQueryHelpers.java`. It should not return `null`. It 
> returns `null` because it's looking in the wrong region.
> If I *test* using a dataset in BigQuery that is in the US region, then it 
> correctly finds the job id and begins to retry the job with a new job id 
> (suffixed with the retry count). We have seen this bug in other areas of Beam 
> before and in other tools/services on GCP e.g. Cloud Composer.
> However, that leads me to my next problem/bug. Even if that is fixed, the 
> number of retries is set to `Integer.MAX_VALUE`, so Beam will keep retrying 
> the job 2,147,483,647 times. This is not good.
> The exception is swallowed up by the Beam SDK and never propagated back up 
> the stack for users to catch and handle. So, if a load job fails there is no 
> way to handle it and react for users.
> I attempted to set `withFailedInsertRetryPolicy()` to transient errors only, 
> but it is not supported with `FILE_LOADS`. I also tried, using the 
> `WriteResult` object returned from the bigQuery sink/write, to get a handle 
> on the error but it does not work. Users need a way to catch and handle failed 
> load jobs when using `FILE_LOADS`.
>  
> {code:java}
> public class TemplatePipeline {
>  private static final String TOPIC = 
> "projects/etl-demo-269105/topics/test-micro-batching";
>  private static final String BIGQUERY_DESTINATION_FILTERED = 
> "etl-demo-269105:etl_demo_fun.micro_batch_test_xxx";
>  public static void main(String[] args) throws Exception {
>  try {
>  PipelineOptionsFactory.register(DataflowPipelineOptions.class);
>  DataflowPipelineOptions options = PipelineOptionsFactory
>  .fromArgs(args)
>  .withoutStrictParsing()
>  .as(DataflowPipelineOptions.class);
>  Pipeline pipeline = Pipeline.create(options);
>  PCollection messages = pipeline
>  .apply(PubsubIO.readMessagesWithAttributes().fromTopic(String.format(TOPIC, 
> options.getProject(
>  .apply(Window.into(FixedWindows.of(Duration.standardSecond

[jira] [Comment Edited] (BEAM-10248) Beam does not set correct region for BigQuery when requesting load job status

2020-07-29 Thread Chamikara Madhusanka Jayalath (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-10248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17167389#comment-17167389
 ] 

Chamikara Madhusanka Jayalath edited comment on BEAM-10248 at 7/29/20, 5:39 PM:


There were three separate concerns here.

 

(1) Not setting the correct region when checking the status of the job does 
seem like a bug. 

 

(2) "Even if that is fixed, the number of retries is set to `Integer.MAX_VALUE`"

I'm not sure if this analysis is correct. Number of status poll attempts is set 
to Integer.MAX_VALUE but the number of retries of the job seems to be coming 
from below which is set to 3.

[https://github.com/apache/beam/blob/6fdde4f4eab72b49b10a8bb1cb3be263c5c416b5/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteTables.java#L449]

[https://github.com/apache/beam/blob/6fdde4f4eab72b49b10a8bb1cb3be263c5c416b5/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java#L116]

 

Note though that for streaming jobs, Dataflow will keep trying failed workitems 
forever. So essentially this BQ job will be attempted forever without the job 
failing. This is just the way Dataflow streaming runner operates.

 

(3)

"So, if a load job fails there is no way to handle it and react for users."

This is a known limitation and seems to be separate from the intermediate issue 
here. Currently withFailedInsertRetryPolicy() and WriteResult is only supported 
for streaming inserts. But we should be logging job failures in worker logs.

 

 

 


was (Author: chamikara):
There were three separate concerns here.

(1) Not setting the correct region when checking the status of the job does 
seem like a bug. 

 

(2) "Even if that is fixed, the number of retries is set to `Integer.MAX_VALUE`"

I'm not sure if this analysis is correct. Number of status poll attempts is set 
to Integer.MAX_VALUE but the number of retries of the job seems to be coming 
from below which is set to 3.

[https://github.com/apache/beam/blob/6fdde4f4eab72b49b10a8bb1cb3be263c5c416b5/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteTables.java#L449]

[https://github.com/apache/beam/blob/6fdde4f4eab72b49b10a8bb1cb3be263c5c416b5/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java#L116]

 

Note though that for streaming jobs, Dataflow will keep trying failed workitems 
forever. So essentially this BQ job will be attempted forever without the job 
failing. This is just the way Dataflow streaming runner operates.

 

(3)

"So, if a load job fails there is no way to handle it and react for users."

This is a known limitation and seems to be separate from the intermediate issue 
here. Currently withFailedInsertRetryPolicy() and WriteResult is only supported 
for streaming inserts. But we should be logging job failures in worker logs.

 

 

 

> Beam does not set correct region for BigQuery when requesting load job status
> -
>
> Key: BEAM-10248
> URL: https://issues.apache.org/jira/browse/BEAM-10248
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Affects Versions: 2.22.0
> Environment: -Beam 2.22.0
> -DirectRunner & DataflowRunner
> -Java 11
>Reporter: Graham Polley
>Assignee: Chamikara Madhusanka Jayalath
>Priority: P1
> Attachments: Untitled document.pdf
>
>
> I am using `FILE_LOADS` (micro-batching) from Pub/Sub and writing to 
> BigQuery. My BigQuery dataset is in region `australia-southeast1`.
> If the load job into BigQuery fails for some reason (e.g. table does not 
> exist, or schema has changed), then an error from BigQuery is returned for 
> the load.
> Beam then enters into a loop of retrying the load job. However, instead of 
> using a new job id (where the suffix is incremented by 1), it wrongly tries 
> to reinsert the job using the same job id. This is because when it tries to 
> look up if the job id already exists, it does not take into consideration the 
> region where the dataset is. Instead, it defaults to the US but the job was 
> created in `australia-southeast1`, so it returns `null`.
> Bug is on LN308 of `BigQueryHelpers.java`. It should not return `null`. It 
> returns `null` because it's looking in the wrong region.
> If I *test* using a dataset in BigQuery that is in the US region, then it 
> correctly finds the job id and begins to retry the job with a new job id 
> (suffixed with the retry count). We have seen this bug in other areas of Beam 
> before and in other tools/services on GCP e.g. Cloud Composer.
> However, that leads me to my next problem/bug. Even if that is fixed, the 
> number of retries is set t

[jira] [Comment Edited] (BEAM-10248) Beam does not set correct region for BigQuery when requesting load job status

2020-07-29 Thread Chamikara Madhusanka Jayalath (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-10248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17167389#comment-17167389
 ] 

Chamikara Madhusanka Jayalath edited comment on BEAM-10248 at 7/29/20, 6:06 PM:


There were three separate concerns here.

 

(1) Not setting the correct region when checking the status of the job does 
seem like a bug. 

 

(2) "Even if that is fixed, the number of retries is set to `Integer.MAX_VALUE`"

I'm not sure if this analysis is correct. Number of status poll attempts is set 
to Integer.MAX_VALUE but the number of retries of the job seems to be coming 
from below which is set to 3.

[https://github.com/apache/beam/blob/6fdde4f4eab72b49b10a8bb1cb3be263c5c416b5/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteTables.java#L449]

[https://github.com/apache/beam/blob/6fdde4f4eab72b49b10a8bb1cb3be263c5c416b5/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java#L116]

 

Note though that for streaming jobs, Dataflow will keep trying failed workitems 
forever. So essentially this BQ job will be attempted forever without the 
Dataflow job failing. This is just the way Dataflow streaming runner operates.

 

(3)

"So, if a load job fails there is no way to handle it and react for users."

This is a known limitation and seems to be separate from the intermediate issue 
here. Currently withFailedInsertRetryPolicy() and WriteResult is only supported 
for streaming inserts. But we should be logging job failures in worker logs.

 

 

 


was (Author: chamikara):
There were three separate concerns here.

 

(1) Not setting the correct region when checking the status of the job does 
seem like a bug. 

 

(2) "Even if that is fixed, the number of retries is set to `Integer.MAX_VALUE`"

I'm not sure if this analysis is correct. Number of status poll attempts is set 
to Integer.MAX_VALUE but the number of retries of the job seems to be coming 
from below which is set to 3.

[https://github.com/apache/beam/blob/6fdde4f4eab72b49b10a8bb1cb3be263c5c416b5/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteTables.java#L449]

[https://github.com/apache/beam/blob/6fdde4f4eab72b49b10a8bb1cb3be263c5c416b5/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java#L116]

 

Note though that for streaming jobs, Dataflow will keep trying failed workitems 
forever. So essentially this BQ job will be attempted forever without the job 
failing. This is just the way Dataflow streaming runner operates.

 

(3)

"So, if a load job fails there is no way to handle it and react for users."

This is a known limitation and seems to be separate from the intermediate issue 
here. Currently withFailedInsertRetryPolicy() and WriteResult is only supported 
for streaming inserts. But we should be logging job failures in worker logs.

 

 

 

> Beam does not set correct region for BigQuery when requesting load job status
> -
>
> Key: BEAM-10248
> URL: https://issues.apache.org/jira/browse/BEAM-10248
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Affects Versions: 2.22.0
> Environment: -Beam 2.22.0
> -DirectRunner & DataflowRunner
> -Java 11
>Reporter: Graham Polley
>Assignee: Chamikara Madhusanka Jayalath
>Priority: P1
> Attachments: Untitled document.pdf
>
>
> I am using `FILE_LOADS` (micro-batching) from Pub/Sub and writing to 
> BigQuery. My BigQuery dataset is in region `australia-southeast1`.
> If the load job into BigQuery fails for some reason (e.g. table does not 
> exist, or schema has changed), then an error from BigQuery is returned for 
> the load.
> Beam then enters into a loop of retrying the load job. However, instead of 
> using a new job id (where the suffix is incremented by 1), it wrongly tries 
> to reinsert the job using the same job id. This is because when it tries to 
> look up if the job id already exists, it does not take into consideration the 
> region where the dataset is. Instead, it defaults to the US but the job was 
> created in `australia-southeast1`, so it returns `null`.
> Bug is on LN308 of `BigQueryHelpers.java`. It should not return `null`. It 
> returns `null` because it's looking in the wrong region.
> If I *test* using a dataset in BigQuery that is in the US region, then it 
> correctly finds the job id and begins to retry the job with a new job id 
> (suffixed with the retry count). We have seen this bug in other areas of Beam 
> before and in other tools/services on GCP e.g. Cloud Composer.
> However, that leads me to my next problem/bug. Even if that is fixed, the 
> number of retr

[jira] [Commented] (BEAM-10248) Beam does not set correct region for BigQuery when requesting load job status

2020-07-30 Thread Chamikara Madhusanka Jayalath (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-10248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168164#comment-17168164
 ] 

Chamikara Madhusanka Jayalath commented on BEAM-10248:
--

Hmm, looking a bit further into the code, it seems like we do set the location 
in the lookup job function.

 

[https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryHelpers.java#L182]

[https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteTables.java#L444]

 

> Beam does not set correct region for BigQuery when requesting load job status
> -
>
> Key: BEAM-10248
> URL: https://issues.apache.org/jira/browse/BEAM-10248
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Affects Versions: 2.22.0
> Environment: -Beam 2.22.0
> -DirectRunner & DataflowRunner
> -Java 11
>Reporter: Graham Polley
>Assignee: Chamikara Madhusanka Jayalath
>Priority: P1
> Attachments: Untitled document.pdf
>
>
> I am using `FILE_LOADS` (micro-batching) from Pub/Sub and writing to 
> BigQuery. My BigQuery dataset is in region `australia-southeast1`.
> If the load job into BigQuery fails for some reason (e.g. table does not 
> exist, or schema has changed), then an error from BigQuery is returned for 
> the load.
> Beam then enters into a loop of retrying the load job. However, instead of 
> using a new job id (where the suffix is incremented by 1), it wrongly tries 
> to reinsert the job using the same job id. This is because when it tries to 
> look up if the job id already exists, it does not take into consideration the 
> region where the dataset is. Instead, it defaults to the US but the job was 
> created in `australia-southeast1`, so it returns `null`.
> Bug is on LN308 of `BigQueryHelpers.java`. It should not return `null`. It 
> returns `null` because it's looking in the wrong region.
> If I *test* using a dataset in BigQuery that is in the US region, then it 
> correctly finds the job id and begins to retry the job with a new job id 
> (suffixed with the retry count). We have seen this bug in other areas of Beam 
> before and in other tools/services on GCP e.g. Cloud Composer.
> However, that leads me to my next problem/bug. Even if that is fixed, the 
> number of retries is set to `Integer.MAX_VALUE`, so Beam will keep retrying 
> the job 2,147,483,647 times. This is not good.
> The exception is swallowed up by the Beam SDK and never propagated back up 
> the stack for users to catch and handle. So, if a load job fails there is no 
> way to handle it and react for users.
> I attempted to set `withFailedInsertRetryPolicy()` to transient errors only, 
> but it is not supported with `FILE_LOADS`. I also tried, using the 
> `WriteResult` object returned from the bigQuery sink/write, to get a handle 
> on the error but it does not work. Users need a way to catch and handle failed 
> load jobs when using `FILE_LOADS`.
>  
> {code:java}
> public class TemplatePipeline {
>  private static final String TOPIC = 
> "projects/etl-demo-269105/topics/test-micro-batching";
>  private static final String BIGQUERY_DESTINATION_FILTERED = 
> "etl-demo-269105:etl_demo_fun.micro_batch_test_xxx";
>  public static void main(String[] args) throws Exception {
>  try {
>  PipelineOptionsFactory.register(DataflowPipelineOptions.class);
>  DataflowPipelineOptions options = PipelineOptionsFactory
>  .fromArgs(args)
>  .withoutStrictParsing()
>  .as(DataflowPipelineOptions.class);
>  Pipeline pipeline = Pipeline.create(options);
>  PCollection messages = pipeline
>  .apply(PubsubIO.readMessagesWithAttributes().fromTopic(String.format(TOPIC, 
> options.getProject())))
>  .apply(Window.into(FixedWindows.of(Duration.standardSeconds(1))));
>  WriteResult result = messages.apply(ParDo.of(new RowToBQRow()))
>  .apply(BigQueryIO.writeTableRows()
>  .to(String.format(BIGQUERY_DESTINATION_FILTERED, options.getProject()))
>  .withCreateDisposition(CREATE_NEVER)
>  .withWriteDisposition(WRITE_APPEND)
>  .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
>  .withTriggeringFrequency(Duration.standardSeconds(5))
>  .withNumFileShards(1)
>  .withExtendedErrorInfo()
>  .withSchema(getTableSchema()));
> // result.getFailedInsertsWithErr().apply(ParDo.of(new 
> DoFn() {
> // @ProcessElement
> // public void processElement(ProcessContext c) {
> // for(ErrorProto err : c.element().getError().getErrors()){
> // throw new RuntimeException(err.getMessage());
> // }
> //
> // }
> // }));
>  result.getFailedInserts().apply(ParDo.of(new DoFn() {
>  @ProcessElement
>  public void processElement(ProcessContext c) {
>  System.out.println(c.element());
>  c.ou

[jira] [Commented] (BEAM-10529) Kafka XLang fails for "empty" key/values

2020-07-30 Thread Chamikara Madhusanka Jayalath (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-10529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168303#comment-17168303
 ] 

Chamikara Madhusanka Jayalath commented on BEAM-10529:
--

I think the fix here will be a bit more complex than just changing the coder to 
NullableCoder.of(ByteArrayCoder.of()).

NullableCoder is not a standard coder, so some runners may break when it's set 
at the cross-language boundary. Should we be adding something like 
NullableCoder to standard coders?

 

cc: [~robertwb]
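For intuition, the null-handling such a coder adds can be sketched outside Beam as a one-byte marker in front of the inner value; this is an illustration of the wire-level idea only, not Beam's actual NullableCoder implementation:

```python
from typing import Optional

def encode_nullable(value: Optional[bytes]) -> bytes:
    """Prefix a one-byte null marker: 0x00 means null, 0x01 precedes a value."""
    if value is None:
        return b"\x00"
    return b"\x01" + value

def decode_nullable(data: bytes) -> Optional[bytes]:
    """Invert encode_nullable, recovering None or the original byte string."""
    if data[0] == 0:
        return None
    return data[1:]
```

Both sides of the cross-language boundary have to agree on this marker byte, which is why a coder that is not in the standard set can break runners there.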

> Kafka XLang fails for "empty" key/values
> 
>
> Key: BEAM-10529
> URL: https://issues.apache.org/jira/browse/BEAM-10529
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-kafka, sdk-py-core
>Reporter: Luke Cwik
>Priority: P2
>
> It looks like the Javadoc for ByteArrayDeserializer and StringDeserializer 
> can return null[1, 2] and we aren't using 
> NullableCoder.of(ByteArrayCoder.of()) in the expansion[3]. Note that KafkaIO 
> does this correctly in its regular coder inference logic[4].
> 1: 
> https://kafka.apache.org/21/javadoc/org/apache/kafka/common/serialization/ByteArrayDeserializer.html#deserialize-java.lang.String-byte:A-2:
>  
> https://kafka.apache.org/21/javadoc/org/apache/kafka/common/serialization/StringDeserializer.html#deserialize-java.lang.String-byte:A-
> 3: 
> https://github.com/apache/beam/blob/af2d6b0379d64b522ecb769d88e9e7e7b8900208/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L478
> 4: 
> https://github.com/apache/beam/blob/af2d6b0379d64b522ecb769d88e9e7e7b8900208/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/LocalDeserializerProvider.java#L85



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-10248) Beam does not set correct region for BigQuery when requesting load job status

2020-07-31 Thread Chamikara Madhusanka Jayalath (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-10248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168495#comment-17168495
 ] 

Chamikara Madhusanka Jayalath commented on BEAM-10248:
--

Actually, we are ignoring the location when looking up a job. Sent out 
[https://github.com/apache/beam/pull/12431].
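At the REST level the fix amounts to passing the job's region when polling it; here is a hypothetical helper that builds the documented `jobs.get` URL (the function name and defaulting behavior are illustrative):

```python
from urllib.parse import urlencode

def job_status_url(project_id: str, job_id: str, location: str = "") -> str:
    """Build a BigQuery jobs.get URL, appending the region when it is known.

    Leaving out `location` lets the service fall back to a default region,
    which is how a job created in australia-southeast1 can look missing.
    """
    base = ("https://bigquery.googleapis.com/bigquery/v2/"
            f"projects/{project_id}/jobs/{job_id}")
    if location:
        return base + "?" + urlencode({"location": location})
    return base
```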

> Beam does not set correct region for BigQuery when requesting load job status
> -
>
> Key: BEAM-10248
> URL: https://issues.apache.org/jira/browse/BEAM-10248
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Affects Versions: 2.22.0
> Environment: -Beam 2.22.0
> -DirectRunner & DataflowRunner
> -Java 11
>Reporter: Graham Polley
>Assignee: Chamikara Madhusanka Jayalath
>Priority: P1
> Attachments: Untitled document.pdf
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> I am using `FILE_LOADS` (micro-batching) from Pub/Sub and writing to 
> BigQuery. My BigQuery dataset is in region `australia-southeast1`.
> If the load job into BigQuery fails for some reason (e.g. table does not 
> exist, or schema has changed), then an error from BigQuery is returned for 
> the load.
> Beam then enters into a loop of retrying the load job. However, instead of 
> using a new job id (where the suffix is incremented by 1), it wrongly tries 
> to reinsert the job using the same job id. This is because when it tries to 
> look up if the job id already exists, it does not take into consideration the 
> region where the dataset is. Instead, it defaults to the US but the job was 
> created in `australia-southeast1`, so it returns `null`.
> Bug is on LN308 of `BigQueryHelpers.java`. It should not return `null`. It 
> returns `null` because it's looking in the wrong region.
> If I *test* using a dataset in BigQuery that is in the US region, then it 
> correctly finds the job id and begins to retry the job with a new job id 
> (suffixed with the retry count). We have seen this bug in other areas of Beam 
> before and in other tools/services on GCP e.g. Cloud Composer.
> However, that leads me to my next problem/bug. Even if that is fixed, the 
> number of retries is set to `Integer.MAX_VALUE`, so Beam will keep retrying 
> the job 2,147,483,647 times. This is not good.
> The exception is swallowed up by the Beam SDK and never propagated back up 
> the stack for users to catch and handle. So, if a load job fails there is no 
> way to handle it and react for users.
> I attempted to set `withFailedInsertRetryPolicy()` to transient errors only, 
> but it is not supported with `FILE_LOADS`. I also tried, using the 
> `WriteResult` object returned from the BigQuery sink/write, to get a handle 
> on the error but it does not work. Users need a way to catch and handle 
> failed load jobs when using `FILE_LOADS`.
>  
> {code:java}
> public class TemplatePipeline {
>  private static final String TOPIC = 
> "projects/etl-demo-269105/topics/test-micro-batching";
>  private static final String BIGQUERY_DESTINATION_FILTERED = 
> "etl-demo-269105:etl_demo_fun.micro_batch_test_xxx";
>  public static void main(String[] args) throws Exception {
>  try {
>  PipelineOptionsFactory.register(DataflowPipelineOptions.class);
>  DataflowPipelineOptions options = PipelineOptionsFactory
>  .fromArgs(args)
>  .withoutStrictParsing()
>  .as(DataflowPipelineOptions.class);
>  Pipeline pipeline = Pipeline.create(options);
>  PCollection<PubsubMessage> messages = pipeline
>  .apply(PubsubIO.readMessagesWithAttributes().fromTopic(String.format(TOPIC, 
> options.getProject())))
>  .apply(Window.into(FixedWindows.of(Duration.standardSeconds(1))));
>  WriteResult result = messages.apply(ParDo.of(new RowToBQRow()))
>  .apply(BigQueryIO.writeTableRows()
>  .to(String.format(BIGQUERY_DESTINATION_FILTERED, options.getProject()))
>  .withCreateDisposition(CREATE_NEVER)
>  .withWriteDisposition(WRITE_APPEND)
>  .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
>  .withTriggeringFrequency(Duration.standardSeconds(5))
>  .withNumFileShards(1)
>  .withExtendedErrorInfo()
>  .withSchema(getTableSchema()));
> // result.getFailedInsertsWithErr().apply(ParDo.of(new 
> DoFn() {
> // @ProcessElement
> // public void processElement(ProcessContext c) {
> // for(ErrorProto err : c.element().getError().getErrors()){
> // throw new RuntimeException(err.getMessage());
> // }
> //
> // }
> // }));
>  result.getFailedInserts().apply(ParDo.of(new DoFn<TableRow, String>() {
>  @ProcessElement
>  public void processElement(ProcessContext c) {
>  System.out.println(c.element());
>  c.output("foo");
>  throw new RuntimeException("Failed to load");
>  }
>  @FinishBundle
>  public void finishUp(FinishBundleContext finishBundleContext) {
>  System.out.println("Got here");
>  }
>  }));
>  pipeline.run();
>  } catch (Exception e)

[jira] [Commented] (BEAM-10507) support xlang custom window fn

2020-07-31 Thread Chamikara Madhusanka Jayalath (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-10507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168986#comment-17168986
 ] 

Chamikara Madhusanka Jayalath commented on BEAM-10507:
--

Do you have an error?

I think the issue is probably that x-lang WindowFns are currently not supported 
by Dataflow.

> support xlang custom window fn
> --
>
> Key: BEAM-10507
> URL: https://issues.apache.org/jira/browse/BEAM-10507
> Project: Beam
>  Issue Type: Bug
>  Components: cross-language
>Reporter: Heejong Lee
>Assignee: Heejong Lee
>Priority: P2
>
> support xlang custom window fn



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-10248) Beam does not set correct region for BigQuery when requesting load job status

2020-07-31 Thread Chamikara Madhusanka Jayalath (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-10248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chamikara Madhusanka Jayalath updated BEAM-10248:
-
Fix Version/s: 2.24.0

> Beam does not set correct region for BigQuery when requesting load job status
> -
>
> Key: BEAM-10248
> URL: https://issues.apache.org/jira/browse/BEAM-10248
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Affects Versions: 2.22.0
> Environment: -Beam 2.22.0
> -DirectRunner & DataflowRunner
> -Java 11
>Reporter: Graham Polley
>Assignee: Chamikara Madhusanka Jayalath
>Priority: P1
> Fix For: 2.24.0
>
> Attachments: Untitled document.pdf
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> I am using `FILE_LOADS` (micro-batching) from Pub/Sub and writing to 
> BigQuery. My BigQuery dataset is in region `australia-southeast1`.
> If the load job into BigQuery fails for some reason (e.g. table does not 
> exist, or schema has changed), then an error from BigQuery is returned for 
> the load.
> Beam then enters into a loop of retrying the load job. However, instead of 
> using a new job id (where the suffix is incremented by 1), it wrongly tries 
> to reinsert the job using the same job id. This is because when it tries to 
> look up if the job id already exists, it does not take into consideration the 
> region where the dataset is. Instead, it defaults to the US but the job was 
> created in `australia-southeast1`, so it returns `null`.
> Bug is on LN308 of `BigQueryHelpers.java`. It should not return `null`. It 
> returns `null` because it's looking in the wrong region.
> If I *test* using a dataset in BigQuery that is in the US region, then it 
> correctly finds the job id and begins to retry the job with a new job id 
> (suffixed with the retry count). We have seen this bug in other areas of Beam 
> before and in other tools/services on GCP e.g. Cloud Composer.
> However, that leads me to my next problem/bug. Even if that is fixed, the 
> number of retries is set to `Integer.MAX_VALUE`, so Beam will keep retrying 
> the job 2,147,483,647 times. This is not good.
> The exception is swallowed up by the Beam SDK and never propagated back up 
> the stack for users to catch and handle. So, if a load job fails there is no 
> way to handle it and react for users.
> I attempted to set `withFailedInsertRetryPolicy()` to transient errors only, 
> but it is not supported with `FILE_LOADS`. I also tried, using the 
> `WriteResult` object returned from the BigQuery sink/write, to get a handle 
> on the error but it does not work. Users need a way to catch and handle 
> failed load jobs when using `FILE_LOADS`.
>  
> {code:java}
> public class TemplatePipeline {
>  private static final String TOPIC = 
> "projects/etl-demo-269105/topics/test-micro-batching";
>  private static final String BIGQUERY_DESTINATION_FILTERED = 
> "etl-demo-269105:etl_demo_fun.micro_batch_test_xxx";
>  public static void main(String[] args) throws Exception {
>  try {
>  PipelineOptionsFactory.register(DataflowPipelineOptions.class);
>  DataflowPipelineOptions options = PipelineOptionsFactory
>  .fromArgs(args)
>  .withoutStrictParsing()
>  .as(DataflowPipelineOptions.class);
>  Pipeline pipeline = Pipeline.create(options);
>  PCollection<PubsubMessage> messages = pipeline
>  .apply(PubsubIO.readMessagesWithAttributes().fromTopic(String.format(TOPIC, 
> options.getProject())))
>  .apply(Window.into(FixedWindows.of(Duration.standardSeconds(1))));
>  WriteResult result = messages.apply(ParDo.of(new RowToBQRow()))
>  .apply(BigQueryIO.writeTableRows()
>  .to(String.format(BIGQUERY_DESTINATION_FILTERED, options.getProject()))
>  .withCreateDisposition(CREATE_NEVER)
>  .withWriteDisposition(WRITE_APPEND)
>  .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
>  .withTriggeringFrequency(Duration.standardSeconds(5))
>  .withNumFileShards(1)
>  .withExtendedErrorInfo()
>  .withSchema(getTableSchema()));
> // result.getFailedInsertsWithErr().apply(ParDo.of(new 
> DoFn() {
> // @ProcessElement
> // public void processElement(ProcessContext c) {
> // for(ErrorProto err : c.element().getError().getErrors()){
> // throw new RuntimeException(err.getMessage());
> // }
> //
> // }
> // }));
>  result.getFailedInserts().apply(ParDo.of(new DoFn<TableRow, String>() {
>  @ProcessElement
>  public void processElement(ProcessContext c) {
>  System.out.println(c.element());
>  c.output("foo");
>  throw new RuntimeException("Failed to load");
>  }
>  @FinishBundle
>  public void finishUp(FinishBundleContext finishBundleContext) {
>  System.out.println("Got here");
>  }
>  }));
>  pipeline.run();
>  } catch (Exception e) {
>  e.printStackTrace();
>  throw new Exception(e);
>  }
>  }
>  private static TableSchema getTableSchema() {
>  L

[jira] [Updated] (BEAM-10248) Beam does not set correct region for BigQuery when requesting load job status

2020-07-31 Thread Chamikara Madhusanka Jayalath (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-10248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chamikara Madhusanka Jayalath updated BEAM-10248:
-
Status: Resolved  (was: Triage Needed)

> Beam does not set correct region for BigQuery when requesting load job status
> -
>
> Key: BEAM-10248
> URL: https://issues.apache.org/jira/browse/BEAM-10248
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Affects Versions: 2.22.0
> Environment: -Beam 2.22.0
> -DirectRunner & DataflowRunner
> -Java 11
>Reporter: Graham Polley
>Assignee: Chamikara Madhusanka Jayalath
>Priority: P1
> Fix For: 2.24.0
>
> Attachments: Untitled document.pdf
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> I am using `FILE_LOADS` (micro-batching) from Pub/Sub and writing to 
> BigQuery. My BigQuery dataset is in region `australia-southeast1`.
> If the load job into BigQuery fails for some reason (e.g. table does not 
> exist, or schema has changed), then an error from BigQuery is returned for 
> the load.
> Beam then enters into a loop of retrying the load job. However, instead of 
> using a new job id (where the suffix is incremented by 1), it wrongly tries 
> to reinsert the job using the same job id. This is because when it tries to 
> look up if the job id already exists, it does not take into consideration the 
> region where the dataset is. Instead, it defaults to the US but the job was 
> created in `australia-southeast1`, so it returns `null`.
> Bug is on LN308 of `BigQueryHelpers.java`. It should not return `null`. It 
> returns `null` because it's looking in the wrong region.
> If I *test* using a dataset in BigQuery that is in the US region, then it 
> correctly finds the job id and begins to retry the job with a new job id 
> (suffixed with the retry count). We have seen this bug in other areas of Beam 
> before and in other tools/services on GCP e.g. Cloud Composer.
> However, that leads me to my next problem/bug. Even if that is fixed, the 
> number of retries is set to `Integer.MAX_VALUE`, so Beam will keep retrying 
> the job 2,147,483,647 times. This is not good.
> The exception is swallowed up by the Beam SDK and never propagated back up 
> the stack for users to catch and handle. So, if a load job fails there is no 
> way to handle it and react for users.
> I attempted to set `withFailedInsertRetryPolicy()` to transient errors only, 
> but it is not supported with `FILE_LOADS`. I also tried, using the 
> `WriteResult` object returned from the BigQuery sink/write, to get a handle 
> on the error but it does not work. Users need a way to catch and handle 
> failed load jobs when using `FILE_LOADS`.
>  
> {code:java}
> public class TemplatePipeline {
>  private static final String TOPIC = 
> "projects/etl-demo-269105/topics/test-micro-batching";
>  private static final String BIGQUERY_DESTINATION_FILTERED = 
> "etl-demo-269105:etl_demo_fun.micro_batch_test_xxx";
>  public static void main(String[] args) throws Exception {
>  try {
>  PipelineOptionsFactory.register(DataflowPipelineOptions.class);
>  DataflowPipelineOptions options = PipelineOptionsFactory
>  .fromArgs(args)
>  .withoutStrictParsing()
>  .as(DataflowPipelineOptions.class);
>  Pipeline pipeline = Pipeline.create(options);
>  PCollection<PubsubMessage> messages = pipeline
>  .apply(PubsubIO.readMessagesWithAttributes().fromTopic(String.format(TOPIC, 
> options.getProject())))
>  .apply(Window.into(FixedWindows.of(Duration.standardSeconds(1))));
>  WriteResult result = messages.apply(ParDo.of(new RowToBQRow()))
>  .apply(BigQueryIO.writeTableRows()
>  .to(String.format(BIGQUERY_DESTINATION_FILTERED, options.getProject()))
>  .withCreateDisposition(CREATE_NEVER)
>  .withWriteDisposition(WRITE_APPEND)
>  .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
>  .withTriggeringFrequency(Duration.standardSeconds(5))
>  .withNumFileShards(1)
>  .withExtendedErrorInfo()
>  .withSchema(getTableSchema()));
> // result.getFailedInsertsWithErr().apply(ParDo.of(new 
> DoFn() {
> // @ProcessElement
> // public void processElement(ProcessContext c) {
> // for(ErrorProto err : c.element().getError().getErrors()){
> // throw new RuntimeException(err.getMessage());
> // }
> //
> // }
> // }));
>  result.getFailedInserts().apply(ParDo.of(new DoFn<TableRow, String>() {
>  @ProcessElement
>  public void processElement(ProcessContext c) {
>  System.out.println(c.element());
>  c.output("foo");
>  throw new RuntimeException("Failed to load");
>  }
>  @FinishBundle
>  public void finishUp(FinishBundleContext finishBundleContext) {
>  System.out.println("Got here");
>  }
>  }));
>  pipeline.run();
>  } catch (Exception e) {
>  e.printStackTrace();
>  throw new Exception(e);
>  }
>  }
>  private static TableSchema getTab

[jira] [Commented] (BEAM-10248) Beam does not set correct region for BigQuery when requesting load job status

2020-07-31 Thread Chamikara Madhusanka Jayalath (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-10248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17169069#comment-17169069
 ] 

Chamikara Madhusanka Jayalath commented on BEAM-10248:
--

Fix for [1] was submitted and I confirmed that [2] is the intended behavior (we 
retry three times and fail for batch but will keep retrying the workitem for 
Dataflow streaming).  Resolving this.

[https://github.com/apache/beam/pull/12431]

 

As mentioned before [3] is a known limitation of BQ sink.
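The intended retry shape described above — a fresh job id per attempt, with a small bound for batch — can be sketched as follows; the names and the bound are illustrative, not Beam's actual helpers:

```python
def next_job_id(base_id: str, attempt: int) -> str:
    """Derive a per-attempt job id so each retry is submitted under a new id."""
    return f"{base_id}_{attempt}"

def run_load_with_retries(submit, base_id: str, max_attempts: int = 3):
    """Submit a load job, retrying under fresh job ids up to max_attempts times."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return submit(next_job_id(base_id, attempt))
        except RuntimeError as error:  # stand-in for a failed load job
            last_error = error
    raise last_error
```

The location-aware lookup from the fix is what makes the "new id per attempt" step possible: only when the previous job is actually found can its attempt suffix be incremented instead of reused.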

> Beam does not set correct region for BigQuery when requesting load job status
> -
>
> Key: BEAM-10248
> URL: https://issues.apache.org/jira/browse/BEAM-10248
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Affects Versions: 2.22.0
> Environment: -Beam 2.22.0
> -DirectRunner & DataflowRunner
> -Java 11
>Reporter: Graham Polley
>Assignee: Chamikara Madhusanka Jayalath
>Priority: P1
> Attachments: Untitled document.pdf
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> I am using `FILE_LOADS` (micro-batching) from Pub/Sub and writing to 
> BigQuery. My BigQuery dataset is in region `australia-southeast1`.
> If the load job into BigQuery fails for some reason (e.g. table does not 
> exist, or schema has changed), then an error from BigQuery is returned for 
> the load.
> Beam then enters into a loop of retrying the load job. However, instead of 
> using a new job id (where the suffix is incremented by 1), it wrongly tries 
> to reinsert the job using the same job id. This is because when it tries to 
> look up if the job id already exists, it does not take into consideration the 
> region where the dataset is. Instead, it defaults to the US but the job was 
> created in `australia-southeast1`, so it returns `null`.
> Bug is on LN308 of `BigQueryHelpers.java`. It should not return `null`. It 
> returns `null` because it's looking in the wrong region.
> If I *test* using a dataset in BigQuery that is in the US region, then it 
> correctly finds the job id and begins to retry the job with a new job id 
> (suffixed with the retry count). We have seen this bug in other areas of Beam 
> before and in other tools/services on GCP e.g. Cloud Composer.
> However, that leads me to my next problem/bug. Even if that is fixed, the 
> number of retries is set to `Integer.MAX_VALUE`, so Beam will keep retrying 
> the job 2,147,483,647 times. This is not good.
> The exception is swallowed up by the Beam SDK and never propagated back up 
> the stack for users to catch and handle. So, if a load job fails there is no 
> way to handle it and react for users.
> I attempted to set `withFailedInsertRetryPolicy()` to transient errors only, 
> but it is not supported with `FILE_LOADS`. I also tried, using the 
> `WriteResult` object returned from the BigQuery sink/write, to get a handle 
> on the error but it does not work. Users need a way to catch and handle 
> failed load jobs when using `FILE_LOADS`.
>  
> {code:java}
> public class TemplatePipeline {
>  private static final String TOPIC = 
> "projects/etl-demo-269105/topics/test-micro-batching";
>  private static final String BIGQUERY_DESTINATION_FILTERED = 
> "etl-demo-269105:etl_demo_fun.micro_batch_test_xxx";
>  public static void main(String[] args) throws Exception {
>  try {
>  PipelineOptionsFactory.register(DataflowPipelineOptions.class);
>  DataflowPipelineOptions options = PipelineOptionsFactory
>  .fromArgs(args)
>  .withoutStrictParsing()
>  .as(DataflowPipelineOptions.class);
>  Pipeline pipeline = Pipeline.create(options);
>  PCollection<PubsubMessage> messages = pipeline
>  .apply(PubsubIO.readMessagesWithAttributes().fromTopic(String.format(TOPIC, 
> options.getProject())))
>  .apply(Window.into(FixedWindows.of(Duration.standardSeconds(1))));
>  WriteResult result = messages.apply(ParDo.of(new RowToBQRow()))
>  .apply(BigQueryIO.writeTableRows()
>  .to(String.format(BIGQUERY_DESTINATION_FILTERED, options.getProject()))
>  .withCreateDisposition(CREATE_NEVER)
>  .withWriteDisposition(WRITE_APPEND)
>  .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
>  .withTriggeringFrequency(Duration.standardSeconds(5))
>  .withNumFileShards(1)
>  .withExtendedErrorInfo()
>  .withSchema(getTableSchema()));
> // result.getFailedInsertsWithErr().apply(ParDo.of(new 
> DoFn() {
> // @ProcessElement
> // public void processElement(ProcessContext c) {
> // for(ErrorProto err : c.element().getError().getErrors()){
> // throw new RuntimeException(err.getMessage());
> // }
> //
> // }
> // }));
>  result.getFailedInserts().apply(ParDo.of(new DoFn<TableRow, String>() {
>  @ProcessElement
>  public void processElement(ProcessContext c) {
>  System.out.println(c.element());
>  c.output("foo");
>  throw new RuntimeException("Failed to l

[jira] [Commented] (BEAM-10248) Beam does not set correct region for BigQuery when requesting load job status

2020-08-03 Thread Chamikara Madhusanka Jayalath (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-10248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17170127#comment-17170127
 ] 

Chamikara Madhusanka Jayalath commented on BEAM-10248:
--

The exception is not just swallowed; it is also logged in the worker logs, right?

Regarding a user hook, currently this can be done by observing the BQ jobs 
started by Dataflow separately, for example using the bq CLI. We are 
standardizing the naming pattern for BQ jobs started by a given Dataflow job 
[1] so that might help identify the correct set of BQ jobs for a given Dataflow 
job.

I understand that it might be easier if users can directly programmatically 
handle failed load jobs from a pipeline but this is a new feature that has to 
be designed separately. [~reuvenlax] for thoughts here.

 

[1] [https://github.com/apache/beam/pull/12082]
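Observing the load jobs out-of-band, as suggested, could look like the sketch below; the `beam_bq_job_LOAD_` prefix is hypothetical and would need to match whatever pattern [1] standardizes:

```python
def jobs_for_dataflow_job(all_job_ids, dataflow_job_name: str):
    """Filter BigQuery job ids down to those that embed a Dataflow job name.

    The id list itself could come from `bq ls -j` or the jobs.list API.
    """
    prefix = f"beam_bq_job_LOAD_{dataflow_job_name}"
    return [job_id for job_id in all_job_ids if job_id.startswith(prefix)]
```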

 

> Beam does not set correct region for BigQuery when requesting load job status
> -
>
> Key: BEAM-10248
> URL: https://issues.apache.org/jira/browse/BEAM-10248
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Affects Versions: 2.22.0
> Environment: -Beam 2.22.0
> -DirectRunner & DataflowRunner
> -Java 11
>Reporter: Graham Polley
>Assignee: Chamikara Madhusanka Jayalath
>Priority: P1
> Fix For: 2.24.0
>
> Attachments: Untitled document.pdf
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> I am using `FILE_LOADS` (micro-batching) from Pub/Sub and writing to 
> BigQuery. My BigQuery dataset is in region `australia-southeast1`.
> If the load job into BigQuery fails for some reason (e.g. table does not 
> exist, or schema has changed), then an error from BigQuery is returned for 
> the load.
> Beam then enters into a loop of retrying the load job. However, instead of 
> using a new job id (where the suffix is incremented by 1), it wrongly tries 
> to reinsert the job using the same job id. This is because when it tries to 
> look up if the job id already exists, it does not take into consideration the 
> region where the dataset is. Instead, it defaults to the US but the job was 
> created in `australia-southeast1`, so it returns `null`.
> Bug is on LN308 of `BigQueryHelpers.java`. It should not return `null`. It 
> returns `null` because it's looking in the wrong region.
> If I *test* using a dataset in BigQuery that is in the US region, then it 
> correctly finds the job id and begins to retry the job with a new job id 
> (suffixed with the retry count). We have seen this bug in other areas of Beam 
> before and in other tools/services on GCP e.g. Cloud Composer.
> However, that leads me to my next problem/bug. Even if that is fixed, the 
> number of retries is set to `Integer.MAX_VALUE`, so Beam will keep retrying 
> the job 2,147,483,647 times. This is not good.
> The exception is swallowed up by the Beam SDK and never propagated back up 
> the stack for users to catch and handle. So, if a load job fails there is no 
> way to handle it and react for users.
> I attempted to set `withFailedInsertRetryPolicy()` to transient errors only, 
> but it is not supported with `FILE_LOADS`. I also tried, using the 
> `WriteResult` object returned from the BigQuery sink/write, to get a handle 
> on the error but it does not work. Users need a way to catch and handle 
> failed load jobs when using `FILE_LOADS`.
>  
> {code:java}
> public class TemplatePipeline {
>  private static final String TOPIC = 
> "projects/etl-demo-269105/topics/test-micro-batching";
>  private static final String BIGQUERY_DESTINATION_FILTERED = 
> "etl-demo-269105:etl_demo_fun.micro_batch_test_xxx";
>  public static void main(String[] args) throws Exception {
>  try {
>  PipelineOptionsFactory.register(DataflowPipelineOptions.class);
>  DataflowPipelineOptions options = PipelineOptionsFactory
>  .fromArgs(args)
>  .withoutStrictParsing()
>  .as(DataflowPipelineOptions.class);
>  Pipeline pipeline = Pipeline.create(options);
>  PCollection<PubsubMessage> messages = pipeline
>  .apply(PubsubIO.readMessagesWithAttributes().fromTopic(String.format(TOPIC, 
> options.getProject())))
>  .apply(Window.into(FixedWindows.of(Duration.standardSeconds(1))));
>  WriteResult result = messages.apply(ParDo.of(new RowToBQRow()))
>  .apply(BigQueryIO.writeTableRows()
>  .to(String.format(BIGQUERY_DESTINATION_FILTERED, options.getProject()))
>  .withCreateDisposition(CREATE_NEVER)
>  .withWriteDisposition(WRITE_APPEND)
>  .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
>  .withTriggeringFrequency(Duration.standardSeconds(5))
>  .withNumFileShards(1)
>  .withExtendedErrorInfo()
>  .withSchema(getTableSchema()));
> // result.getFailedInsertsWithErr().apply(ParDo.of(new 
> DoFn() {
> // @ProcessElement
> // public void processElement(ProcessCont

[jira] [Created] (BEAM-10411) Add an example for cross-language Kafka

2020-07-06 Thread Chamikara Madhusanka Jayalath (Jira)
Chamikara Madhusanka Jayalath created BEAM-10411:


 Summary: Add an example for cross-language Kafka
 Key: BEAM-10411
 URL: https://issues.apache.org/jira/browse/BEAM-10411
 Project: Beam
  Issue Type: Improvement
  Components: io-py-kafka
Reporter: Chamikara Madhusanka Jayalath
Assignee: Chamikara Madhusanka Jayalath






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-7014) Flake in gcsio.py / filesystemio.py - NotImplementedError: offset: 0, whence: 0

2020-07-08 Thread Chamikara Madhusanka Jayalath (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-7014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17154243#comment-17154243
 ] 

Chamikara Madhusanka Jayalath commented on BEAM-7014:
-

Another possible occurrence: 
[https://stackoverflow.com/questions/62805319/dataflow-job-fails-with-httperror-notimplementederror/62809015#62809015]

[~udim] will you be able to look into this? Possibly we need to further extend 
the seek() implementation here:

[https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/filesystemio.py#L286]
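For reference, handling all three whence modes over a tracked position might look like this standalone sketch (illustrative only, not the actual filesystemio.py code):

```python
import io

class PositionTracker:
    """Track a logical read position, supporting the three whence modes."""

    def __init__(self, size: int):
        self.size = size
        self.position = 0

    def seek(self, offset: int, whence: int = io.SEEK_SET) -> int:
        if whence == io.SEEK_SET:      # absolute offset from the start
            new_pos = offset
        elif whence == io.SEEK_CUR:    # relative to the current position
            new_pos = self.position + offset
        elif whence == io.SEEK_END:    # relative to the end of the stream
            new_pos = self.size + offset
        else:
            raise ValueError(f"unsupported whence: {whence}")
        if new_pos < 0:
            raise ValueError("negative seek position")
        self.position = new_pos
        return self.position
```

The traceback above shows apitools calling seek() with offset 0 and whence 0 to rewind a resumable upload, so a streaming wrapper that only supports forward motion raises NotImplementedError there.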

 

 

> Flake in gcsio.py / filesystemio.py - NotImplementedError: offset: 0, whence: 0
> ---
>
> Key: BEAM-7014
> URL: https://issues.apache.org/jira/browse/BEAM-7014
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Priority: P1
>  Labels: beam-fixit, flake
>
> The flake was observed in Precommit Direct Runner IT (wordcount).
> Full log output: https://pastebin.com/raw/DP5J7Uch.
> {noformat}
> Traceback (most recent call last):
> 08:42:57   File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify_PR/src/sdks/python/apache_beam/io/gcp/gcsio.py",
>  line 583, in _start_upload
> 08:42:57 self._client.objects.Insert(self._insert_request, 
> upload=self._upload)
> 08:42:57   File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify_PR/src/sdks/python/apache_beam/io/gcp/internal/clients/storage/storage_v1_client.py",
>  line 1154, in Insert
> 08:42:57 upload=upload, upload_config=upload_config)
> 08:42:57   File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify_PR/src/build/gradleenv/1327086738/local/lib/python2.7/site-packages/apitools/base/py/base_api.py",
>  line 715, in _RunMethod
> 08:42:57 http_request, client=self.client)
> 08:42:57   File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify_PR/src/build/gradleenv/1327086738/local/lib/python2.7/site-packages/apitools/base/py/transfer.py",
>  line 885, in InitializeUpload
> 08:42:57 return self.StreamInChunks()
> 08:42:57   File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify_PR/src/build/gradleenv/1327086738/local/lib/python2.7/site-packages/apitools/base/py/transfer.py",
>  line 997, in StreamInChunks
> 08:42:57 additional_headers=additional_headers)
> 08:42:57   File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify_PR/src/build/gradleenv/1327086738/local/lib/python2.7/site-packages/apitools/base/py/transfer.py",
>  line 948, in __StreamMedia
> 08:42:57 self.RefreshResumableUploadState()
> 08:42:57   File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify_PR/src/build/gradleenv/1327086738/local/lib/python2.7/site-packages/apitools/base/py/transfer.py",
>  line 850, in RefreshResumableUploadState
> 08:42:57 self.stream.seek(self.progress)
> 08:42:57   File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify_PR/src/sdks/python/apache_beam/io/filesystemio.py",
>  line 269, in seek
> 08:42:57 offset, whence, self.position, self.last_position))
> 08:42:57 NotImplementedError: offset: 0, whence: 0, position: 48944, last: 0
> {noformat}
> [~chamikara] Might have context to triage this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-10638) Update documentation of BQ InsertRetryPolicy

2020-08-04 Thread Chamikara Madhusanka Jayalath (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-10638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17171091#comment-17171091
 ] 

Chamikara Madhusanka Jayalath commented on BEAM-10638:
--

Heejong, assigning to you since you are pretty familiar with this codepath. 
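For background, the distinction the docs need to draw — per-row failures that a retry policy can inspect versus whole-batch rejections — can be sketched as follows; the error reason strings are examples only, not an exhaustive or authoritative list:

```python
# Example transient error reasons; the real set is defined by the BigQuery
# service and by the SDK's retry policy, not by this sketch.
TRANSIENT_REASONS = {"backendError", "internalError", "rateLimitExceeded", "timeout"}

def should_retry_row(insert_errors) -> bool:
    """Retry a failed row only when every reported error reason looks transient.

    This models the per-element path; when BigQuery rejects the whole batch
    (e.g. over the 10MB request limit) no per-row errors exist to inspect,
    so a policy like this never gets consulted.
    """
    return bool(insert_errors) and all(
        error.get("reason") in TRANSIENT_REASONS for error in insert_errors)
```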

> Update documentation of BQ InsertRetryPolicy
> 
>
> Key: BEAM-10638
> URL: https://issues.apache.org/jira/browse/BEAM-10638
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Reporter: Udi Meiri
>Assignee: Heejong Lee
>Priority: P2
>
> Clarify that it only applies to batches where BigQuery returns per-element 
> failures.
> It does not apply in cases where the entire batch is rejected, such as for 
> exceeding the 10MB size limit.
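To illustrate the distinction the description asks to document: an InsertRetryPolicy-style callback is consulted only for per-row errors returned in an insert response, while a whole-request rejection (e.g. exceeding the 10MB size limit) surfaces before any per-row errors exist. A hedged stdlib sketch, with all names hypothetical rather than Beam or BigQuery APIs:

```python
# All names here are hypothetical illustrations, not Beam or BigQuery APIs.
MAX_REQUEST_BYTES = 10 * 1024 * 1024  # BigQuery's per-request size limit

def retry_transient_rows(per_row_errors):
    """Stand-in for an InsertRetryPolicy: retry only rows whose failure
    reason looks transient."""
    transient = ('backendError', 'rateLimitExceeded')
    return [e['index'] for e in per_row_errors if e['reason'] in transient]

def insert_batch(rows, request_size):
    if request_size > MAX_REQUEST_BYTES:
        # The whole request is rejected: there are no per-row errors,
        # so the retry policy is never consulted.
        raise ValueError('request too large: %d bytes' % request_size)
    # Simulated per-row failures, for which the policy IS consulted.
    per_row_errors = [{'index': 1, 'reason': 'backendError'},
                      {'index': 2, 'reason': 'invalid'}]
    return retry_transient_rows(per_row_errors)

print(insert_batch(['a', 'b', 'c'], request_size=1024))  # → [1]
```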



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (BEAM-10638) Update documentation of BQ InsertRetryPolicy

2020-08-04 Thread Chamikara Madhusanka Jayalath (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-10638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chamikara Madhusanka Jayalath reassigned BEAM-10638:


Assignee: Heejong Lee

> Update documentation of BQ InsertRetryPolicy
> 
>
> Key: BEAM-10638
> URL: https://issues.apache.org/jira/browse/BEAM-10638
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Reporter: Udi Meiri
>Assignee: Heejong Lee
>Priority: P2
>
> Clarify that it only applies to batches where BigQuery returns per-element 
> failures.
> This may happen in cases where the entire batch is rejected, such as for 
> exceeding the 10MB size limit.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-10655) NullPointerException when writing "google.protobuf.Timestamp" to BigQuery through schemas

2020-08-06 Thread Chamikara Madhusanka Jayalath (Jira)
Chamikara Madhusanka Jayalath created BEAM-10655:


 Summary: NullPointerException when writing 
"google.protobuf.Timestamp" to BigQuery through schemas
 Key: BEAM-10655
 URL: https://issues.apache.org/jira/browse/BEAM-10655
 Project: Beam
  Issue Type: Improvement
  Components: io-java-gcp
Reporter: Chamikara Madhusanka Jayalath


See here for details and instructions for reproducing.

[https://lists.apache.org/thread.html/r1657ccf41dda3f2d9d082c5ebb006dd2da92863983971ad23485f16e%40%3Cdev.beam.apache.org%3E]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-10655) NullPointerException when writing "google.protobuf.Timestamp" to BigQuery through schemas

2020-08-06 Thread Chamikara Madhusanka Jayalath (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-10655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17172675#comment-17172675
 ] 

Chamikara Madhusanka Jayalath commented on BEAM-10655:
--

cc: [~robinyqiu], who has updated BigQueryIO + Beam Schema support recently.

> NullPointerException when writing "google.protobuf.Timestamp" to BigQuery 
> through schemas
> -
>
> Key: BEAM-10655
> URL: https://issues.apache.org/jira/browse/BEAM-10655
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-gcp
>Reporter: Chamikara Madhusanka Jayalath
>Priority: P2
>
> See here for details and instructions for reproducing.
> [https://lists.apache.org/thread.html/r1657ccf41dda3f2d9d082c5ebb006dd2da92863983971ad23485f16e%40%3Cdev.beam.apache.org%3E]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-9932) Add documentation describing cross-language test pipelines

2020-08-07 Thread Chamikara Madhusanka Jayalath (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chamikara Madhusanka Jayalath updated BEAM-9932:

Fix Version/s: 2.24.0

> Add documentation describing cross-language test pipelines
> --
>
> Key: BEAM-9932
> URL: https://issues.apache.org/jira/browse/BEAM-9932
> Project: Beam
>  Issue Type: Improvement
>  Components: cross-language
>Reporter: Chamikara Madhusanka Jayalath
>Assignee: Kevin Sijo Puthusseri
>Priority: P2
> Fix For: 2.24.0
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> We designed cross-language test pipelines [1][2] based on the discussion in 
> [3].
> Adding some pydocs and Java docs regarding the rationale behind each pipeline 
> will be helpful.
> [1] 
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/validate_runner_xlang_test.py]
> [2] 
> [https://github.com/apache/beam/blob/master/runners/core-construction-java/src/test/java/org/apache/beam/runners/core/construction/ValidateRunnerXlangTest.java]
>  [3] 
> [https://docs.google.com/document/d/1xQp0ElIV84b8OCVz8CD2hvbiWdR8w4BvWxPTZJZA6NA/edit?usp=sharing]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-9932) Add documentation describing cross-language test pipelines

2020-08-07 Thread Chamikara Madhusanka Jayalath (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chamikara Madhusanka Jayalath updated BEAM-9932:

Status: Resolved  (was: In Progress)

> Add documentation describing cross-language test pipelines
> --
>
> Key: BEAM-9932
> URL: https://issues.apache.org/jira/browse/BEAM-9932
> Project: Beam
>  Issue Type: Improvement
>  Components: cross-language
>Reporter: Chamikara Madhusanka Jayalath
>Assignee: Kevin Sijo Puthusseri
>Priority: P2
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> We designed cross-language test pipelines [1][2] based on the discussion in 
> [3].
> Adding some pydocs and Java docs regarding the rationale behind each pipeline 
> will be helpful.
> [1] 
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/validate_runner_xlang_test.py]
> [2] 
> [https://github.com/apache/beam/blob/master/runners/core-construction-java/src/test/java/org/apache/beam/runners/core/construction/ValidateRunnerXlangTest.java]
>  [3] 
> [https://docs.google.com/document/d/1xQp0ElIV84b8OCVz8CD2hvbiWdR8w4BvWxPTZJZA6NA/edit?usp=sharing]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-10529) Kafka XLang fails for "empty" key/values

2020-08-07 Thread Chamikara Madhusanka Jayalath (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-10529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173425#comment-17173425
 ] 

Chamikara Madhusanka Jayalath commented on BEAM-10529:
--

cc: [~robertwb] [~bhulette] 

> Kafka XLang fails for "empty" key/values
> 
>
> Key: BEAM-10529
> URL: https://issues.apache.org/jira/browse/BEAM-10529
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-kafka, sdk-py-core
>Reporter: Luke Cwik
>Priority: P2
>
> It looks like the Javadoc for ByteArrayDeserializer and StringDeserializer 
> can return null[1, 2] and we aren't using 
> NullableCoder.of(ByteArrayCoder.of()) in the expansion[3]. Note that KafkaIO 
> does this correctly in its regular coder inference logic[4].
> 1: 
> https://kafka.apache.org/21/javadoc/org/apache/kafka/common/serialization/ByteArrayDeserializer.html#deserialize-java.lang.String-byte:A-
> 2: 
> https://kafka.apache.org/21/javadoc/org/apache/kafka/common/serialization/StringDeserializer.html#deserialize-java.lang.String-byte:A-
> 3: 
> https://github.com/apache/beam/blob/af2d6b0379d64b522ecb769d88e9e7e7b8900208/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L478
> 4: 
> https://github.com/apache/beam/blob/af2d6b0379d64b522ecb769d88e9e7e7b8900208/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/LocalDeserializerProvider.java#L85



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-10529) Kafka XLang fails for "empty" key/values

2020-08-07 Thread Chamikara Madhusanka Jayalath (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-10529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173431#comment-17173431
 ] 

Chamikara Madhusanka Jayalath commented on BEAM-10529:
--

Are the Python and Java nullable coders compatible? I.e., can you encode from 
Java and decode from Python and vice versa?
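For reference, both SDKs' NullableCoder implementations are understood to share the same wire format: a one-byte presence marker followed by the inner coder's bytes. A stdlib sketch under that assumption (helper names are illustrative, not Beam APIs):

```python
def encode_nullable(value, inner_encode):
    """Illustrative NullableCoder-style framing: 0x00 marks null;
    0x01 is followed by the inner encoding."""
    if value is None:
        return b'\x00'
    return b'\x01' + inner_encode(value)

def decode_nullable(data, inner_decode):
    if data[:1] == b'\x00':
        return None
    return inner_decode(data[1:])

# Round-trips regardless of which side produced the bytes, as long as
# both SDKs agree on this framing.
assert decode_nullable(encode_nullable(None, str.encode), bytes.decode) is None
assert decode_nullable(encode_nullable('key', str.encode), bytes.decode) == 'key'
```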

> Kafka XLang fails for "empty" key/values
> 
>
> Key: BEAM-10529
> URL: https://issues.apache.org/jira/browse/BEAM-10529
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-kafka, sdk-py-core
>Reporter: Luke Cwik
>Priority: P2
>
> It looks like the Javadoc for ByteArrayDeserializer and StringDeserializer 
> can return null[1, 2] and we aren't using 
> NullableCoder.of(ByteArrayCoder.of()) in the expansion[3]. Note that KafkaIO 
> does this correctly in its regular coder inference logic[4].
> 1: 
> https://kafka.apache.org/21/javadoc/org/apache/kafka/common/serialization/ByteArrayDeserializer.html#deserialize-java.lang.String-byte:A-
> 2: 
> https://kafka.apache.org/21/javadoc/org/apache/kafka/common/serialization/StringDeserializer.html#deserialize-java.lang.String-byte:A-
> 3: 
> https://github.com/apache/beam/blob/af2d6b0379d64b522ecb769d88e9e7e7b8900208/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L478
> 4: 
> https://github.com/apache/beam/blob/af2d6b0379d64b522ecb769d88e9e7e7b8900208/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/LocalDeserializerProvider.java#L85



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-9922) Add Go SDK tests to cross-language Spark ValidatesRunner test suite

2020-08-07 Thread Chamikara Madhusanka Jayalath (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chamikara Madhusanka Jayalath updated BEAM-9922:

Description: 
Test suite is here: [https://ci-beam.apache.org/job/beam_PostCommit_XVR_Spark/]

 

  was:
Test suite is here: 
[https://builds.apache.org/view/A-D/view/Beam/view/PostCommit/job/beam_PostCommit_XVR_Spark/]

 


> Add Go SDK tests to cross-language Spark ValidatesRunner test suite
> ---
>
> Key: BEAM-9922
> URL: https://issues.apache.org/jira/browse/BEAM-9922
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-go
>Reporter: Chamikara Madhusanka Jayalath
>Priority: P3
>
> Test suite is here: 
> [https://ci-beam.apache.org/job/beam_PostCommit_XVR_Spark/]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-9921) Add Go SDK tests to cross-language Flink ValidatesRunner test suite

2020-08-07 Thread Chamikara Madhusanka Jayalath (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chamikara Madhusanka Jayalath updated BEAM-9921:

Description: Test suite is here: 
[https://ci-beam.apache.org/job/beam_PostCommit_XVR_Flink/]  (was: Test suite 
is here: 
[https://builds.apache.org/view/A-D/view/Beam/view/PostCommit/job/beam_PostCommit_XVR_Flink/])

> Add Go SDK tests to cross-language Flink ValidatesRunner test suite
> ---
>
> Key: BEAM-9921
> URL: https://issues.apache.org/jira/browse/BEAM-9921
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-go
>Reporter: Chamikara Madhusanka Jayalath
>Priority: P3
>
> Test suite is here: 
> [https://ci-beam.apache.org/job/beam_PostCommit_XVR_Flink/]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (BEAM-9918) Cross-language transforms support for Go SDK

2020-08-07 Thread Chamikara Madhusanka Jayalath (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chamikara Madhusanka Jayalath reassigned BEAM-9918:
---

Assignee: (was: Kevin Sijo Puthusseri)

> Cross-language transforms support for Go SDK
> 
>
> Key: BEAM-9918
> URL: https://issues.apache.org/jira/browse/BEAM-9918
> Project: Beam
>  Issue Type: New Feature
>  Components: cross-language, sdk-go
>Reporter: Chamikara Madhusanka Jayalath
>Priority: P2
>
> This is an uber issue for tasks related to cross-language transforms support 
> for Go SDK. We can create sub-tasks as needed.
> cc: [~lostluck]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-10698) X-lang Kafka is broken for DataflowRunner due to timestamps being out of bound

2020-08-12 Thread Chamikara Madhusanka Jayalath (Jira)
Chamikara Madhusanka Jayalath created BEAM-10698:


 Summary: X-lang Kafka is broken for DataflowRunner due to 
timestamps being out of bound
 Key: BEAM-10698
 URL: https://issues.apache.org/jira/browse/BEAM-10698
 Project: Beam
  Issue Type: Bug
  Components: cross-language, io-py-kafka
Reporter: Chamikara Madhusanka Jayalath
Assignee: Boyuan Zhang


Seems like this is a regression introduced by 
[https://github.com/apache/beam/pull/11749]

I have a short term fix here: [https://github.com/apache/beam/pull/12557]

We should either:

(1) Include [https://github.com/apache/beam/pull/12557] in the release branch

(2) Find the underlying issue that produces wrong timestamps and do that fix

(3) Revert [https://github.com/apache/beam/pull/11749] from the 2.24.0 release 
branch.
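Option (2) amounts to making sure the timestamps the SDF reports stay within the runner's representable range. A hedged sketch of such clamping; the bound constants below are illustrative stand-ins, not Beam's exact BoundedWindow TIMESTAMP_MIN_VALUE/TIMESTAMP_MAX_VALUE:

```python
# Illustrative bounds only; Beam's real limits are the BoundedWindow
# TIMESTAMP_MIN_VALUE/TIMESTAMP_MAX_VALUE constants, not these values.
MIN_TIMESTAMP_MICROS = -(2 ** 63)
MAX_TIMESTAMP_MICROS = 2 ** 63 - 1

def clamp_timestamp(micros):
    """Clamp a record timestamp into the representable range before the
    SDF reports it, so the runner's bounds check cannot fail."""
    return max(MIN_TIMESTAMP_MICROS, min(micros, MAX_TIMESTAMP_MICROS))

assert clamp_timestamp(2 ** 70) == MAX_TIMESTAMP_MICROS   # clamped down
assert clamp_timestamp(12345) == 12345                    # in-range: unchanged
```

The short-term fix referenced above is understood to take this shape; the longer-term question is why out-of-range timestamps are produced in the first place.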



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-10698) X-lang Kafka is broken for DataflowRunner due to timestamps being out of bound

2020-08-12 Thread Chamikara Madhusanka Jayalath (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-10698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chamikara Madhusanka Jayalath updated BEAM-10698:
-
Fix Version/s: 2.24.0

> X-lang Kafka is broken for DataflowRunner due to timestamps being out of bound
> --
>
> Key: BEAM-10698
> URL: https://issues.apache.org/jira/browse/BEAM-10698
> Project: Beam
>  Issue Type: Bug
>  Components: cross-language, io-py-kafka
>Reporter: Chamikara Madhusanka Jayalath
>Assignee: Boyuan Zhang
>Priority: P1
> Fix For: 2.24.0
>
>
> Seems like this is a regression introduced by 
> [https://github.com/apache/beam/pull/11749]
> I have a short term fix here: [https://github.com/apache/beam/pull/12557]
> We should either:
> (1) Include [https://github.com/apache/beam/pull/12557] in the release branch
> (2) Find the underlying issue that produces wrong timestamps and do that fix
> (3) Revert [https://github.com/apache/beam/pull/11749] from the 2.24.0 
> release branch.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-10698) SDFs broken for DataflowRunner on UW due to timestamps being out of bound

2020-08-13 Thread Chamikara Madhusanka Jayalath (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-10698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chamikara Madhusanka Jayalath updated BEAM-10698:
-
Summary: SDFs broken for DataflowRunner on UW due to timestamps being out of 
bound  (was: X-lang Kafka is broken for DataflowRunner due to timestamps being 
out of bound)

> SDFs broken for DataflowRunner on UW due to timestamps being out of bound
> 
>
> Key: BEAM-10698
> URL: https://issues.apache.org/jira/browse/BEAM-10698
> Project: Beam
>  Issue Type: Bug
>  Components: cross-language, io-py-kafka
>Reporter: Chamikara Madhusanka Jayalath
>Assignee: Boyuan Zhang
>Priority: P1
> Fix For: 2.24.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Seems like this is a regression introduced by 
> [https://github.com/apache/beam/pull/11749]
> I have a short term fix here: [https://github.com/apache/beam/pull/12557]
> We should either:
> (1) Include [https://github.com/apache/beam/pull/12557] in the release branch
> (2) Find the underlying issue that produces wrong timestamps and do that fix
> (3) Revert [https://github.com/apache/beam/pull/11749] from the 2.24.0 
> release branch.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-10698) SDFs broken for DataflowRunner on UW due to timestamps being out of bound

2020-08-13 Thread Chamikara Madhusanka Jayalath (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-10698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chamikara Madhusanka Jayalath updated BEAM-10698:
-
Priority: P0  (was: P1)

> SDFs broken for DataflowRunner on UW due to timestamps being out of bound
> 
>
> Key: BEAM-10698
> URL: https://issues.apache.org/jira/browse/BEAM-10698
> Project: Beam
>  Issue Type: Bug
>  Components: cross-language, io-py-kafka
>Reporter: Chamikara Madhusanka Jayalath
>Assignee: Boyuan Zhang
>Priority: P0
> Fix For: 2.24.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Seems like this is a regression introduced by 
> [https://github.com/apache/beam/pull/11749]
> I have a short term fix here: [https://github.com/apache/beam/pull/12557]
> We should either:
> (1) Include [https://github.com/apache/beam/pull/12557] in the release branch
> (2) Find the underlying issue that produces wrong timestamps and do that fix
> (3) Revert [https://github.com/apache/beam/pull/11749] from the 2.24.0 
> release branch.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-10698) SDFs broken for Dataflow runner v2 due to timestamps being out of bound

2020-08-13 Thread Chamikara Madhusanka Jayalath (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-10698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chamikara Madhusanka Jayalath updated BEAM-10698:
-
Summary: SDFs broken for Dataflow runner v2 due to timestamps being out of 
bound  (was: SDFs broken for DataflowRunner on UW due to timestamps being out of 
bound)

> SDFs broken for Dataflow runner v2 due to timestamps being out of bound
> ---
>
> Key: BEAM-10698
> URL: https://issues.apache.org/jira/browse/BEAM-10698
> Project: Beam
>  Issue Type: Bug
>  Components: cross-language, io-py-kafka
>Reporter: Chamikara Madhusanka Jayalath
>Assignee: Boyuan Zhang
>Priority: P0
> Fix For: 2.24.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Seems like this is a regression introduced by 
> [https://github.com/apache/beam/pull/11749]
> I have a short term fix here: [https://github.com/apache/beam/pull/12557]
> We should either:
> (1) Include [https://github.com/apache/beam/pull/12557] in the release branch
> (2) Find the underlying issue that produces wrong timestamps and do that fix
> (3) Revert [https://github.com/apache/beam/pull/11749] from the 2.24.0 
> release branch.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-10698) SDFs broken for Dataflow runner v2 due to timestamps being out of bound

2020-08-13 Thread Chamikara Madhusanka Jayalath (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-10698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chamikara Madhusanka Jayalath updated BEAM-10698:
-
Component/s: runner-dataflow

> SDFs broken for Dataflow runner v2 due to timestamps being out of bound
> ---
>
> Key: BEAM-10698
> URL: https://issues.apache.org/jira/browse/BEAM-10698
> Project: Beam
>  Issue Type: Bug
>  Components: cross-language, io-py-kafka, runner-dataflow
>Reporter: Chamikara Madhusanka Jayalath
>Assignee: Boyuan Zhang
>Priority: P0
> Fix For: 2.24.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Seems like this is a regression introduced by 
> [https://github.com/apache/beam/pull/11749]
> I have a short term fix here: [https://github.com/apache/beam/pull/12557]
> We should either:
> (1) Include [https://github.com/apache/beam/pull/12557] in the release branch
> (2) Find the underlying issue that produces wrong timestamps and do that fix
> (3) Revert [https://github.com/apache/beam/pull/11749] from the 2.24.0 
> release branch.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-10716) PubSub related tests failing due to subscriptions-per-project exceeded limit:10000

2020-08-18 Thread Chamikara Madhusanka Jayalath (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-10716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17179988#comment-17179988
 ] 

Chamikara Madhusanka Jayalath commented on BEAM-10716:
--

I ran into this a number of times when running Python PreCommit. For example,

 

[https://ci-beam.apache.org/job/beam_PreCommit_Python_Phrase/2108/]

 

ERROR: test_streaming_wordcount_it 
(apache_beam.examples.streaming_wordcount_it_test.StreamingWordCountIT)
--
Traceback (most recent call last):
 File 
"/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Phrase/src/sdks/python/apache_beam/examples/streaming_wordcount_it_test.py",
 line 65, in setUp
 self.input_topic.name)
 File 
"/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Phrase/src/build/gradleenv/-1734967052/lib/python3.7/site-packages/google/cloud/pubsub_v1/_gapic.py",
 line 40, in <lambda>
 fx = lambda self, *a, **kw: wrapped_fx(self.api, *a, **kw) # noqa
 File 
"/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Phrase/src/build/gradleenv/-1734967052/lib/python3.7/site-packages/google/cloud/pubsub_v1/gapic/subscriber_client.py",
 line 439, in create_subscription
 request, retry=retry, timeout=timeout, metadata=metadata
 File 
"/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Phrase/src/build/gradleenv/-1734967052/lib/python3.7/site-packages/google/api_core/gapic_v1/method.py",
 line 145, in __call__
 return wrapped_func(*args, **kwargs)
 File 
"/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Phrase/src/build/gradleenv/-1734967052/lib/python3.7/site-packages/google/api_core/retry.py",
 line 286, in retry_wrapped_func
 on_error=on_error,
 File 
"/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Phrase/src/build/gradleenv/-1734967052/lib/python3.7/site-packages/google/api_core/retry.py",
 line 184, in retry_target
 return target()
 File 
"/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Phrase/src/build/gradleenv/-1734967052/lib/python3.7/site-packages/google/api_core/timeout.py",
 line 214, in func_with_timeout
 return func(*args, **kwargs)
 File 
"/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Phrase/src/build/gradleenv/-1734967052/lib/python3.7/site-packages/google/api_core/grpc_helpers.py",
 line 59, in error_remapped_callable
 six.raise_from(exceptions.from_grpc_error(exc), exc)
 File "<string>", line 3, in raise_from
google.api_core.exceptions.ResourceExhausted: 429 Your project has exceeded a 
limit: (type="subscriptions-per-project", current=1, maximum=1).

> PubSub related tests failing due to subscriptions-per-project exceeded 
> limit:10000
> --
>
> Key: BEAM-10716
> URL: https://issues.apache.org/jira/browse/BEAM-10716
> Project: Beam
>  Issue Type: Bug
>  Components: test-failures
>Reporter: Robin Qiu
>Assignee: Alan Myrvold
>Priority: P1
>  Labels: currently-failing
>
> Example: 
> [https://ci-beam.apache.org/job/beam_PostCommit_Python37/2729/#showFailuresLink]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (BEAM-9918) Cross-language transforms support for Go SDK

2020-06-15 Thread Chamikara Madhusanka Jayalath (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chamikara Madhusanka Jayalath reassigned BEAM-9918:
---

Assignee: Kevin Sijo Puthusseri

> Cross-language transforms support for Go SDK
> 
>
> Key: BEAM-9918
> URL: https://issues.apache.org/jira/browse/BEAM-9918
> Project: Beam
>  Issue Type: New Feature
>  Components: cross-language, sdk-go
>Reporter: Chamikara Madhusanka Jayalath
>Assignee: Kevin Sijo Puthusseri
>Priority: P2
>
> This is an uber issue for tasks related to cross-language transforms support 
> for Go SDK. We can create sub-tasks as needed.
> cc: [~lostluck]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (BEAM-6485) Cross-language transform expansion protocol

2020-06-16 Thread Chamikara Madhusanka Jayalath (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-6485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chamikara Madhusanka Jayalath resolved BEAM-6485.
-
Fix Version/s: Not applicable
   Resolution: Fixed

>  Cross-language transform expansion protocol
> 
>
> Key: BEAM-6485
> URL: https://issues.apache.org/jira/browse/BEAM-6485
> Project: Beam
>  Issue Type: New Feature
>  Components: beam-model, runner-core, sdk-java-core, sdk-py-core
>Reporter: Chamikara Madhusanka Jayalath
>Priority: P3
> Fix For: Not applicable
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-10271) Remove experimental annotations from stabilized GCP connector APIs

2020-06-17 Thread Chamikara Madhusanka Jayalath (Jira)
Chamikara Madhusanka Jayalath created BEAM-10271:


 Summary: Remove experimental annotations from stabilized GCP 
connector APIs
 Key: BEAM-10271
 URL: https://issues.apache.org/jira/browse/BEAM-10271
 Project: Beam
  Issue Type: Improvement
  Components: io-java-gcp, io-py-gcp
Reporter: Chamikara Madhusanka Jayalath






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (BEAM-9932) Add documentation describing cross-language test pipelines

2020-06-18 Thread Chamikara Madhusanka Jayalath (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chamikara Madhusanka Jayalath reassigned BEAM-9932:
---

Assignee: Kevin Sijo Puthusseri

> Add documentation describing cross-language test pipelines
> --
>
> Key: BEAM-9932
> URL: https://issues.apache.org/jira/browse/BEAM-9932
> Project: Beam
>  Issue Type: Improvement
>  Components: cross-language
>Reporter: Chamikara Madhusanka Jayalath
>Assignee: Kevin Sijo Puthusseri
>Priority: P2
>
> We designed cross-language test pipelines [1][2] based on the discussion in 
> [3].
> Adding some pydocs and Java docs regarding the rationale behind each pipeline 
> will be helpful.
> [1] 
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/validate_runner_xlang_test.py]
> [2] 
> [https://github.com/apache/beam/blob/master/runners/core-construction-java/src/test/java/org/apache/beam/runners/core/construction/ValidateRunnerXlangTest.java]
>  [3] 
> [https://docs.google.com/document/d/1xQp0ElIV84b8OCVz8CD2hvbiWdR8w4BvWxPTZJZA6NA/edit?usp=sharing]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-6514) Dataflow Batch Job Failure is leaving Datasets/Tables behind in BigQuery

2020-06-18 Thread Chamikara Madhusanka Jayalath (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-6514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17139950#comment-17139950
 ] 

Chamikara Madhusanka Jayalath commented on BEAM-6514:
-

This is because there's currently no good way to clean up failing Dataflow/Beam 
jobs. If the job is successful, the BQ connector will perform the cleanup; if 
the job fails, cleanup may have to be done manually.

> Dataflow Batch Job Failure is leaving Datasets/Tables behind in BigQuery
> 
>
> Key: BEAM-6514
> URL: https://issues.apache.org/jira/browse/BEAM-6514
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Reporter: Rumeshkrishnan Mohan
>Assignee: Chamikara Madhusanka Jayalath
>Priority: P2
>  Labels: stale-assigned
>
> Dataflow is leaving Datasets/Tables behind in BigQuery when the pipeline is 
> cancelled or when it fails. I cancelled a job or it failed at run time, and 
> it left behind a dataset and table in BigQuery.
>  # `cleanupTempResource` method involves cleaning tables and dataset after 
> batch job succeed.
>  # If job failed in the middle or cancelled explicitly, the temporary dataset 
> and tables remain exist. I do see the table expire period 1 day as per code 
> in `getTableToExtract` function written in BigQueryQuerySource.java.
>  # I can understand that, keep temp tables and dataset when failure for 
> debugging.
>  # Can we have pipeline or job optional parameters which get clean temporary 
> dataset and tables when cancel or fail ?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

