[jira] [Created] (BEAM-10861) Adds URNs and payloads to PubSub transforms
Chamikara Madhusanka Jayalath created BEAM-10861: Summary: Adds URNs and payloads to PubSub transforms Key: BEAM-10861 URL: https://issues.apache.org/jira/browse/BEAM-10861 Project: Beam Issue Type: Bug Components: cross-language, runner-dataflow, sdk-py-core Reporter: Chamikara Madhusanka Jayalath This is needed to allow runners to override the portable definitions of these transforms. -- This message was sent by Atlassian Jira (v8.3.4#803005)
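For context, the kind of runner-side override this enables can be sketched roughly as follows (hypothetical, simplified Python; the URN string, function names, and mapping shapes are illustrative and not Beam's actual API):

```python
# Hypothetical sketch of why transforms need stable URNs and payloads:
# a runner can recognize a transform by its URN in the portable pipeline
# proto and substitute a native implementation. All names are illustrative.
PUBSUB_READ_URN = "beam:transform:pubsub_read:v1"  # assumed URN, for illustration

def maybe_override(transform_urn, payload, native_impls):
    """Return a runner-native replacement for a recognized URN, else None."""
    if transform_urn in native_impls:
        # The payload carries the transform's configuration, so the runner
        # can reconstruct the user's intent without the SDK-level object.
        return native_impls[transform_urn](payload)
    return None
```

Without a URN and payload on the transform, the runner has no stable key to match on, which is what this issue addresses.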
[jira] [Commented] (BEAM-8376) Add FirestoreIO connector to Java SDK
[ https://issues.apache.org/jira/browse/BEAM-8376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17193256#comment-17193256 ] Chamikara Madhusanka Jayalath commented on BEAM-8376: - The Firestore team is currently looking into this. > Add FirestoreIO connector to Java SDK > - > > Key: BEAM-8376 > URL: https://issues.apache.org/jira/browse/BEAM-8376 > Project: Beam > Issue Type: New Feature > Components: io-java-gcp >Reporter: Stefan Djelekar >Priority: P3 > Time Spent: 4h 10m > Remaining Estimate: 0h > > Motivation: > There is no Firestore connector for the Java SDK at the moment. > Having it will enhance integration with database options on the Google > Cloud Platform. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-2717) Infer coders in SDK prior to handing off pipeline to Runner
[ https://issues.apache.org/jira/browse/BEAM-2717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17208215#comment-17208215 ] Chamikara Madhusanka Jayalath commented on BEAM-2717: - I'm not currently working on this. This sounds like a simplification we can do but not a blocker for portable job submission. > Infer coders in SDK prior to handing off pipeline to Runner > --- > > Key: BEAM-2717 > URL: https://issues.apache.org/jira/browse/BEAM-2717 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Robert Bradshaw >Priority: P1 > Time Spent: 3h 10m > Remaining Estimate: 0h > > Currently all runners have to duplicate this work, and there's also a hack > storing the element type rather than the coder in the Runner protos. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (BEAM-11026) When determining expansion service jar, depend on Maven coords instead of Gradle target
Chamikara Madhusanka Jayalath created BEAM-11026: Summary: When determining expansion service jar, depend on Maven coords instead of Gradle target Key: BEAM-11026 URL: https://issues.apache.org/jira/browse/BEAM-11026 Project: Beam Issue Type: Bug Components: cross-language, sdk-py-core Reporter: Chamikara Madhusanka Jayalath For example, here: https://github.com/apache/beam/blob/release-2.24.0/sdks/python/apache_beam/io/kafka.py#L107 Maven coordinates are more stable than the Gradle target name, so it's better to depend on the former. See [1] for details. [1] https://issues.apache.org/jira/browse/BEAM-10986 cc: [~robertwb] [~bhulette] [~kenn] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-11026) When determining expansion service jar, depend on Maven coords instead of Gradle target
[ https://issues.apache.org/jira/browse/BEAM-11026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Madhusanka Jayalath updated BEAM-11026: - Issue Type: Improvement (was: Bug) > When determining expansion service jar, depend on Maven coords instead of > Gradle target > --- > > Key: BEAM-11026 > URL: https://issues.apache.org/jira/browse/BEAM-11026 > Project: Beam > Issue Type: Improvement > Components: cross-language, sdk-py-core >Reporter: Chamikara Madhusanka Jayalath >Priority: P2 > > For example, here: > https://github.com/apache/beam/blob/release-2.24.0/sdks/python/apache_beam/io/kafka.py#L107 > Maven coordinates are more stable than the Gradle target name, so it's better to > depend on the former. See [1] for details. > [1] https://issues.apache.org/jira/browse/BEAM-10986 > > cc: [~robertwb] [~bhulette] [~kenn] -- This message was sent by Atlassian Jira (v8.3.4#803005)
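To illustrate the idea (hypothetical sketch, not Beam's actual jar-resolution code; the function name is made up), deriving the expansion service jar name from Maven coordinates rather than a Gradle target could look like:

```python
# Hypothetical sketch: deriving a released jar's file name from Maven
# coordinates ("group:artifact") plus a version, instead of hard-coding a
# Gradle target such as 'sdks:java:io:expansion-service:shadowJar'. Maven
# coordinates follow a stable naming convention, so the mapping is mechanical.
def jar_name_from_maven_coords(coords, version):
    """Map 'group:artifact' + a version to the conventional Maven jar name."""
    group, artifact = coords.split(':')
    return f'{artifact}-{version}.jar'
```

A caller could then locate the jar in a local Maven repository or download it from a Maven mirror, independent of how Beam's Gradle build is laid out.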
[jira] [Commented] (BEAM-10986) :build task doesn't build shaded jar with shadow >5.0.0
[ https://issues.apache.org/jira/browse/BEAM-10986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209129#comment-17209129 ] Chamikara Madhusanka Jayalath commented on BEAM-10986: -- Thanks for root-causing and fixing this, Brian. Filed https://issues.apache.org/jira/browse/BEAM-11026 for depending on Maven coordinates instead of the Gradle target. > :build task doesn't build shaded jar with shadow >5.0.0 > --- > > Key: BEAM-10986 > URL: https://issues.apache.org/jira/browse/BEAM-10986 > Project: Beam > Issue Type: Bug > Components: build-system >Reporter: Brian Hulette >Assignee: Brian Hulette >Priority: P0 > Fix For: 2.25.0 > > Time Spent: 5h 10m > Remaining Estimate: 0h > > Not 100% sure but I believe this will prevent us from releasing shaded jars. > See > [https://lists.apache.org/thread.html/r12c785c2b01017329204ad816740e80b70c19c5f2bb1ea01535a987d%40%3Cdev.beam.apache.org%3E] > for details. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (BEAM-11033) Update Dataflow metrics processor to handle portable jobs
Chamikara Madhusanka Jayalath created BEAM-11033: Summary: Update Dataflow metrics processor to handle portable jobs Key: BEAM-11033 URL: https://issues.apache.org/jira/browse/BEAM-11033 Project: Beam Issue Type: Bug Components: sdk-py-core Reporter: Chamikara Madhusanka Jayalath Assignee: Chamikara Madhusanka Jayalath Currently, the Dataflow metrics processor expects the Dataflow-internal step names generated for the v1beta3 job description in the metrics returned by the Dataflow service: [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/dataflow/dataflow_metrics.py#L97] But with portable job submission, Dataflow uses the PTransform ID (from the proto pipeline) as the internal step name, so the metrics processor should be updated to handle this. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-11033) Update Dataflow metrics processor to handle portable jobs
[ https://issues.apache.org/jira/browse/BEAM-11033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Madhusanka Jayalath updated BEAM-11033: - Status: Resolved (was: Open) > Update Dataflow metrics processor to handle portable jobs > - > > Key: BEAM-11033 > URL: https://issues.apache.org/jira/browse/BEAM-11033 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Chamikara Madhusanka Jayalath >Assignee: Chamikara Madhusanka Jayalath >Priority: P2 > Time Spent: 2h 20m > Remaining Estimate: 0h > > Currently, Dataflow metrics processor expects Dataflow internal step names > generated for v1beta3 job description in metrics returned by Dataflow > service: > [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/dataflow/dataflow_metrics.py#L97] > > But with portable job submission, Dataflow uses PTransform ID (in proto > pipeline) as the internal step name. Hence metrics processor should be > updated to handle this. -- This message was sent by Atlassian Jira (v8.3.4#803005)
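The change described above can be sketched roughly as follows (hypothetical, simplified Python; the function and mapping names are illustrative, not the actual `dataflow_metrics.py` code):

```python
# Hypothetical sketch: resolving a step name from a Dataflow metrics
# response. Legacy (v1beta3) jobs report generated internal step names,
# while portable jobs report the PTransform ID from the pipeline proto,
# so the processor has to try both mappings.
def resolve_user_step_name(metric_step, internal_to_user, transform_ids_to_user):
    # Legacy path: 'metric_step' is a generated internal step name (e.g. 's2').
    if metric_step in internal_to_user:
        return internal_to_user[metric_step]
    # Portable path: 'metric_step' is the PTransform ID from the proto pipeline.
    if metric_step in transform_ids_to_user:
        return transform_ids_to_user[metric_step]
    raise ValueError(f'Unknown step name: {metric_step}')
```

The key point is the fallback: the same metrics code path must map both naming schemes back to the user-visible transform name.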
[jira] [Commented] (BEAM-10904) Set sdk_harness_container_images property for all Dataflow Runner v2 jobs
[ https://issues.apache.org/jira/browse/BEAM-10904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17215536#comment-17215536 ] Chamikara Madhusanka Jayalath commented on BEAM-10904: -- This was done in https://github.com/apache/beam/pull/12853. > Set sdk_harness_container_images property for all Dataflow Runner v2 jobs > - > > Key: BEAM-10904 > URL: https://issues.apache.org/jira/browse/BEAM-10904 > Project: Beam > Issue Type: Bug > Components: runner-dataflow >Reporter: Chamikara Madhusanka Jayalath >Assignee: Chamikara Madhusanka Jayalath >Priority: P2 > Labels: stale-assigned > > Currently this is only set for x-lang jobs. > We can set this for all jobs now that corresponding Dataflow API is stable. > This will make container setup for x-lang and non-x-lang jobs for runner V2 > behave in a similar manner. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-10904) Set sdk_harness_container_images property for all Dataflow Runner v2 jobs
[ https://issues.apache.org/jira/browse/BEAM-10904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Madhusanka Jayalath updated BEAM-10904: - Status: Resolved (was: Open) > Set sdk_harness_container_images property for all Dataflow Runner v2 jobs > - > > Key: BEAM-10904 > URL: https://issues.apache.org/jira/browse/BEAM-10904 > Project: Beam > Issue Type: Bug > Components: runner-dataflow >Reporter: Chamikara Madhusanka Jayalath >Assignee: Chamikara Madhusanka Jayalath >Priority: P2 > Labels: stale-assigned > > Currently this is only set for x-lang jobs. > We can set this for all jobs now that corresponding Dataflow API is stable. > This will make container setup for x-lang and non-x-lang jobs for runner V2 > behave in a similar manner. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (BEAM-11002) XmlIO buffer overflow exception
[ https://issues.apache.org/jira/browse/BEAM-11002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Madhusanka Jayalath reassigned BEAM-11002: Assignee: Chamikara Madhusanka Jayalath > XmlIO buffer overflow exception > > > Key: BEAM-11002 > URL: https://issues.apache.org/jira/browse/BEAM-11002 > Project: Beam > Issue Type: Bug > Components: sdk-java-core >Affects Versions: 2.23.0, 2.24.0 >Reporter: Duncan Lew >Assignee: Chamikara Madhusanka Jayalath >Priority: P0 > Labels: Clarified > > We're making use of Apache Beam in Google Dataflow. > We're using XmlIO to read in an XML file with the following setup: > {code:java} > pipeline > .apply("Read Storage Bucket", > XmlIO.read() > .from(sourcePath) > .withRootElement(xmlProductRoot) > .withRecordElement(xmlProductRecord) > .withRecordClass(XmlProduct::class.java) > ) > {code} > However, from time to time, we get a buffer overflow exception when > reading random XML files: > {code:java} > "Error message from worker: java.io.IOException: Failed to start reading from > source: gs://path-to-xml-file.xml range [1722550, 2684411) > > org.apache.beam.runners.dataflow.worker.WorkerCustomSources$BoundedReaderIterator.start(WorkerCustomSources.java:610) > > org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation$SynchronizedReaderIterator.start(ReadOperation.java:359) > > org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:194) > > org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:159) > > org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:77) > > org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.executeWork(BatchDataflowWorker.java:417) > > org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.doWork(BatchDataflowWorker.java:386) > >
org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.getAndPerformWork(BatchDataflowWorker.java:311) > > org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.doWork(DataflowBatchWorkerHarness.java:140) > > org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:120) > > org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:107) > java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) > > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > java.base/java.lang.Thread.run(Thread.java:834) > Caused by: java.nio.BufferOverflowException > java.base/java.nio.Buffer.nextPutIndex(Buffer.java:662) > java.base/java.nio.HeapByteBuffer.put(HeapByteBuffer.java:196) > > org.apache.beam.sdk.io.xml.XmlSource$XMLReader.getFirstOccurenceOfRecordElement(XmlSource.java:285) > > org.apache.beam.sdk.io.xml.XmlSource$XMLReader.startReading(XmlSource.java:192) > > org.apache.beam.sdk.io.FileBasedSource$FileBasedReader.startImpl(FileBasedSource.java:476) > > org.apache.beam.sdk.io.OffsetBasedSource$OffsetBasedReader.start(OffsetBasedSource.java:249) > > org.apache.beam.runners.dataflow.worker.WorkerCustomSources$BoundedReaderIterator.start(WorkerCustomSources.java:607) > ... 14 more > {code} > We can't reproduce this buffer overflow exception locally with the > DirectRunner. If we rerun the Dataflow job in Google Cloud, it runs > correctly without any exceptions. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-11002) XmlIO buffer overflow exception
[ https://issues.apache.org/jira/browse/BEAM-11002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17217662#comment-17217662 ] Chamikara Madhusanka Jayalath commented on BEAM-11002: -- This should not be a P0 unless something was broken between releases. Is there an input that can be used to reproduce this? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-11002) XmlIO buffer overflow exception
[ https://issues.apache.org/jira/browse/BEAM-11002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Madhusanka Jayalath updated BEAM-11002: - Priority: P1 (was: P0) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-11002) XmlIO buffer overflow exception
[ https://issues.apache.org/jira/browse/BEAM-11002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Madhusanka Jayalath updated BEAM-11002: - Component/s: io-java-xml (was: sdk-java-core) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-11002) XmlIO buffer overflow exception
[ https://issues.apache.org/jira/browse/BEAM-11002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17217693#comment-17217693 ] Chamikara Madhusanka Jayalath commented on BEAM-11002: -- Based on the commit history, it doesn't look like anything in XmlIO changed recently: https://github.com/apache/beam/commits/master/sdks/java/io/xml/src/main/java/org/apache/beam/sdk/io/xml/XmlIO.java -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-10967) adding validate runner for Dataflow runner v2 to Java SDK
[ https://issues.apache.org/jira/browse/BEAM-10967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17217858#comment-17217858 ] Chamikara Madhusanka Jayalath commented on BEAM-10967: -- What's the next step here? The test suite does not seem to be running: https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_V2/ > adding validate runner for Dataflow runner v2 to Java SDK > - > > Key: BEAM-10967 > URL: https://issues.apache.org/jira/browse/BEAM-10967 > Project: Beam > Issue Type: Improvement > Components: runner-dataflow, sdk-java-harness >Reporter: Heejong Lee >Assignee: Heejong Lee >Priority: P2 > Labels: Clarified > Time Spent: 4h 40m > Remaining Estimate: 0h > > adding validate runner for Dataflow runner v2 to Java SDK (Java harness) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-10967) adding validate runner for Dataflow runner v2 to Java SDK
[ https://issues.apache.org/jira/browse/BEAM-10967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17217891#comment-17217891 ] Chamikara Madhusanka Jayalath commented on BEAM-10967: -- Thanks. cc: [~kenn] [~tysonjh] > adding validate runner for Dataflow runner v2 to Java SDK > - > > Key: BEAM-10967 > URL: https://issues.apache.org/jira/browse/BEAM-10967 > Project: Beam > Issue Type: Improvement > Components: runner-dataflow, sdk-java-harness >Reporter: Heejong Lee >Assignee: Heejong Lee >Priority: P2 > Labels: Clarified > Time Spent: 4h 40m > Remaining Estimate: 0h > > adding validate runner for Dataflow runner v2 to Java SDK (Java harness) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-10967) adding validate runner for Dataflow runner v2 to Java SDK
[ https://issues.apache.org/jira/browse/BEAM-10967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17217951#comment-17217951 ] Chamikara Madhusanka Jayalath commented on BEAM-10967: -- Assigning to Tyson to disable the new/failing tests here and resolve this when appropriate. > adding validate runner for Dataflow runner v2 to Java SDK > - > > Key: BEAM-10967 > URL: https://issues.apache.org/jira/browse/BEAM-10967 > Project: Beam > Issue Type: Improvement > Components: runner-dataflow, sdk-java-harness >Reporter: Heejong Lee >Assignee: Heejong Lee >Priority: P2 > Labels: Clarified > Time Spent: 4h 40m > Remaining Estimate: 0h > > adding validate runner for Dataflow runner v2 to Java SDK (Java harness) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (BEAM-10967) adding validate runner for Dataflow runner v2 to Java SDK
[ https://issues.apache.org/jira/browse/BEAM-10967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Madhusanka Jayalath reassigned BEAM-10967: Assignee: Tyson Hamilton (was: Heejong Lee) > adding validate runner for Dataflow runner v2 to Java SDK > - > > Key: BEAM-10967 > URL: https://issues.apache.org/jira/browse/BEAM-10967 > Project: Beam > Issue Type: Improvement > Components: runner-dataflow, sdk-java-harness >Reporter: Heejong Lee >Assignee: Tyson Hamilton >Priority: P2 > Labels: Clarified > Time Spent: 4h 40m > Remaining Estimate: 0h > > adding validate runner for Dataflow runner v2 to Java SDK (Java harness) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (BEAM-10776) Unwanted JDK jars staged when running cross-language pipelines
Chamikara Madhusanka Jayalath created BEAM-10776: Summary: Unwanted JDK jars staged when running cross-language pipelines Key: BEAM-10776 URL: https://issues.apache.org/jira/browse/BEAM-10776 Project: Beam Issue Type: Bug Components: cross-language Reporter: Chamikara Madhusanka Jayalath When running cross-language Kafka on Dataflow, I see the following jars being staged:
INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/nashorn-BJZNQ7N8Lsfq-WSM0IMsRCwFMC3RIxBOEjrlB1YwKOw.jar...
INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/nashorn-BJZNQ7N8Lsfq-WSM0IMsRCwFMC3RIxBOEjrlB1YwKOw.jar in 40 seconds.
INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/cldrdata-aZ6XIS6LfPilqVFbS_bWm1wMWGm3jxtjh0vjlRuqp5M.jar...
INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/cldrdata-aZ6XIS6LfPilqVFbS_bWm1wMWGm3jxtjh0vjlRuqp5M.jar in 177 seconds.
INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/jfxrt-B2UJQqvuEI-15FPV1mcdw80YRUIDMg1Kr82FxWK_DZ8.jar...
INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/jfxrt-B2UJQqvuEI-15FPV1mcdw80YRUIDMg1Kr82FxWK_DZ8.jar in 285 seconds.
INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/dnsns-zNxWyUaaHIkUFJRt-aNZudjc3eroySNUeRkxdxidGbY.jar...
INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/dnsns-zNxWyUaaHIkUFJRt-aNZudjc3eroySNUeRkxdxidGbY.jar in 0 seconds. INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/localedata-Wt0bN9j6XmIH4BaRLouHZX6p6iIoQsbZ2AkomxZTOYM.jar... INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/localedata-Wt0bN9j6XmIH4BaRLouHZX6p6iIoQsbZ2AkomxZTOYM.jar in 16 seconds. INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/jaccess-5wlKULhaKWM_gmKVtH_QBwVqH4awlxxRdNNfz0z0Imw.jar... INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/jaccess-5wlKULhaKWM_gmKVtH_QBwVqH4awlxxRdNNfz0z0Imw.jar in 0 seconds. INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/MRJToolkit-jU5qhDBc0cNjn7g3yrGHYO78BRC09T-sE8Syqo9mRjg.jar... INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/MRJToolkit-jU5qhDBc0cNjn7g3yrGHYO78BRC09T-sE8Syqo9mRjg.jar in 0 seconds. INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to gs://clouddfe-chamikara/temp/kafka-taxi-20200820-132559.1597955225.717180/beam-sdks-java-io-expansion-service-2.24.0-SNAPSHOT-A94br32q87Prj7b_mG4_kPEdz9NSJ-0NwgHWEwwU4Qc.jar... Out of these we just need 'beam-sdks-java-io-expansion-service-2.24.0-SNAPSHOT-A94br32q87Prj7b_mG4_kPEdz9NSJ-0NwgHWEwwU4Qc.jar'. Rest seems to be due to us including all jars from classpath in the expansion service response. 
[https://github.com/apache/beam/blob/master/sdks/java/expansion-service/src/main/java/org/apache/beam/sdk/expansion/service/ExpansionService.java#L407] We should figure out a way to filter out these additional jars. -- This message was sent by Atlassian Jira (v8.3.4#803005)
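One possible direction for the filtering mentioned above: drop any classpath entry that resolves to a path under java.home before building the artifact list, so JDK extension jars (nashorn, cldrdata, jfxrt, etc.) are never offered as pipeline artifacts. A minimal sketch — the JarFilter class and its method names are hypothetical, not actual Beam code; the real change would live in ExpansionService's artifact resolution:

```java
import java.io.File;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch only: "JarFilter" is not part of Beam.
public class JarFilter {

  // True if a classpath entry resolves to a path under the JVM installation
  // directory (java.home), e.g. jre/lib/ext/nashorn.jar on a Mac JDK 8.
  public static boolean isUnderJavaHome(String entry, String javaHome) {
    Path p = Paths.get(entry).toAbsolutePath().normalize();
    Path home = Paths.get(javaHome).toAbsolutePath().normalize();
    return p.startsWith(home);
  }

  // Splits a raw classpath string and keeps only entries outside java.home.
  public static List<String> filterClasspath(String classpath, String javaHome) {
    return Arrays.stream(classpath.split(File.pathSeparator))
        .filter(e -> !e.isEmpty())
        .filter(e -> !isUnderJavaHome(e, javaHome))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // Print the jars that would actually be staged for this JVM.
    filterClasspath(
            System.getProperty("java.class.path"), System.getProperty("java.home"))
        .forEach(System.out::println);
  }
}
```

On a Mac JDK 8 setup this should leave only the expansion-service jar itself, since the unwanted jars in the log above all live under jre/lib/ext.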
[jira] [Commented] (BEAM-10776) Unwanted JDK jars staged when running cross-language pipelines
[ https://issues.apache.org/jira/browse/BEAM-10776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181446#comment-17181446 ] Chamikara Madhusanka Jayalath commented on BEAM-10776: -- cc: [~heejong] [~robertwb] [~lcwik] any ideas on a way to determine the correct set of jars to stage without pulling in additional jars from the classpath? > Unwanted JDK jars staged when running cross-language pipelines > -- > > Key: BEAM-10776 > URL: https://issues.apache.org/jira/browse/BEAM-10776 > Project: Beam > Issue Type: Bug > Components: cross-language >Reporter: Chamikara Madhusanka Jayalath >Priority: P2 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-10698) SDFs broken for Dataflow runner v2 due to timestamps being out of bound
[ https://issues.apache.org/jira/browse/BEAM-10698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181966#comment-17181966 ] Chamikara Madhusanka Jayalath commented on BEAM-10698: -- Yes, this was resolved. > SDFs broken for Dataflow runner v2 due to timestamps being out of bound > --- > > Key: BEAM-10698 > URL: https://issues.apache.org/jira/browse/BEAM-10698 > Project: Beam > Issue Type: Bug > Components: cross-language, io-py-kafka, runner-dataflow >Reporter: Chamikara Madhusanka Jayalath >Assignee: Boyuan Zhang >Priority: P0 > Fix For: 2.24.0 > > Time Spent: 4.5h > Remaining Estimate: 0h > > This seems to be a regression introduced by > [https://github.com/apache/beam/pull/11749] > I have a short-term fix here: [https://github.com/apache/beam/pull/12557] > We should either: > (1) Include [https://github.com/apache/beam/pull/12557] in the release branch > (2) Find the underlying issue that produces wrong timestamps and fix that > (3) Revert [https://github.com/apache/beam/pull/11749] from the 2.24.0 > release branch. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-10698) SDFs broken for Dataflow runner v2 due to timestamps being out of bound
[ https://issues.apache.org/jira/browse/BEAM-10698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Madhusanka Jayalath updated BEAM-10698: - Status: Resolved (was: Triage Needed) > SDFs broken for Dataflow runner v2 due to timestamps being out of bound > --- > > Key: BEAM-10698 > URL: https://issues.apache.org/jira/browse/BEAM-10698 > Project: Beam > Issue Type: Bug > Components: cross-language, io-py-kafka, runner-dataflow >Reporter: Chamikara Madhusanka Jayalath >Assignee: Boyuan Zhang >Priority: P0 > Fix For: 2.24.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-10776) Unwanted JDK jars staged when running cross-language pipelines
[ https://issues.apache.org/jira/browse/BEAM-10776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17182064#comment-17182064 ] Chamikara Madhusanka Jayalath commented on BEAM-10776: -- Thanks Luke. The issue might be Mac-specific. On Linux I only see the 'beam-sdks-java-io-expansion-service-2.24.0-SNAPSHOT-' jar being staged, as expected. I'm using JDK 8. The expansion service is started as a separate process. > Unwanted JDK jars staged when running cross-language pipelines > -- > > Key: BEAM-10776 > URL: https://issues.apache.org/jira/browse/BEAM-10776 > Project: Beam > Issue Type: Bug > Components: cross-language >Reporter: Chamikara Madhusanka Jayalath >Priority: P2 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-10776) Unwanted JDK jars staged when running cross-language pipelines
[ https://issues.apache.org/jira/browse/BEAM-10776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17182198#comment-17182198 ] Chamikara Madhusanka Jayalath commented on BEAM-10776: -- I didn't run into any runtime issues, but it takes a long time to stage all the jars, so the user experience seems pretty bad for Mac users. > Unwanted JDK jars staged when running cross-language pipelines > -- > > Key: BEAM-10776 > URL: https://issues.apache.org/jira/browse/BEAM-10776 > Project: Beam > Issue Type: Bug > Components: cross-language >Reporter: Chamikara Madhusanka Jayalath >Priority: P2 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-10776) Unwanted JDK jars staged when running cross-language pipelines
[ https://issues.apache.org/jira/browse/BEAM-10776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17182204#comment-17182204 ] Chamikara Madhusanka Jayalath commented on BEAM-10776: -- Does this mean that we always stage these additional jars for production Dataflow Java users on Mac today, or is that path different somehow? > Unwanted JDK jars staged when running cross-language pipelines > -- > > Key: BEAM-10776 > URL: https://issues.apache.org/jira/browse/BEAM-10776 > Project: Beam > Issue Type: Bug > Components: cross-language >Reporter: Chamikara Madhusanka Jayalath >Priority: P2 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-10776) Unwanted JDK jars staged when running cross-language pipelines
[ https://issues.apache.org/jira/browse/BEAM-10776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17182205#comment-17182205 ] Chamikara Madhusanka Jayalath commented on BEAM-10776: -- (Yes, additional jars are not required. They are just picked up from the CLASSPATH and staged unnecessarily). > Unwanted JDK jars staged when running cross-language pipelines > -- > > Key: BEAM-10776 > URL: https://issues.apache.org/jira/browse/BEAM-10776 > Project: Beam > Issue Type: Bug > Components: cross-language >Reporter: Chamikara Madhusanka Jayalath >Priority: P2 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-10411) Add an example for cross-language Kafka
[ https://issues.apache.org/jira/browse/BEAM-10411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Madhusanka Jayalath updated BEAM-10411: - Status: Resolved (was: Open) > Add an example for cross-language Kafka > --- > > Key: BEAM-10411 > URL: https://issues.apache.org/jira/browse/BEAM-10411 > Project: Beam > Issue Type: Improvement > Components: io-py-kafka >Reporter: Chamikara Madhusanka Jayalath >Assignee: Chamikara Madhusanka Jayalath >Priority: P2 > Labels: stale-assigned > Time Spent: 3h 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-10507) support xlang custom window fn
[ https://issues.apache.org/jira/browse/BEAM-10507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17184844#comment-17184844 ] Chamikara Madhusanka Jayalath commented on BEAM-10507: -- We should also add tests that use external standard and custom window fns to x-lang test suites. Seems like we don't have such tests currently: [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/validate_runner_xlang_test.py] > support xlang custom window fn > -- > > Key: BEAM-10507 > URL: https://issues.apache.org/jira/browse/BEAM-10507 > Project: Beam > Issue Type: Bug > Components: cross-language >Reporter: Heejong Lee >Assignee: Heejong Lee >Priority: P2 > > support xlang custom window fn -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-10663) Python postcommit fails after BEAM-9977 #11749 merge
[ https://issues.apache.org/jira/browse/BEAM-10663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186119#comment-17186119 ] Chamikara Madhusanka Jayalath commented on BEAM-10663: -- I think the new SDF Kafka implementation introduced multiple regressions for Dataflow. One was fixed by [https://github.com/apache/beam/pull/12585] but there's another data-loss error that has not been fixed yet. Given that 2.23.0 used the SDF BoundedSource wrapper, I don't think using the same for 2.24.0 can be considered a regression. IMO we should try to fix the issues with the new SDF Kafka implementation so it can be enabled for 2.25.0. > Python postcommit fails after BEAM-9977 #11749 merge > > > Key: BEAM-10663 > URL: https://issues.apache.org/jira/browse/BEAM-10663 > Project: Beam > Issue Type: Bug > Components: cross-language, io-py-kafka, test-failures >Affects Versions: 2.24.0 >Reporter: Piotr Szuberski >Assignee: Boyuan Zhang >Priority: P0 > Fix For: 2.24.0 > > Time Spent: 3h 50m > Remaining Estimate: 0h > > Python postcommits fail on CrossLanguageKafkaIO python tests after #11749 > (BEAM-9977) merge. 
> > Fragment of the stack trace: > ``` > Caused by: java.util.concurrent.ExecutionException: > java.lang.RuntimeException: Error received from SDK harness for instruction > 2: java.util.concurrent.ExecutionException: java.lang.RuntimeException: Could > not find a way to create AutoValue class class > org.apache.beam.sdk.io.kafka.KafkaSourceDescriptor > at > java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) > at > java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908) > at > org.apache.beam.sdk.fn.data.CompletableFutureInboundDataClient.awaitCompletion(CompletableFutureInboundDataClient.java:48) > at > org.apache.beam.sdk.fn.data.BeamFnDataInboundObserver.awaitCompletion(BeamFnDataInboundObserver.java:91) > at > org.apache.beam.fn.harness.BeamFnDataReadRunner.blockTillReadFinishes(BeamFnDataReadRunner.java:342) > at > org.apache.beam.fn.harness.data.PTransformFunctionRegistry.lambda$register$0(PTransformFunctionRegistry.java:108) > at > org.apache.beam.fn.harness.control.ProcessBundleHandler.processBundle(ProcessBundleHandler.java:302) > at > org.apache.beam.fn.harness.control.BeamFnControlClient.delegateOnInstructionRequestType(BeamFnControlClient.java:173) > at > org.apache.beam.fn.harness.control.BeamFnControlClient.lambda$processInstructionRequests$0(BeamFnControlClient.java:157) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > ``` -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (BEAM-10663) Python postcommit fails after BEAM-9977 #11749 merge
[ https://issues.apache.org/jira/browse/BEAM-10663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186119#comment-17186119 ] Chamikara Madhusanka Jayalath edited comment on BEAM-10663 at 8/27/20, 9:37 PM: I think new SDF Kafka implementation introduced multiple regression for Dataflow. One was fixed by [https://github.com/apache/beam/pull/12585] but there's another dataloss error that has not been fixed yet. Given that 2.23.0 used SDF BoundedSource-wrapper I don't think using the same for 2.24.0 can be considered a regression. IMO we should try to fix issues with new SDF Kafka implementation to enable it for 2.25.0. was (Author: chamikara): I think new SDF Kafka implementation introduced multiple regression for Dataflow. One was fixed by [https://github.com/apache/beam/pull/12585] but there's another dataloss error that has not been fixed yet. Given that 2.23.0 used SDF BoundedSource-wrapper I don't think using the same for 2.24.0 can not be considered a regression. IMO we should try to fix issues with new SDF Kafka implementation to enable it for 2.25.0. > Python postcommit fails after BEAM-9977 #11749 merge > > > Key: BEAM-10663 > URL: https://issues.apache.org/jira/browse/BEAM-10663 > Project: Beam > Issue Type: Bug > Components: cross-language, io-py-kafka, test-failures >Affects Versions: 2.24.0 >Reporter: Piotr Szuberski >Assignee: Boyuan Zhang >Priority: P0 > Fix For: 2.24.0 > > Time Spent: 3h 50m > Remaining Estimate: 0h > > Python postcommits fail on CrossLanguageKafkaIO python tests after #11749 > (BEAM-9977) merge. 
> > Fragment of stack trace: > ``` > {{Caused by: java.util.concurrent.ExecutionException: > java.lang.RuntimeException: Error received from SDK harness for instruction > 2: java.util.concurrent.ExecutionException: java.lang.RuntimeException: Could > not find a way to create AutoValue class class > org.apache.beam.sdk.io.kafka.KafkaSourceDescriptor > at > java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) > at > java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908) > at > org.apache.beam.sdk.fn.data.CompletableFutureInboundDataClient.awaitCompletion(CompletableFutureInboundDataClient.java:48) > at > org.apache.beam.sdk.fn.data.BeamFnDataInboundObserver.awaitCompletion(BeamFnDataInboundObserver.java:91) > at > org.apache.beam.fn.harness.BeamFnDataReadRunner.blockTillReadFinishes(BeamFnDataReadRunner.java:342) > at > org.apache.beam.fn.harness.data.PTransformFunctionRegistry.lambda$register$0(PTransformFunctionRegistry.java:108) > at > org.apache.beam.fn.harness.control.ProcessBundleHandler.processBundle(ProcessBundleHandler.java:302) > at > org.apache.beam.fn.harness.control.BeamFnControlClient.delegateOnInstructionRequestType(BeamFnControlClient.java:173) > at > org.apache.beam.fn.harness.control.BeamFnControlClient.lambda$processInstructionRequests$0(BeamFnControlClient.java:157) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748)}} > ``` -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-10869) Pubsub native write should be supported over fnapi
[ https://issues.apache.org/jira/browse/BEAM-10869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Madhusanka Jayalath updated BEAM-10869: - Priority: P0 (was: P2) > Pubsub native write should be supported over fnapi > -- > > Key: BEAM-10869 > URL: https://issues.apache.org/jira/browse/BEAM-10869 > Project: Beam > Issue Type: Improvement > Components: runner-dataflow >Reporter: Boyuan Zhang >Priority: P0 > Time Spent: 15h 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-10869) Pubsub native write should be supported over fnapi
[ https://issues.apache.org/jira/browse/BEAM-10869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Madhusanka Jayalath updated BEAM-10869: - Fix Version/s: 2.26.0 > Pubsub native write should be supported over fnapi > -- > > Key: BEAM-10869 > URL: https://issues.apache.org/jira/browse/BEAM-10869 > Project: Beam > Issue Type: Improvement > Components: runner-dataflow >Reporter: Boyuan Zhang >Priority: P0 > Fix For: 2.26.0 > > Time Spent: 15h 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-10869) Pubsub native write should be supported over fnapi
[ https://issues.apache.org/jira/browse/BEAM-10869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17225023#comment-17225023 ] Chamikara Madhusanka Jayalath commented on BEAM-10869: -- I believe the remaining work here is updating the runner API proto. This should be a blocker for 2.26.0 since we should try to finalize the proto before it's released. I'm looking into this but currently running into an issue due to internal tests failing for Dataflow Runner v2. > Pubsub native write should be supported over fnapi > -- > > Key: BEAM-10869 > URL: https://issues.apache.org/jira/browse/BEAM-10869 > Project: Beam > Issue Type: Improvement > Components: runner-dataflow >Reporter: Boyuan Zhang >Priority: P0 > Fix For: 2.26.0 > > Time Spent: 15h 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (BEAM-10869) Pubsub native write should be supported over fnapi
[ https://issues.apache.org/jira/browse/BEAM-10869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Madhusanka Jayalath reassigned BEAM-10869: Assignee: Chamikara Madhusanka Jayalath > Pubsub native write should be supported over fnapi > -- > > Key: BEAM-10869 > URL: https://issues.apache.org/jira/browse/BEAM-10869 > Project: Beam > Issue Type: Improvement > Components: runner-dataflow >Reporter: Boyuan Zhang >Assignee: Chamikara Madhusanka Jayalath >Priority: P0 > Fix For: 2.26.0 > > Time Spent: 15h 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-11129) Populate display data in Beam protos.
[ https://issues.apache.org/jira/browse/BEAM-11129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Madhusanka Jayalath updated BEAM-11129: - Priority: P1 (was: P0) > Populate display data in Beam protos. > - > > Key: BEAM-11129 > URL: https://issues.apache.org/jira/browse/BEAM-11129 > Project: Beam > Issue Type: Bug > Components: beam-model, sdk-java-core, sdk-py-core >Reporter: Robert Bradshaw >Priority: P1 > Fix For: 2.26.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-11129) Populate display data in Beam protos.
[ https://issues.apache.org/jira/browse/BEAM-11129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Madhusanka Jayalath updated BEAM-11129: - Fix Version/s: (was: 2.26.0) > Populate display data in Beam protos. > - > > Key: BEAM-11129 > URL: https://issues.apache.org/jira/browse/BEAM-11129 > Project: Beam > Issue Type: Bug > Components: beam-model, sdk-java-core, sdk-py-core >Reporter: Robert Bradshaw >Priority: P1 > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-11129) Populate display data in Beam protos.
[ https://issues.apache.org/jira/browse/BEAM-11129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17225035#comment-17225035 ] Chamikara Madhusanka Jayalath commented on BEAM-11129: -- I don't believe this is a blocker for 2.26.0. > Populate display data in Beam protos. > - > > Key: BEAM-11129 > URL: https://issues.apache.org/jira/browse/BEAM-11129 > Project: Beam > Issue Type: Bug > Components: beam-model, sdk-java-core, sdk-py-core >Reporter: Robert Bradshaw >Priority: P0 > Fix For: 2.26.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-10869) Pubsub native write should be supported over fnapi
[ https://issues.apache.org/jira/browse/BEAM-10869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Madhusanka Jayalath updated BEAM-10869: - Fix Version/s: (was: 2.26.0) > Pubsub native write should be supported over fnapi > -- > > Key: BEAM-10869 > URL: https://issues.apache.org/jira/browse/BEAM-10869 > Project: Beam > Issue Type: Improvement > Components: runner-dataflow >Reporter: Boyuan Zhang >Assignee: Chamikara Madhusanka Jayalath >Priority: P0 > Time Spent: 15h 50m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-10869) Pubsub native write should be supported over fnapi
[ https://issues.apache.org/jira/browse/BEAM-10869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Madhusanka Jayalath updated BEAM-10869: - Priority: P1 (was: P0) > Pubsub native write should be supported over fnapi > -- > > Key: BEAM-10869 > URL: https://issues.apache.org/jira/browse/BEAM-10869 > Project: Beam > Issue Type: Improvement > Components: runner-dataflow >Reporter: Boyuan Zhang >Assignee: Chamikara Madhusanka Jayalath >Priority: P1 > Time Spent: 15h 50m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-10869) Pubsub native write should be supported over fnapi
[ https://issues.apache.org/jira/browse/BEAM-10869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17225865#comment-17225865 ] Chamikara Madhusanka Jayalath commented on BEAM-10869: -- I believe P0 work is done here. Assigning to Boyuan to close if this is done. > Pubsub native write should be supported over fnapi > -- > > Key: BEAM-10869 > URL: https://issues.apache.org/jira/browse/BEAM-10869 > Project: Beam > Issue Type: Improvement > Components: runner-dataflow >Reporter: Boyuan Zhang >Assignee: Chamikara Madhusanka Jayalath >Priority: P1 > Time Spent: 15h 50m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (BEAM-10869) Pubsub native write should be supported over fnapi
[ https://issues.apache.org/jira/browse/BEAM-10869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Madhusanka Jayalath reassigned BEAM-10869: Assignee: Boyuan Zhang (was: Chamikara Madhusanka Jayalath) > Pubsub native write should be supported over fnapi > -- > > Key: BEAM-10869 > URL: https://issues.apache.org/jira/browse/BEAM-10869 > Project: Beam > Issue Type: Improvement > Components: runner-dataflow >Reporter: Boyuan Zhang >Assignee: Boyuan Zhang >Priority: P1 > Time Spent: 15h 50m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-11033) Update Dataflow metrics processor to handle portable jobs
[ https://issues.apache.org/jira/browse/BEAM-11033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Madhusanka Jayalath updated BEAM-11033: - Priority: P1 (was: P2) > Update Dataflow metrics processor to handle portable jobs > - > > Key: BEAM-11033 > URL: https://issues.apache.org/jira/browse/BEAM-11033 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Chamikara Madhusanka Jayalath >Assignee: Chamikara Madhusanka Jayalath >Priority: P1 > Time Spent: 2.5h > Remaining Estimate: 0h > > Currently, Dataflow metrics processor expects Dataflow internal step names > generated for v1beta3 job description in metrics returned by Dataflow > service: > [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/dataflow/dataflow_metrics.py#L97] > > But with portable job submission, Dataflow uses PTransform ID (in proto > pipeline) as the internal step name. Hence metrics processor should be > updated to handle this. -- This message was sent by Atlassian Jira (v8.3.4#803005)
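The issue above describes two naming schemes a metrics processor has to cope with: legacy v1beta3 internal step names ("s1", "s2", ...) versus the PTransform ID from the proto pipeline under portable job submission. The following is a minimal illustrative sketch of that lookup problem, not Beam's actual implementation; the function and variable names, and the example ID "ref_AppliedPTransform_Read_3", are hypothetical:

```python
def resolve_user_step_name(metric_step, v1beta3_aliases, proto_transforms):
    """Map a step name reported by the service to the user transform name.

    v1beta3_aliases: internal step names ("s1", "s2", ...) -> user names.
    proto_transforms: PTransform IDs from the proto pipeline -> user names.
    """
    # Legacy v1beta3 job description: service reports an internal alias.
    if metric_step in v1beta3_aliases:
        return v1beta3_aliases[metric_step]
    # Portable submission: the step name is the PTransform ID itself.
    if metric_step in proto_transforms:
        return proto_transforms[metric_step]
    return None  # unknown step; caller decides how to surface this

aliases = {"s1": "Create", "s2": "Read"}
transforms = {"ref_AppliedPTransform_Read_3": "Read"}

print(resolve_user_step_name("s2", aliases, transforms))  # Read
print(resolve_user_step_name("ref_AppliedPTransform_Read_3",
                             aliases, transforms))        # Read
```

A processor that only consults the alias map would return nothing for portable jobs, which is the gap the issue asks to close.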
[jira] [Reopened] (BEAM-11033) Update Dataflow metrics processor to handle portable jobs
[ https://issues.apache.org/jira/browse/BEAM-11033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Madhusanka Jayalath reopened BEAM-11033: -- > Update Dataflow metrics processor to handle portable jobs > - > > Key: BEAM-11033 > URL: https://issues.apache.org/jira/browse/BEAM-11033 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Chamikara Madhusanka Jayalath >Assignee: Chamikara Madhusanka Jayalath >Priority: P2 > Time Spent: 2.5h > Remaining Estimate: 0h > > Currently, Dataflow metrics processor expects Dataflow internal step names > generated for v1beta3 job description in metrics returned by Dataflow > service: > [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/dataflow/dataflow_metrics.py#L97] > > But with portable job submission, Dataflow uses PTransform ID (in proto > pipeline) as the internal step name. Hence metrics processor should be > updated to handle this. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-11033) Update Dataflow metrics processor to handle portable jobs
[ https://issues.apache.org/jira/browse/BEAM-11033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Madhusanka Jayalath updated BEAM-11033: - Fix Version/s: 2.26.0 > Update Dataflow metrics processor to handle portable jobs > - > > Key: BEAM-11033 > URL: https://issues.apache.org/jira/browse/BEAM-11033 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Chamikara Madhusanka Jayalath >Assignee: Chamikara Madhusanka Jayalath >Priority: P1 > Fix For: 2.26.0 > > Time Spent: 2.5h > Remaining Estimate: 0h > > Currently, Dataflow metrics processor expects Dataflow internal step names > generated for v1beta3 job description in metrics returned by Dataflow > service: > [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/dataflow/dataflow_metrics.py#L97] > > But with portable job submission, Dataflow uses PTransform ID (in proto > pipeline) as the internal step name. Hence metrics processor should be > updated to handle this. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-11254) Add documentation for authoring and using multi-language pipelines
[ https://issues.apache.org/jira/browse/BEAM-11254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Madhusanka Jayalath updated BEAM-11254: - Summary: Add documentation for authoring and using multi-language pipelines (was: Add documentation for authoring and using cross-language pipelines) > Add documentation for authoring and using multi-language pipelines > -- > > Key: BEAM-11254 > URL: https://issues.apache.org/jira/browse/BEAM-11254 > Project: Beam > Issue Type: New Feature > Components: cross-language >Reporter: Chamikara Madhusanka Jayalath >Priority: P2 > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (BEAM-11254) Add documentation for authoring and using cross-language pipelines
Chamikara Madhusanka Jayalath created BEAM-11254: Summary: Add documentation for authoring and using cross-language pipelines Key: BEAM-11254 URL: https://issues.apache.org/jira/browse/BEAM-11254 Project: Beam Issue Type: New Feature Components: cross-language Reporter: Chamikara Madhusanka Jayalath -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-11033) Update Dataflow metrics processor to handle portable jobs
[ https://issues.apache.org/jira/browse/BEAM-11033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Madhusanka Jayalath updated BEAM-11033: - Status: Resolved (was: Open) > Update Dataflow metrics processor to handle portable jobs > - > > Key: BEAM-11033 > URL: https://issues.apache.org/jira/browse/BEAM-11033 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Chamikara Madhusanka Jayalath >Assignee: Chamikara Madhusanka Jayalath >Priority: P1 > Fix For: 2.26.0 > > Time Spent: 4h 50m > Remaining Estimate: 0h > > Currently, Dataflow metrics processor expects Dataflow internal step names > generated for v1beta3 job description in metrics returned by Dataflow > service: > [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/dataflow/dataflow_metrics.py#L97] > > But with portable job submission, Dataflow uses PTransform ID (in proto > pipeline) as the internal step name. Hence metrics processor should be > updated to handle this. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-6807) Implement an Azure blobstore filesystem for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-6807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17231821#comment-17231821 ] Chamikara Madhusanka Jayalath commented on BEAM-6807: - Can we resolve this since [https://github.com/apache/beam/pull/12492] was merged ? > Implement an Azure blobstore filesystem for Python SDK > -- > > Key: BEAM-6807 > URL: https://issues.apache.org/jira/browse/BEAM-6807 > Project: Beam > Issue Type: New Feature > Components: sdk-py-core >Reporter: Pablo Estrada >Assignee: Aldair Coronel >Priority: P2 > Labels: GSoC2019, azure, azureblob, gsoc, gsoc2019, gsoc2020, > mentor > Time Spent: 6h 10m > Remaining Estimate: 0h > > This is similar to BEAM-2572, but for Azure's blobstore. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-7925) ParquetIO supports neither column projection nor filter predicate
[ https://issues.apache.org/jira/browse/BEAM-7925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17231858#comment-17231858 ] Chamikara Madhusanka Jayalath commented on BEAM-7925: - Can this Jira be resolved since [https://github.com/apache/beam/pull/12786] went in ? cc: [~heejong] [~danielxjd] > ParquetIO supports neither column projection nor filter predicate > - > > Key: BEAM-7925 > URL: https://issues.apache.org/jira/browse/BEAM-7925 > Project: Beam > Issue Type: Improvement > Components: io-java-parquet >Affects Versions: 2.14.0 >Reporter: Neville Li >Priority: P3 > Time Spent: 3h 20m > Remaining Estimate: 0h > > Current {{ParquetIO}} supports neither column projection nor filter predicate > which defeats the performance motivation of using Parquet in the first place. > That's why we have our own implementation of > [ParquetIO|https://github.com/spotify/scio/tree/master/scio-parquet/src] in > Scio. > Reading Parquet as Avro with column projection has some complications, > namely, the resulting Avro records may be incomplete and will not survive > ser/de. A workaround maybe provide a {{TypedRead}} interface that takes a > {{Function}} that maps invalid Avro {{A}} into user defined type {{B}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
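The workaround sketched in the issue, a TypedRead-style interface taking a function that maps incomplete records into a user-defined type, can be illustrated as follows. This is a hedged sketch of the idea only; `UserRecord` and `read_with_projection` are hypothetical names, and plain dicts stand in for partially-read Avro records:

```python
from typing import Callable, NamedTuple

class UserRecord(NamedTuple):
    id: int
    name: str

def read_with_projection(rows, columns,
                         parse_fn: Callable[[dict], UserRecord]):
    # Simulate Parquet column projection: only requested columns are
    # materialized, so intermediate records are deliberately incomplete
    # and must not be handed downstream as-is.
    projected = [{c: r[c] for c in columns} for r in rows]
    # The user-supplied function converts each partial record into a
    # complete, serializable user type.
    return [parse_fn(r) for r in projected]

rows = [{"id": 1, "name": "a", "payload": "x" * 1000},
        {"id": 2, "name": "b", "payload": "y" * 1000}]
out = read_with_projection(rows, ["id", "name"],
                           lambda r: UserRecord(r["id"], r["name"]))
print(out[0])  # UserRecord(id=1, name='a')
```

The point of the mapping function is that nothing after the read depends on the dropped columns surviving serialization, which is the complication the issue calls out for Avro-backed reads.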
[jira] [Updated] (BEAM-10397) add missing environment in windowing strategy for Dataflow
[ https://issues.apache.org/jira/browse/BEAM-10397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Madhusanka Jayalath updated BEAM-10397: - Fix Version/s: 2.23.0 > add missing environment in windowing strategy for Dataflow > -- > > Key: BEAM-10397 > URL: https://issues.apache.org/jira/browse/BEAM-10397 > Project: Beam > Issue Type: Bug > Components: cross-language, runner-dataflow >Reporter: Heejong Lee >Assignee: Heejong Lee >Priority: P2 > Fix For: 2.23.0 > > Time Spent: 2h 10m > Remaining Estimate: 0h > > add missing environment in windowing strategy for Dataflow -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-10397) add missing environment in windowing strategy for Dataflow
[ https://issues.apache.org/jira/browse/BEAM-10397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Madhusanka Jayalath updated BEAM-10397: - Priority: P1 (was: P2) > add missing environment in windowing strategy for Dataflow > -- > > Key: BEAM-10397 > URL: https://issues.apache.org/jira/browse/BEAM-10397 > Project: Beam > Issue Type: Bug > Components: cross-language, runner-dataflow >Reporter: Heejong Lee >Assignee: Heejong Lee >Priority: P1 > Fix For: 2.23.0 > > Time Spent: 2h 10m > Remaining Estimate: 0h > > add missing environment in windowing strategy for Dataflow -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-10507) support xlang custom window fn
[ https://issues.apache.org/jira/browse/BEAM-10507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Madhusanka Jayalath updated BEAM-10507: - Priority: P1 (was: P2) > support xlang custom window fn > -- > > Key: BEAM-10507 > URL: https://issues.apache.org/jira/browse/BEAM-10507 > Project: Beam > Issue Type: Bug > Components: cross-language >Reporter: Heejong Lee >Assignee: Heejong Lee >Priority: P1 > > support xlang custom window fn -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-10507) support xlang custom window fn
[ https://issues.apache.org/jira/browse/BEAM-10507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Madhusanka Jayalath updated BEAM-10507: - Priority: P2 (was: P1) > support xlang custom window fn > -- > > Key: BEAM-10507 > URL: https://issues.apache.org/jira/browse/BEAM-10507 > Project: Beam > Issue Type: Bug > Components: cross-language >Reporter: Heejong Lee >Assignee: Heejong Lee >Priority: P2 > > support xlang custom window fn -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (BEAM-10511) Allow users to specify the temporary dataset for BQ
Chamikara Madhusanka Jayalath created BEAM-10511: Summary: Allow users to specify the temporary dataset for BQ Key: BEAM-10511 URL: https://issues.apache.org/jira/browse/BEAM-10511 Project: Beam Issue Type: Improvement Components: io-py-gcp Reporter: Chamikara Madhusanka Jayalath Assignee: Pablo Estrada We recently added this to Java: https://issues.apache.org/jira/browse/BEAM-8458 We should add this to Python custom source as well. Pablo, assigning to you for now. I suspect this will be a relatively small change for the custom source. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-6868) Flink runner supports Bundle Finalization
[ https://issues.apache.org/jira/browse/BEAM-6868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17161563#comment-17161563 ] Chamikara Madhusanka Jayalath commented on BEAM-6868: - [~boyuanz] is this a regression? X-lang KafkaIO used to work for portable Flink/Spark AFAIK. (also, this seems higher priority than P3) > Flink runner supports Bundle Finalization > - > > Key: BEAM-6868 > URL: https://issues.apache.org/jira/browse/BEAM-6868 > Project: Beam > Issue Type: New Feature > Components: runner-flink >Reporter: Boyuan Zhang >Priority: P3 > -- This message was sent by Atlassian Jira (v8.3.4#803005)

[jira] [Updated] (BEAM-6868) Flink runner supports Bundle Finalization
[ https://issues.apache.org/jira/browse/BEAM-6868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Madhusanka Jayalath updated BEAM-6868: Priority: P1 (was: P3) > Flink runner supports Bundle Finalization > - > > Key: BEAM-6868 > URL: https://issues.apache.org/jira/browse/BEAM-6868 > Project: Beam > Issue Type: New Feature > Components: runner-flink >Reporter: Boyuan Zhang >Priority: P1 > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (BEAM-6868) Flink runner supports Bundle Finalization
[ https://issues.apache.org/jira/browse/BEAM-6868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17161563#comment-17161563 ] Chamikara Madhusanka Jayalath edited comment on BEAM-6868 at 7/20/20, 9:46 PM: --- [~boyuanz] and [~mxm] is this a regression ? X-lang KafkaIO used to work for portable Flink/Spark AFAIK. (also seems like this is high priority than P3) was (Author: chamikara): [~boyuanz] is this a regression ? X-lang KafkaIO used to work for portable Flink/Spark AFAIK. (also seems like this is high priority than P3) > Flink runner supports Bundle Finalization > - > > Key: BEAM-6868 > URL: https://issues.apache.org/jira/browse/BEAM-6868 > Project: Beam > Issue Type: New Feature > Components: runner-flink >Reporter: Boyuan Zhang >Priority: P1 > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-6868) Flink runner supports Bundle Finalization
[ https://issues.apache.org/jira/browse/BEAM-6868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Madhusanka Jayalath updated BEAM-6868: Component/s: cross-language > Flink runner supports Bundle Finalization > - > > Key: BEAM-6868 > URL: https://issues.apache.org/jira/browse/BEAM-6868 > Project: Beam > Issue Type: New Feature > Components: cross-language, runner-flink >Reporter: Boyuan Zhang >Priority: P1 > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-6868) Flink runner supports Bundle Finalization
[ https://issues.apache.org/jira/browse/BEAM-6868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17164050#comment-17164050 ] Chamikara Madhusanka Jayalath commented on BEAM-6868: - I think users are running into this. For example, see [https://lists.apache.org/thread.html/re53d28276def9593dc5f28cbd9710d141aa0b68e2e0486d401724398%40%3Cdev.beam.apache.org%3E] > Flink runner supports Bundle Finalization > - > > Key: BEAM-6868 > URL: https://issues.apache.org/jira/browse/BEAM-6868 > Project: Beam > Issue Type: New Feature > Components: cross-language, runner-flink >Reporter: Boyuan Zhang >Priority: P1 > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-10593) Publish SNAPSHOT SDK harness containers nightly
[ https://issues.apache.org/jira/browse/BEAM-10593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Madhusanka Jayalath updated BEAM-10593: - Component/s: cross-language > Publish SNAPSHOT SDK harness containers nightly > --- > > Key: BEAM-10593 > URL: https://issues.apache.org/jira/browse/BEAM-10593 > Project: Beam > Issue Type: Task > Components: cross-language, sdk-java-harness, sdk-py-harness >Reporter: Brian Hulette >Priority: P2 > > Facilitates testing distributed runners at HEAD. > Discussed on > [dev@|https://lists.apache.org/thread.html/rd26405613c9b7b53c860080a9372e54aa5f3410e5be0db8ca414ec07%40%3Cdev.beam.apache.org%3E] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-10011) Test Python SqlTransform on Dataflow
[ https://issues.apache.org/jira/browse/BEAM-10011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Madhusanka Jayalath updated BEAM-10011: - Component/s: cross-language > Test Python SqlTransform on Dataflow > > > Key: BEAM-10011 > URL: https://issues.apache.org/jira/browse/BEAM-10011 > Project: Beam > Issue Type: Improvement > Components: cross-language, sdk-py-core >Reporter: Brian Hulette >Assignee: Brian Hulette >Priority: P2 > > We should run sql_test.py on Dataflow. Also need to make sure Java VR suite > is run on Dataflow runner v2 (to verify the expanded SQL transforms will work > there). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (BEAM-10248) Beam does not set correct region for BigQuery when requesting load job status
[ https://issues.apache.org/jira/browse/BEAM-10248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Madhusanka Jayalath reassigned BEAM-10248: Assignee: Chamikara Madhusanka Jayalath > Beam does not set correct region for BigQuery when requesting load job status > - > > Key: BEAM-10248 > URL: https://issues.apache.org/jira/browse/BEAM-10248 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Affects Versions: 2.22.0 > Environment: -Beam 2.22.0 > -DirectRunner & DataflowRunner > -Java 11 >Reporter: Graham Polley >Assignee: Chamikara Madhusanka Jayalath >Priority: P1 > Attachments: Untitled document.pdf > > > I am using `FILE_LOADS` (micro-batching) from Pub/Sub and writing to > BigQuery. My BigQuery dataset is in region `australia-southeast1`. > If the load job into BigQuery fails for some reason (e.g. table does not > exist, or schema has changed), then an error from BigQuery is returned for > the load. > Beam then enters into a loop of retrying the load job. However, instead of > using a new job id (where the suffix is incremented by 1), it wrongly tries > to reinsert the job using the same job id. This is because when it tries to > look up if the job id already exists, it does not take into consideration the > region where the dataset is. Instead, it defaults to the US but the job was > created in `australia-southeast1`, so it returns `null`. > Bug is on LN308 of `BigQueryHelpers.java`. It should not return `null`. It > returns `null` because it's looking in the wrong region. > If I *test* using a dataset in BigQuery that is in the US region, then it > correctly finds the job id and begins to retry the job with a new job id > (suffixed with the retry count). We have seen this bug in other areas of Beam > before and in other tools/services on GCP e.g. Cloud Composer. > However, that leads me to my next problem/bug. 
Even if that is fixed, the > number of retries is set to `Integer.MAX_VALUE`, so Beam will keep retrying > the job 2,147,483,647 times. This is not good. > The exception is swallowed up by the Beam SDK and never propagated back up > the stack for users to catch and handle. So, if a load job fails, there is no > way for users to handle and react to it. > I attempted to set `withFailedInsertRetryPolicy()` to transient errors only, > but it is not supported with `FILE_LOADS`. I also tried, using the > `WriteResult` object returned from the BigQuery sink/write, to get a handle > on the error but it does not work. Users need a way to catch and handle failed > load jobs when using `FILE_LOADS`. > > {code:java} > public class TemplatePipeline { > private static final String TOPIC = > "projects/etl-demo-269105/topics/test-micro-batching"; > private static final String BIGQUERY_DESTINATION_FILTERED = > "etl-demo-269105:etl_demo_fun.micro_batch_test_xxx"; > public static void main(String[] args) throws Exception { > try { > PipelineOptionsFactory.register(DataflowPipelineOptions.class); > DataflowPipelineOptions options = PipelineOptionsFactory > .fromArgs(args) > .withoutStrictParsing() > .as(DataflowPipelineOptions.class); > Pipeline pipeline = Pipeline.create(options); > PCollection<PubsubMessage> messages = pipeline > .apply(PubsubIO.readMessagesWithAttributes().fromTopic(String.format(TOPIC, > options.getProject()))) > .apply(Window.into(FixedWindows.of(Duration.standardSeconds(1)))); > WriteResult result = messages.apply(ParDo.of(new RowToBQRow())) > .apply(BigQueryIO.writeTableRows() > .to(String.format(BIGQUERY_DESTINATION_FILTERED, options.getProject())) > .withCreateDisposition(CREATE_NEVER) > .withWriteDisposition(WRITE_APPEND) > .withMethod(BigQueryIO.Write.Method.FILE_LOADS) > .withTriggeringFrequency(Duration.standardSeconds(5)) > .withNumFileShards(1) > .withExtendedErrorInfo() > .withSchema(getTableSchema())); > // result.getFailedInsertsWithErr().apply(ParDo.of(new > DoFn<BigQueryInsertError, String>() { > // 
@ProcessElement > // public void processElement(ProcessContext c) { > // for(ErrorProto err : c.element().getError().getErrors()){ > // throw new RuntimeException(err.getMessage()); > // } > // > // } > // })); > result.getFailedInserts().apply(ParDo.of(new DoFn<TableRow, String>() { > @ProcessElement > public void processElement(ProcessContext c) { > System.out.println(c.element()); > c.output("foo"); > throw new RuntimeException("Failed to load"); > } > @FinishBundle > public void finishUp(FinishBundleContext finishBundleContext){ > System.out.println("Got here"); > } > })); > pipeline.run(); > } catch (Exception e) { > e.printStackTrace(); > throw new Exception(e); > } > } > private static TableSchema getTableSchema() { > List<TableFieldSchema> fields = new ArrayList<>(); > fields.add(new TableField
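The core of the region bug reported above is that BigQuery load jobs live in a (region, job id) namespace, so a lookup that omits the dataset's region defaults to the wrong location and reports the job as absent. The following is a hypothetical sketch of that failure mode, with an in-memory dict standing in for the BigQuery jobs service; the names and the job id are illustrative:

```python
# (region, job_id) -> terminal state, standing in for the jobs service.
jobs = {("australia-southeast1", "load_attempt_0"): "FAILED"}

def job_status_buggy(job_id):
    # Region defaults to US, so a job created in australia-southeast1
    # is never found and looks like it was never inserted.
    return jobs.get(("US", job_id))

def job_status_fixed(job_id, region):
    # Looking up in the dataset's region finds the failed attempt, so
    # the retry correctly gets a new job id with an incremented suffix.
    return jobs.get((region, job_id))

assert job_status_buggy("load_attempt_0") is None
assert job_status_fixed("load_attempt_0",
                        "australia-southeast1") == "FAILED"
```

With the buggy lookup returning nothing, the runner reinserts the same job id instead of incrementing the retry suffix, which matches the behavior described in the report.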
[jira] [Comment Edited] (BEAM-10248) Beam does not set correct region for BigQuery when requesting load job status
[ https://issues.apache.org/jira/browse/BEAM-10248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17167389#comment-17167389 ] Chamikara Madhusanka Jayalath edited comment on BEAM-10248 at 7/29/20, 5:39 PM: There were three separate concerns here. (1) Not setting the correct region when checking the status of the job does seem like a bug. (2) "Even if that is fixed, the number of retries is set to `Integer.MAX_VALUE`" I'm not sure if this analysis is correct. The number of status poll attempts is set to Integer.MAX_VALUE, but the number of retries of the job seems to come from below, which is set to 3.
[https://github.com/apache/beam/blob/6fdde4f4eab72b49b10a8bb1cb3be263c5c416b5/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteTables.java#L449] [https://github.com/apache/beam/blob/6fdde4f4eab72b49b10a8bb1cb3be263c5c416b5/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java#L116] Note though that for streaming jobs, Dataflow will keep retrying failed workitems forever. So essentially this BQ job will be attempted forever without the job failing. This is just the way the Dataflow streaming runner operates. (3) "So, if a load job fails there is no way to handle it and react for users." This is a known limitation and seems to be separate from the immediate issue here. Currently withFailedInsertRetryPolicy() and WriteResult are only supported for streaming inserts. But we should be logging job failures in worker logs. was (Author: chamikara): Not setting the correct region when checking the status of the job does seem like a bug. However, that leads me to my next problem/bug. Even if that is fixed, the number of retries is set to `Integer.MAX_VALUE`: I'm not sure if this is correct. The number of status poll attempts is set to Integer.MAX_VALUE, but the number of retries of the job seems to come from below, which is set to 3.
> Beam does not set correct region for BigQuery when requesting load job status
> -
>
> Key: BEAM-10248
> URL: https://issues.apache.org/jira/browse/BEAM-10248
> Project: Beam
> Issue Type: Bug
> Components: io-java-gcp
> Affects Versions: 2.22.0
> Environment: -Beam 2.22.0
> -DirectRunner & DataflowRunner
> -Java 11
> Reporter: Graham Polley
> Assignee: Chamikara Madhusanka Jayalath
> Priority: P1
> Attachments: Untitled document.pdf
>
> I am using `FILE_LOADS` (micro-batching) from Pub/Sub and writing to BigQuery. My BigQuery dataset is in region `australia-southeast1`.
> If the load job into BigQuery fails for some reason (e.g. table does not exist, or schema has changed), then an error from BigQuery is returned for the load.
> Beam then enters into a loop of retrying the load job. However, instead of using a new job id (where the suffix is incremented by 1), it wrongly tries to reinsert the job using the same job id. This is because when it tries to look up whether the job id already exists, it does not take into consideration the region where the dataset is. Instead, it defaults to the US, but the job was created in `australia-southeast1`, so it returns `null`.
> The bug is on line 308 of `BigQueryHelpers.java`. It should not return `null`. It returns `null` because it's looking in the wrong region.
> If I *test* using a dataset in BigQuery that is in the US region, then it correctly finds the job id and begins to retry the job with a new job id (suffixed with the retry count). We have seen this bug in other areas of Beam before and in other tools/services on GCP, e.g. Cloud Composer.
> However, that leads me to my next problem/bug. Even if that is fixed, the number of retries is set to `Integer.MAX_VALUE`
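The distinction drawn in the comment above — status polling is bounded by Integer.MAX_VALUE while the load job itself is retried at most 3 times — can be sketched in plain Java. This is an illustration only; the class and method names are hypothetical and this is not Beam's actual WriteTables/BatchLoads code:

```java
// Sketch of the two different bounds discussed above (hypothetical names,
// not Beam's implementation): polling a job's status is effectively
// unbounded, while re-running a failed load job is capped at 3 attempts.
class RetrySemanticsSketch {
    static final int MAX_JOB_RETRIES = 3;                    // BatchLoads-style cap
    static final int MAX_POLL_ATTEMPTS = Integer.MAX_VALUE;  // status-poll bound

    /** Polls the (simulated) job status until it reaches a terminal state. */
    static String pollUntilDone(java.util.Iterator<String> statuses) {
        for (int attempt = 0; attempt < MAX_POLL_ATTEMPTS; attempt++) {
            String status = statuses.next();
            if (status.equals("DONE") || status.equals("FAILED")) {
                return status;  // terminal state reached; stop polling
            }
        }
        return "UNKNOWN";
    }

    /** Re-runs the whole job up to MAX_JOB_RETRIES times. */
    static boolean runWithRetries(java.util.function.Supplier<String> runJob) {
        for (int retry = 0; retry < MAX_JOB_RETRIES; retry++) {
            if (runJob.get().equals("DONE")) {
                return true;  // job succeeded on this attempt
            }
        }
        return false;  // job failed on all 3 attempts
    }
}
```

So a single load-job attempt may be polled indefinitely, but the job is only re-submitted 3 times; for a streaming pipeline it is the runner's endless workitem retry, not this loop, that makes the failure repeat forever.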
[jira] [Commented] (BEAM-10248) Beam does not set correct region for BigQuery when requesting load job status
[ https://issues.apache.org/jira/browse/BEAM-10248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17167389#comment-17167389 ] Chamikara Madhusanka Jayalath commented on BEAM-10248: -- Not setting the correct region when checking the status of the job does seem like a bug. However, that leads me to my next problem/bug. Even if that is fixed, the number of retries is set to `Integer.MAX_VALUE`: I'm not sure if this is correct. The number of status poll attempts is set to Integer.MAX_VALUE, but the number of retries of the job seems to come from below, which is set to 3. [https://github.com/apache/beam/blob/6fdde4f4eab72b49b10a8bb1cb3be263c5c416b5/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteTables.java#L449] [https://github.com/apache/beam/blob/6fdde4f4eab72b49b10a8bb1cb3be263c5c416b5/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java#L116] Note though that for streaming jobs, Dataflow will keep retrying failed workitems forever. So essentially this BQ job will be attempted forever without the job failing. This is just the way the Dataflow streaming runner operates. "So, if a load job fails there is no way to handle it and react for users.": This is a known limitation and seems to be separate from the immediate issue here. Currently withFailedInsertRetryPolicy() and WriteResult are only supported for streaming inserts. But we should be logging job failures in worker logs.
[jira] [Comment Edited] (BEAM-10248) Beam does not set correct region for BigQuery when requesting load job status
[ https://issues.apache.org/jira/browse/BEAM-10248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17167389#comment-17167389 ] Chamikara Madhusanka Jayalath edited comment on BEAM-10248 at 7/29/20, 5:39 PM: There were three separate concerns here. (1) Not setting the correct region when checking the status of the job does seem like a bug. (2) "Even if that is fixed, the number of retries is set to `Integer.MAX_VALUE`" I'm not sure if this analysis is correct. The number of status poll attempts is set to Integer.MAX_VALUE, but the number of retries of the job seems to come from below, which is set to 3. [https://github.com/apache/beam/blob/6fdde4f4eab72b49b10a8bb1cb3be263c5c416b5/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteTables.java#L449] [https://github.com/apache/beam/blob/6fdde4f4eab72b49b10a8bb1cb3be263c5c416b5/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java#L116] Note though that for streaming jobs, Dataflow will keep retrying failed workitems forever. So essentially this BQ job will be attempted forever without the job failing. This is just the way the Dataflow streaming runner operates. (3) "So, if a load job fails there is no way to handle it and react for users." This is a known limitation and seems to be separate from the immediate issue here. Currently withFailedInsertRetryPolicy() and WriteResult are only supported for streaming inserts. But we should be logging job failures in worker logs. was (Author: chamikara): There were three separate concerns here. (1) Not setting the correct region when checking the status of the job does seem like a bug. (2) "Even if that is fixed, the number of retries is set to `Integer.MAX_VALUE`" I'm not sure if this analysis is correct. The number of status poll attempts is set to Integer.MAX_VALUE, but the number of retries of the job seems to come from below, which is set to 3.
[https://github.com/apache/beam/blob/6fdde4f4eab72b49b10a8bb1cb3be263c5c416b5/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteTables.java#L449] [https://github.com/apache/beam/blob/6fdde4f4eab72b49b10a8bb1cb3be263c5c416b5/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java#L116] Note though that for streaming jobs, Dataflow will keep retrying failed workitems forever. So essentially this BQ job will be attempted forever without the job failing. This is just the way the Dataflow streaming runner operates. (3) "So, if a load job fails there is no way to handle it and react for users." This is a known limitation and seems to be separate from the immediate issue here. Currently withFailedInsertRetryPolicy() and WriteResult are only supported for streaming inserts. But we should be logging job failures in worker logs.
[jira] [Comment Edited] (BEAM-10248) Beam does not set correct region for BigQuery when requesting load job status
[ https://issues.apache.org/jira/browse/BEAM-10248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17167389#comment-17167389 ] Chamikara Madhusanka Jayalath edited comment on BEAM-10248 at 7/29/20, 6:06 PM: There were three separate concerns here. (1) Not setting the correct region when checking the status of the job does seem like a bug. (2) "Even if that is fixed, the number of retries is set to `Integer.MAX_VALUE`" I'm not sure if this analysis is correct. The number of status poll attempts is set to Integer.MAX_VALUE, but the number of retries of the job seems to come from below, which is set to 3. [https://github.com/apache/beam/blob/6fdde4f4eab72b49b10a8bb1cb3be263c5c416b5/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteTables.java#L449] [https://github.com/apache/beam/blob/6fdde4f4eab72b49b10a8bb1cb3be263c5c416b5/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java#L116] Note though that for streaming jobs, Dataflow will keep retrying failed workitems forever. So essentially this BQ job will be attempted forever without the Dataflow job failing. This is just the way the Dataflow streaming runner operates. (3) "So, if a load job fails there is no way to handle it and react for users." This is a known limitation and seems to be separate from the immediate issue here. Currently withFailedInsertRetryPolicy() and WriteResult are only supported for streaming inserts. But we should be logging job failures in worker logs. was (Author: chamikara): There were three separate concerns here. (1) Not setting the correct region when checking the status of the job does seem like a bug. (2) "Even if that is fixed, the number of retries is set to `Integer.MAX_VALUE`" I'm not sure if this analysis is correct. The number of status poll attempts is set to Integer.MAX_VALUE, but the number of retries of the job seems to come from below, which is set to 3.
[https://github.com/apache/beam/blob/6fdde4f4eab72b49b10a8bb1cb3be263c5c416b5/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteTables.java#L449] [https://github.com/apache/beam/blob/6fdde4f4eab72b49b10a8bb1cb3be263c5c416b5/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java#L116] Note though that for streaming jobs, Dataflow will keep retrying failed workitems forever. So essentially this BQ job will be attempted forever without the job failing. This is just the way the Dataflow streaming runner operates. (3) "So, if a load job fails there is no way to handle it and react for users." This is a known limitation and seems to be separate from the immediate issue here. Currently withFailedInsertRetryPolicy() and WriteResult are only supported for streaming inserts. But we should be logging job failures in worker logs.
[jira] [Commented] (BEAM-10248) Beam does not set correct region for BigQuery when requesting load job status
[ https://issues.apache.org/jira/browse/BEAM-10248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168164#comment-17168164 ] Chamikara Madhusanka Jayalath commented on BEAM-10248: -- Hmm, looking a bit further into the code, it seems like we do set the location in the job lookup function. [https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryHelpers.java#L182] [https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteTables.java#L444]
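Why a location-blind job lookup returns `null`, as discussed in this thread, can be shown with a self-contained sketch: BigQuery scopes job ids by location, so looking up a job that was created in `australia-southeast1` while defaulting to `US` finds nothing. All names here are hypothetical illustrations, not Beam or BigQuery client code:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative model (hypothetical names, not Beam code): jobs live in a
// per-region namespace, here simulated as a map keyed by "location/jobId".
class JobLookupSketch {
    private final Map<String, String> jobs = new HashMap<>();

    void createJob(String location, String jobId, String state) {
        jobs.put(location + "/" + jobId, state);
    }

    // Buggy lookup: always consults the default "US" region, so a job
    // created elsewhere appears not to exist and its id gets reused.
    String lookupIgnoringLocation(String jobId) {
        return jobs.get("US/" + jobId);
    }

    // Fixed lookup: consults the region the dataset/job actually lives in.
    String lookupWithLocation(String location, String jobId) {
        return jobs.get(location + "/" + jobId);
    }
}
```

In this model the buggy path returns `null` for a job created in `australia-southeast1`, which is exactly the symptom in the issue: Beam concludes the id is free and re-inserts the job under the same id instead of incrementing the retry suffix.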
[jira] [Commented] (BEAM-10529) Kafka XLang fails for 'empty' key/values
[ https://issues.apache.org/jira/browse/BEAM-10529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168303#comment-17168303 ] Chamikara Madhusanka Jayalath commented on BEAM-10529: -- I think the fix here will be a bit more complex than just changing the coder to NullableCoder.of(ByteArrayCoder.of()). NullableCoder is not a standard coder, so some runners may break when it's set at the cross-language boundary. Should we be adding something like NullableCoder to standard coders? cc: [~robertwb] > Kafka XLang fails for 'empty' key/values > > > Key: BEAM-10529 > URL: https://issues.apache.org/jira/browse/BEAM-10529 > Project: Beam > Issue Type: Bug > Components: io-java-kafka, sdk-py-core >Reporter: Luke Cwik >Priority: P2 > > It looks like the Javadoc for ByteArrayDeserializer and StringDeserializer > can return null[1, 2] and we aren't using > NullableCoder.of(ByteArrayCoder.of()) in the expansion[3]. Note that KafkaIO > does this correctly in its regular coder inference logic[4]. > 1: > https://kafka.apache.org/21/javadoc/org/apache/kafka/common/serialization/ByteArrayDeserializer.html#deserialize-java.lang.String-byte:A- > 2: > https://kafka.apache.org/21/javadoc/org/apache/kafka/common/serialization/StringDeserializer.html#deserialize-java.lang.String-byte:A- > 3: > https://github.com/apache/beam/blob/af2d6b0379d64b522ecb769d88e9e7e7b8900208/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L478 > 4: > https://github.com/apache/beam/blob/af2d6b0379d64b522ecb769d88e9e7e7b8900208/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/LocalDeserializerProvider.java#L85 -- This message was sent by Atlassian Jira (v8.3.4#803005)
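The NullableCoder idea discussed above can be illustrated with a standalone sketch: a one-byte marker (0 = null, 1 = present) is written before the payload, so a null Kafka key or value can round-trip through encoding. This mirrors the general scheme, not Beam's actual NullableCoder; the single-byte length prefix is a simplification assuming payloads under 128 bytes:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Standalone sketch of nullable-value encoding (not Beam's NullableCoder):
// a marker byte distinguishes null from present, then a simplified
// one-byte length prefix and the payload follow.
class NullableBytesSketch {
    static void encode(byte[] value, OutputStream out) throws IOException {
        if (value == null) {
            out.write(0);             // null marker
        } else {
            out.write(1);             // present marker
            out.write(value.length);  // length prefix (sketch assumes < 128 bytes)
            out.write(value);
        }
    }

    static byte[] decode(InputStream in) throws IOException {
        int marker = in.read();
        if (marker == 0) {
            return null;              // null survived the round trip
        }
        int len = in.read();
        byte[] value = new byte[len];
        in.read(value);               // sketch assumes the single read fills the array
        return value;
    }
}
```

The cross-language concern in the comment is that only coders in the standard-coders list are guaranteed to be understood on both sides of the expansion boundary, so even a correct scheme like this one breaks some runners unless it is standardized.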
[jira] [Commented] (BEAM-10248) Beam does not set correct region for BigQuery when requesting load job status
[ https://issues.apache.org/jira/browse/BEAM-10248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168495#comment-17168495 ] Chamikara Madhusanka Jayalath commented on BEAM-10248: -- Actually, we are ignoring the location when looking up a job. Sent out [https://github.com/apache/beam/pull/12431].
[jira] [Commented] (BEAM-10507) support xlang custom window fn
[ https://issues.apache.org/jira/browse/BEAM-10507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168986#comment-17168986 ] Chamikara Madhusanka Jayalath commented on BEAM-10507: -- Do you have an error? I think the issue is probably that x-lang WindowFns are not currently supported by Dataflow. > support xlang custom window fn > -- > > Key: BEAM-10507 > URL: https://issues.apache.org/jira/browse/BEAM-10507 > Project: Beam > Issue Type: Bug > Components: cross-language >Reporter: Heejong Lee >Assignee: Heejong Lee >Priority: P2 > > support xlang custom window fn -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-10248) Beam does not set correct region for BigQuery when requesting load job status
[ https://issues.apache.org/jira/browse/BEAM-10248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Madhusanka Jayalath updated BEAM-10248: - Fix Version/s: 2.24.0
Even if that is fixed, the > number of retries is set to `Integer.MAX_VALUE`, so Beam will keep retrying > the job 2,147,483,647 times. This is not good. > The exception is swallowed up by the Beam SDK and never propagated back up > the stack for users to catch and handle. So, if a load job fails there is no > way for users to handle it and react. > I attempted to set `withFailedInsertRetryPolicy()` to transient errors only, > but it is not supported with `FILE_LOADS`. I also tried, using the > `WriteResult` object returned from the BigQuery sink/write, to get a handle > on the error but it does not work. Users need a way to catch and handle failed > load jobs when using `FILE_LOADS`. > > {code:java} > public class TemplatePipeline { > private static final String TOPIC = > "projects/etl-demo-269105/topics/test-micro-batching"; > private static final String BIGQUERY_DESTINATION_FILTERED = > "etl-demo-269105:etl_demo_fun.micro_batch_test_xxx"; > public static void main(String[] args) throws Exception { > try { > PipelineOptionsFactory.register(DataflowPipelineOptions.class); > DataflowPipelineOptions options = PipelineOptionsFactory > .fromArgs(args) > .withoutStrictParsing() > .as(DataflowPipelineOptions.class); > Pipeline pipeline = Pipeline.create(options); > PCollection<PubsubMessage> messages = pipeline > .apply(PubsubIO.readMessagesWithAttributes().fromTopic(String.format(TOPIC, > options.getProject()))) > .apply(Window.into(FixedWindows.of(Duration.standardSeconds(1)))); > WriteResult result = messages.apply(ParDo.of(new RowToBQRow())) > .apply(BigQueryIO.writeTableRows() > .to(String.format(BIGQUERY_DESTINATION_FILTERED, options.getProject())) > .withCreateDisposition(CREATE_NEVER) > .withWriteDisposition(WRITE_APPEND) > .withMethod(BigQueryIO.Write.Method.FILE_LOADS) > .withTriggeringFrequency(Duration.standardSeconds(5)) > .withNumFileShards(1) > .withExtendedErrorInfo() > .withSchema(getTableSchema())); > // result.getFailedInsertsWithErr().apply(ParDo.of(new > DoFn<BigQueryInsertError, String>() { > // 
@ProcessElement > // public void processElement(ProcessContext c) { > // for(ErrorProto err : c.element().getError().getErrors()){ > // throw new RuntimeException(err.getMessage()); > // } > // > // } > // })); > result.getFailedInserts().apply(ParDo.of(new DoFn<TableRow, String>() { > @ProcessElement > public void processElement(ProcessContext c) { > System.out.println(c.element()); > c.output("foo"); > throw new RuntimeException("Failed to load"); > } > @FinishBundle > public void finishUp(FinishBundleContext finishBundleContext){ > System.out.println("Got here"); > } > })); > pipeline.run(); > } catch (Exception e) { > e.printStackTrace(); > throw new Exception(e); > } > } > private static TableSchema getTableSchema() {
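The two behaviors the report describes (a region-blind job lookup that returns `null`, and retry job ids that should get an incremented suffix) can be sketched as follows. This is a minimal illustration only; the function names and the in-memory job map are hypothetical, not Beam's actual `BigQueryHelpers` code:

```python
# Sketch of the two behaviors from the report:
# 1) retry job ids should increment a suffix, and
# 2) a job-existence lookup must use the dataset's region; defaulting to
#    the US multi-region wrongly reports an existing job as missing.

def next_load_job_id(base_job_id: str, retry_index: int) -> str:
    """Return the job id to use for the retry_index-th retry attempt."""
    return f"{base_job_id}_{retry_index}"

def lookup_job(jobs_by_region: dict, region: str, job_id: str):
    """Region-aware lookup; returns None only if the job truly does not exist."""
    return jobs_by_region.get(region, {}).get(job_id)

# Simulate the bug: the job lives in australia-southeast1, but a lookup
# that defaults to "US" finds nothing and triggers a re-insert with the
# same job id.
jobs = {"australia-southeast1": {"load_job_0": "PENDING"}}
assert lookup_job(jobs, "US", "load_job_0") is None               # buggy default
assert lookup_job(jobs, "australia-southeast1", "load_job_0") == "PENDING"
assert next_load_job_id("load_job", 1) == "load_job_1"
```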
[jira] [Updated] (BEAM-10248) Beam does not set correct region for BigQuery when requesting load job status
[ https://issues.apache.org/jira/browse/BEAM-10248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Madhusanka Jayalath updated BEAM-10248: - Status: Resolved (was: Triage Needed) > Beam does not set correct region for BigQuery when requesting load job status > - > > Key: BEAM-10248 > URL: https://issues.apache.org/jira/browse/BEAM-10248 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Affects Versions: 2.22.0 >Reporter: Graham Polley >Assignee: Chamikara Madhusanka Jayalath >Priority: P1 > Fix For: 2.24.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-10248) Beam does not set correct region for BigQuery when requesting load job status
[ https://issues.apache.org/jira/browse/BEAM-10248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17169069#comment-17169069 ] Chamikara Madhusanka Jayalath commented on BEAM-10248: -- Fix for [1] was submitted and I confirmed that [2] is the intended behavior (we retry three times and fail for batch but will keep retrying the workitem for Dataflow streaming). Resolving this. [https://github.com/apache/beam/pull/12431] As mentioned before [3] is a known limitation of BQ sink. > Beam does not set correct region for BigQuery when requesting load job status > - > > Key: BEAM-10248 > URL: https://issues.apache.org/jira/browse/BEAM-10248 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Affects Versions: 2.22.0 >Reporter: Graham Polley >Assignee: Chamikara Madhusanka Jayalath >Priority: P1 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-10248) Beam does not set correct region for BigQuery when requesting load job status
[ https://issues.apache.org/jira/browse/BEAM-10248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17170127#comment-17170127 ] Chamikara Madhusanka Jayalath commented on BEAM-10248: -- The exception is not just swallowed but logged in the worker logs, right? Regarding a user hook, currently this can be done by observing the BQ jobs started by Dataflow separately, for example using the bq CLI. We are standardizing the naming pattern for BQ jobs started by a given Dataflow job [1] so that might help identify the correct set of BQ jobs for a given Dataflow job. I understand that it might be easier if users could directly handle failed load jobs programmatically from a pipeline, but this is a new feature that has to be designed separately. [~reuvenlax] for thoughts here. [1] [https://github.com/apache/beam/pull/12082] > Beam does not set correct region for BigQuery when requesting load job status > - > > Key: BEAM-10248 > URL: https://issues.apache.org/jira/browse/BEAM-10248 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Affects Versions: 2.22.0 >Reporter: Graham Polley >Assignee: Chamikara Madhusanka Jayalath >Priority: P1 > Fix For: 2.24.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
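The intended retry behavior described in the thread above (retry a bounded number of times, then fail the batch job) can be sketched like this; `run_with_retries` and the `max_retries=3` default are illustrative assumptions, not Beam's implementation:

```python
def run_with_retries(attempt, max_retries=3):
    """Invoke `attempt` (a zero-arg callable) up to max_retries times,
    re-raising the last error once the retry budget is exhausted."""
    last_err = None
    for _ in range(max_retries):
        try:
            return attempt()
        except Exception as e:  # real code would catch the specific job error
            last_err = e
    raise last_err

# A load job that always fails is attempted exactly three times, then
# the failure surfaces instead of being retried Integer.MAX_VALUE times.
calls = []
def failing_load_job():
    calls.append(1)
    raise RuntimeError("load job failed")

try:
    run_with_retries(failing_load_job)
except RuntimeError:
    pass
assert len(calls) == 3
```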
[jira] [Created] (BEAM-10411) Add an example for cross-language Kafka
Chamikara Madhusanka Jayalath created BEAM-10411: Summary: Add an example for cross-language Kafka Key: BEAM-10411 URL: https://issues.apache.org/jira/browse/BEAM-10411 Project: Beam Issue Type: Improvement Components: io-py-kafka Reporter: Chamikara Madhusanka Jayalath Assignee: Chamikara Madhusanka Jayalath -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-7014) Flake in gcsio.py / filesystemio.py - NotImplementedError: offset: 0, whence: 0
[ https://issues.apache.org/jira/browse/BEAM-7014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17154243#comment-17154243 ] Chamikara Madhusanka Jayalath commented on BEAM-7014: - Another possible occurrence: [https://stackoverflow.com/questions/62805319/dataflow-job-fails-with-httperror-notimplementederror/62809015#62809015] [~udim] will you be able to look into this? Possibly we need to further extend the seek() implementation here: [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/filesystemio.py#L286] > Flake in gcsio.py / filesystemio.py - NotImplementedError: offset: 0, whence: > 0 > --- > > Key: BEAM-7014 > URL: https://issues.apache.org/jira/browse/BEAM-7014 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Valentyn Tymofieiev >Priority: P1 > Labels: beam-fixit, flake > > The flake was observed in Precommit Direct Runner IT (wordcount). > Full log output: https://pastebin.com/raw/DP5J7Uch. > {noformat} > Traceback (most recent call last): > 08:42:57 File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify_PR/src/sdks/python/apache_beam/io/gcp/gcsio.py", > line 583, in _start_upload > 08:42:57 self._client.objects.Insert(self._insert_request, > upload=self._upload) > 08:42:57 File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify_PR/src/sdks/python/apache_beam/io/gcp/internal/clients/storage/storage_v1_client.py", > line 1154, in Insert > 08:42:57 upload=upload, upload_config=upload_config) > 08:42:57 File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify_PR/src/build/gradleenv/1327086738/local/lib/python2.7/site-packages/apitools/base/py/base_api.py", > line 715, in _RunMethod > 08:42:57 http_request, client=self.client) > 08:42:57 File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify_PR/src/build/gradleenv/1327086738/local/lib/python2.7/site-packages/apitools/base/py/transfer.py", > line 885, in 
InitializeUpload > 08:42:57 return self.StreamInChunks() > 08:42:57 File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify_PR/src/build/gradleenv/1327086738/local/lib/python2.7/site-packages/apitools/base/py/transfer.py", > line 997, in StreamInChunks > 08:42:57 additional_headers=additional_headers) > 08:42:57 File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify_PR/src/build/gradleenv/1327086738/local/lib/python2.7/site-packages/apitools/base/py/transfer.py", > line 948, in __StreamMedia > 08:42:57 self.RefreshResumableUploadState() > 08:42:57 File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify_PR/src/build/gradleenv/1327086738/local/lib/python2.7/site-packages/apitools/base/py/transfer.py", > line 850, in RefreshResumableUploadState > 08:42:57 self.stream.seek(self.progress) > 08:42:57 File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify_PR/src/sdks/python/apache_beam/io/filesystemio.py", > line 269, in seek > 08:42:57 offset, whence, self.position, self.last_position)) > 08:42:57 NotImplementedError: offset: 0, whence: 0, position: 48944, last: 0 > {noformat} > [~chamikara] Might have context to triage this. -- This message was sent by Atlassian Jira (v8.3.4#803005)
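For context on the `NotImplementedError` in the traceback above: the upload stream in `filesystemio.py` is forward-only, so `seek()` can only rewind to data it still holds and raises for anything else — which is exactly what the resumable-upload refresh triggers. A simplified sketch of that restriction (the class and field names are illustrative, not Beam's exact code):

```python
class ForwardOnlyStream:
    """Forward-only reader whose seek() only supports rewinding to the
    start of the most recently read chunk, mirroring the restriction that
    produces 'NotImplementedError: offset: ..., whence: ...' in
    filesystemio.py."""

    def __init__(self, data):
        self._data = data
        self._position = 0
        self._last_position = 0  # earliest offset still available

    def read(self, n):
        self._last_position = self._position
        chunk = self._data[self._position:self._position + n]
        self._position += len(chunk)
        return chunk

    def seek(self, offset, whence=0):
        # Only absolute seeks into the retained range are supported.
        if whence != 0 or not (self._last_position <= offset <= self._position):
            raise NotImplementedError(
                'offset: %d, whence: %d, position: %d, last: %d'
                % (offset, whence, self._position, self._last_position))
        self._position = offset

stream = ForwardOnlyStream(b"abcdefgh")
assert stream.read(4) == b"abcd"
stream.seek(2)        # still inside the retained range: fine
stream.read(4)        # advances; earlier bytes are discarded
try:
    stream.seek(0)    # a resumable-upload-style rewind past the buffer
    raised = False
except NotImplementedError:
    raised = True
assert raised
```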
[jira] [Commented] (BEAM-10638) Update documentation of BQ InsertRetryPolicy
[ https://issues.apache.org/jira/browse/BEAM-10638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17171091#comment-17171091 ] Chamikara Madhusanka Jayalath commented on BEAM-10638: -- Heejong, assigning to you since you are pretty familiar with this codepath. > Update documentation of BQ InsertRetryPolicy > > > Key: BEAM-10638 > URL: https://issues.apache.org/jira/browse/BEAM-10638 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Reporter: Udi Meiri >Assignee: Heejong Lee >Priority: P2 > > Clarify that it only applies to batches where BigQuery returns per-element > failures. > This may happen in cases where the entire batch is rejected, such as for > exceeding the 10MB size limit. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (BEAM-10638) Update documentation of BQ InsertRetryPolicy
[ https://issues.apache.org/jira/browse/BEAM-10638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Madhusanka Jayalath reassigned BEAM-10638: Assignee: Heejong Lee > Update documentation of BQ InsertRetryPolicy > > > Key: BEAM-10638 > URL: https://issues.apache.org/jira/browse/BEAM-10638 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Reporter: Udi Meiri >Assignee: Heejong Lee >Priority: P2 -- This message was sent by Atlassian Jira (v8.3.4#803005)
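The distinction this documentation issue wants clarified — the retry policy applies only when BigQuery returns per-element failures, not when the whole batch is rejected (e.g. for exceeding the 10MB request limit) — can be illustrated with a small sketch. `rows_to_retry` and its arguments are hypothetical names, not the BigQueryIO API:

```python
def rows_to_retry(batch, per_element_errors, should_retry):
    """Select rows to re-insert. `per_element_errors` is a list of
    (row_index, error) pairs from BigQuery, or None when the entire
    batch was rejected -- in that case the per-element retry policy
    is never consulted."""
    if per_element_errors is None:
        return list(batch)  # whole-batch rejection: policy bypassed
    return [batch[i] for i, err in per_element_errors if should_retry(err)]

batch = ["row0", "row1", "row2"]
# Per-element failures: the policy decides row by row.
errors = [(0, "timeout"), (2, "invalid")]
retry_transient = lambda err: err == "timeout"
assert rows_to_retry(batch, errors, retry_transient) == ["row0"]
# Whole batch rejected (errors is None): the policy is bypassed.
assert rows_to_retry(batch, None, retry_transient) == batch
```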
[jira] [Created] (BEAM-10655) NullPointerException when writing "google.protobuf.Timestamp" to BigQuery through schemas
Chamikara Madhusanka Jayalath created BEAM-10655: Summary: NullPointerException when writing "google.protobuf.Timestamp" to BigQuery through schemas Key: BEAM-10655 URL: https://issues.apache.org/jira/browse/BEAM-10655 Project: Beam Issue Type: Improvement Components: io-java-gcp Reporter: Chamikara Madhusanka Jayalath See here for details and instructions for reproducing. [https://lists.apache.org/thread.html/r1657ccf41dda3f2d9d082c5ebb006dd2da92863983971ad23485f16e%40%3Cdev.beam.apache.org%3E] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-10655) NullPointerException when writing "google.protobuf.Timestamp" to BigQuery through schemas
[ https://issues.apache.org/jira/browse/BEAM-10655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17172675#comment-17172675 ] Chamikara Madhusanka Jayalath commented on BEAM-10655: -- cc: [~robinyqiu] who has updated BigQueryIO + Beam Schema support recently. > NullPointerException when writing "google.protobuf.Timestamp" to BigQuery > through schemas > - > > Key: BEAM-10655 > URL: https://issues.apache.org/jira/browse/BEAM-10655 > Project: Beam > Issue Type: Improvement > Components: io-java-gcp >Reporter: Chamikara Madhusanka Jayalath >Priority: P2 > > See here for details and instructions for reproducing. > [https://lists.apache.org/thread.html/r1657ccf41dda3f2d9d082c5ebb006dd2da92863983971ad23485f16e%40%3Cdev.beam.apache.org%3E] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-9932) Add documentation describing cross-language test pipelines
[ https://issues.apache.org/jira/browse/BEAM-9932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Madhusanka Jayalath updated BEAM-9932: Fix Version/s: 2.24.0 > Add documentation describing cross-language test pipelines > -- > > Key: BEAM-9932 > URL: https://issues.apache.org/jira/browse/BEAM-9932 > Project: Beam > Issue Type: Improvement > Components: cross-language >Reporter: Chamikara Madhusanka Jayalath >Assignee: Kevin Sijo Puthusseri >Priority: P2 > Fix For: 2.24.0 > > Time Spent: 3h 50m > Remaining Estimate: 0h > > We designed cross-language test pipelines [1][2] based on the discussion in > [3]. > Adding some pydocs and Javadocs regarding the rationale behind each pipeline will > be helpful. > [1] > [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/validate_runner_xlang_test.py] > [2] > [https://github.com/apache/beam/blob/master/runners/core-construction-java/src/test/java/org/apache/beam/runners/core/construction/ValidateRunnerXlangTest.java] > [3] > [https://docs.google.com/document/d/1xQp0ElIV84b8OCVz8CD2hvbiWdR8w4BvWxPTZJZA6NA/edit?usp=sharing] > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-9932) Add documentation describing cross-language test pipelines
[ https://issues.apache.org/jira/browse/BEAM-9932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Madhusanka Jayalath updated BEAM-9932: Status: Resolved (was: In Progress) > Add documentation describing cross-language test pipelines > -- > > Key: BEAM-9932 > URL: https://issues.apache.org/jira/browse/BEAM-9932 > Project: Beam > Issue Type: Improvement > Components: cross-language >Reporter: Chamikara Madhusanka Jayalath >Assignee: Kevin Sijo Puthusseri >Priority: P2 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-10529) Kafka XLang fails for "empty" key/values
[ https://issues.apache.org/jira/browse/BEAM-10529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173425#comment-17173425 ] Chamikara Madhusanka Jayalath commented on BEAM-10529: -- cc: [~robertwb] [~bhulette] > Kafka XLang fails for "empty" key/values > > > Key: BEAM-10529 > URL: https://issues.apache.org/jira/browse/BEAM-10529 > Project: Beam > Issue Type: Bug > Components: io-java-kafka, sdk-py-core >Reporter: Luke Cwik >Priority: P2 > > It looks like the Javadoc for ByteArrayDeserializer and StringDeserializer > can return null[1, 2] and we aren't using > NullableCoder.of(ByteArrayCoder.of()) in the expansion[3]. Note that KafkaIO > does this correctly in its regular coder inference logic[4]. > 1: > https://kafka.apache.org/21/javadoc/org/apache/kafka/common/serialization/ByteArrayDeserializer.html#deserialize-java.lang.String-byte:A- > 2: > https://kafka.apache.org/21/javadoc/org/apache/kafka/common/serialization/StringDeserializer.html#deserialize-java.lang.String-byte:A- > 3: > https://github.com/apache/beam/blob/af2d6b0379d64b522ecb769d88e9e7e7b8900208/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L478 > 4: > https://github.com/apache/beam/blob/af2d6b0379d64b522ecb769d88e9e7e7b8900208/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/LocalDeserializerProvider.java#L85 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-10529) Kafka XLang fails for "empty" key/values
[ https://issues.apache.org/jira/browse/BEAM-10529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173431#comment-17173431 ] Chamikara Madhusanka Jayalath commented on BEAM-10529: -- Are the Python and Java nullable coders compatible? I.e., can you encode from Java and decode from Python, and vice versa? > Kafka XLang fails for "empty" key/values > > > Key: BEAM-10529 > URL: https://issues.apache.org/jira/browse/BEAM-10529 > Project: Beam > Issue Type: Bug > Components: io-java-kafka, sdk-py-core >Reporter: Luke Cwik >Priority: P2 -- This message was sent by Atlassian Jira (v8.3.4#803005)
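On the cross-language compatibility question above: a nullable coder can stay language-agnostic by writing a one-byte null marker before the payload, so either SDK can decode what the other wrote. The sketch below illustrates that scheme only; it is not guaranteed to match Beam's `NullableCoder` wire format byte for byte:

```python
def encode_nullable(value):
    """Marker byte 0x00 for None, else 0x01 followed by the raw bytes."""
    if value is None:
        return b"\x00"
    return b"\x01" + value

def decode_nullable(data):
    if data[:1] == b"\x00":
        return None
    return data[1:]

# Round-trips work regardless of which language produced the bytes, as
# long as both sides agree on the marker convention. Note that an empty
# byte string and null are distinct values.
assert decode_nullable(encode_nullable(None)) is None
assert decode_nullable(encode_nullable(b"")) == b""
assert decode_nullable(encode_nullable(b"key")) == b"key"
```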
[jira] [Updated] (BEAM-9922) Add Go SDK tests to cross-language Spark ValidatesRunner test suite
[ https://issues.apache.org/jira/browse/BEAM-9922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Madhusanka Jayalath updated BEAM-9922: Description: Test suite is here: [https://ci-beam.apache.org/job/beam_PostCommit_XVR_Spark/] was: Test suite is here: [https://builds.apache.org/view/A-D/view/Beam/view/PostCommit/job/beam_PostCommit_XVR_Spark/] > Add Go SDK tests to cross-language Spark ValidatesRunner test suite > --- > > Key: BEAM-9922 > URL: https://issues.apache.org/jira/browse/BEAM-9922 > Project: Beam > Issue Type: Sub-task > Components: sdk-go >Reporter: Chamikara Madhusanka Jayalath >Priority: P3 > > Test suite is here: > [https://ci-beam.apache.org/job/beam_PostCommit_XVR_Spark/] > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-9921) Add Go SDK tests to cross-language Flink ValidatesRunner test suite
[ https://issues.apache.org/jira/browse/BEAM-9921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Madhusanka Jayalath updated BEAM-9921: Description: Test suite is here: [https://ci-beam.apache.org/job/beam_PostCommit_XVR_Flink/] (was: Test suite is here: [https://builds.apache.org/view/A-D/view/Beam/view/PostCommit/job/beam_PostCommit_XVR_Flink/]) > Add Go SDK tests to cross-language Flink ValidatesRunner test suite > --- > > Key: BEAM-9921 > URL: https://issues.apache.org/jira/browse/BEAM-9921 > Project: Beam > Issue Type: Sub-task > Components: sdk-go >Reporter: Chamikara Madhusanka Jayalath >Priority: P3 > > Test suite is here: > [https://ci-beam.apache.org/job/beam_PostCommit_XVR_Flink/] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (BEAM-9918) Cross-language transforms support for Go SDK
[ https://issues.apache.org/jira/browse/BEAM-9918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Madhusanka Jayalath reassigned BEAM-9918: --- Assignee: (was: Kevin Sijo Puthusseri) > Cross-language transforms support for Go SDK > > > Key: BEAM-9918 > URL: https://issues.apache.org/jira/browse/BEAM-9918 > Project: Beam > Issue Type: New Feature > Components: cross-language, sdk-go >Reporter: Chamikara Madhusanka Jayalath >Priority: P2 > > This is an uber issue for tasks related to cross-language transforms support > for Go SDK. We can create sub-tasks as needed. > cc: [~lostluck] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (BEAM-10698) X-lang Kafka is broken for DataflowRunner due to timestamps being out of bound
Chamikara Madhusanka Jayalath created BEAM-10698: Summary: X-lang Kafka is broken for DataflowRunner due to timestamps being out of bound Key: BEAM-10698 URL: https://issues.apache.org/jira/browse/BEAM-10698 Project: Beam Issue Type: Bug Components: cross-language, io-py-kafka Reporter: Chamikara Madhusanka Jayalath Assignee: Boyuan Zhang Seems like this is a regression introduced by [https://github.com/apache/beam/pull/11749] I have a short term fix here: [https://github.com/apache/beam/pull/12557] We should either: (1) Include [https://github.com/apache/beam/pull/12557] in the release branch (2) Find the underlying issue that produces wrong timestamps and do that fix (3) Revert [https://github.com/apache/beam/pull/11749] from the 2.24.0 release branch. -- This message was sent by Atlassian Jira (v8.3.4#803005)
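For context on the "timestamps being out of bound" failure mode: Beam rejects element timestamps outside its global-window range. The sketch below is an illustration only, not the code in PR 12557; the bound constants assume Java's BoundedWindow.TIMESTAMP_MIN_VALUE / TIMESTAMP_MAX_VALUE, i.e. Long.MIN_VALUE / Long.MAX_VALUE microseconds truncated to milliseconds.

```python
# Illustrative only: clamp record timestamps (epoch millis) into Beam's
# valid event-time range before emitting elements downstream. The
# constants are an assumption based on Java's BoundedWindow bounds.
BEAM_TIMESTAMP_MIN_MILLIS = -9223372036854775
BEAM_TIMESTAMP_MAX_MILLIS = 9223372036854775


def clamp_to_beam_bounds(event_time_millis: int) -> int:
    """Return the timestamp unchanged if in range, else the nearest bound."""
    return max(BEAM_TIMESTAMP_MIN_MILLIS,
               min(BEAM_TIMESTAMP_MAX_MILLIS, event_time_millis))
```

A workaround along these lines trades a hard pipeline failure for silently pinning bad timestamps to the boundary, which is why option (2), finding the source of the wrong timestamps, is listed alongside the short-term fix.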
[jira] [Updated] (BEAM-10698) X-lang Kafka is broken for DataflowRunner due to timestamps being out of bound
[ https://issues.apache.org/jira/browse/BEAM-10698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Madhusanka Jayalath updated BEAM-10698: - Fix Version/s: 2.24.0 > X-lang Kafka is broken for DataflowRunner due to timestamps being out of bound > -- > > Key: BEAM-10698 > URL: https://issues.apache.org/jira/browse/BEAM-10698 > Project: Beam > Issue Type: Bug > Components: cross-language, io-py-kafka >Reporter: Chamikara Madhusanka Jayalath >Assignee: Boyuan Zhang >Priority: P1 > Fix For: 2.24.0 > > > Seems like this is a regression introduced by > [https://github.com/apache/beam/pull/11749] > I have a short term fix here: [https://github.com/apache/beam/pull/12557] > We should either. > (1) Include [https://github.com/apache/beam/pull/12557] in the release branch > (2) Find the underlying issue that produces wrong timestamps and do that fix > (3) Revert [https://github.com/apache/beam/pull/11749] from the 2.24.0 > release branch. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-10698) SDFs broken for DataflowRunner o UW due to timestamps being out of bound
[ https://issues.apache.org/jira/browse/BEAM-10698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Madhusanka Jayalath updated BEAM-10698: - Summary: SDFs broken for DataflowRunner o UW due to timestamps being out of bound (was: X-lang Kafka is broken for DataflowRunner due to timestamps being out of bound) > SDFs broken for DataflowRunner o UW due to timestamps being out of bound > > > Key: BEAM-10698 > URL: https://issues.apache.org/jira/browse/BEAM-10698 > Project: Beam > Issue Type: Bug > Components: cross-language, io-py-kafka >Reporter: Chamikara Madhusanka Jayalath >Assignee: Boyuan Zhang >Priority: P1 > Fix For: 2.24.0 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > Seems like this is a regression introduced by > [https://github.com/apache/beam/pull/11749] > I have a short term fix here: [https://github.com/apache/beam/pull/12557] > We should either. > (1) Include [https://github.com/apache/beam/pull/12557] in the release branch > (2) Find the underlying issue that produces wrong timestamps and do that fix > (3) Revert [https://github.com/apache/beam/pull/11749] from the 2.24.0 > release branch. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-10698) SDFs broken for DataflowRunner o UW due to timestamps being out of bound
[ https://issues.apache.org/jira/browse/BEAM-10698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Madhusanka Jayalath updated BEAM-10698: - Priority: P0 (was: P1) > SDFs broken for DataflowRunner o UW due to timestamps being out of bound > > > Key: BEAM-10698 > URL: https://issues.apache.org/jira/browse/BEAM-10698 > Project: Beam > Issue Type: Bug > Components: cross-language, io-py-kafka >Reporter: Chamikara Madhusanka Jayalath >Assignee: Boyuan Zhang >Priority: P0 > Fix For: 2.24.0 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > Seems like this is a regression introduced by > [https://github.com/apache/beam/pull/11749] > I have a short term fix here: [https://github.com/apache/beam/pull/12557] > We should either. > (1) Include [https://github.com/apache/beam/pull/12557] in the release branch > (2) Find the underlying issue that produces wrong timestamps and do that fix > (3) Revert [https://github.com/apache/beam/pull/11749] from the 2.24.0 > release branch. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-10698) SDFs broken for Dataflow runner v2 due to timestamps being out of bound
[ https://issues.apache.org/jira/browse/BEAM-10698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Madhusanka Jayalath updated BEAM-10698: - Summary: SDFs broken for Dataflow runner v2 due to timestamps being out of bound (was: SDFs broken for DataflowRunner o UW due to timestamps being out of bound) > SDFs broken for Dataflow runner v2 due to timestamps being out of bound > --- > > Key: BEAM-10698 > URL: https://issues.apache.org/jira/browse/BEAM-10698 > Project: Beam > Issue Type: Bug > Components: cross-language, io-py-kafka >Reporter: Chamikara Madhusanka Jayalath >Assignee: Boyuan Zhang >Priority: P0 > Fix For: 2.24.0 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > Seems like this is a regression introduced by > [https://github.com/apache/beam/pull/11749] > I have a short term fix here: [https://github.com/apache/beam/pull/12557] > We should either. > (1) Include [https://github.com/apache/beam/pull/12557] in the release branch > (2) Find the underlying issue that produces wrong timestamps and do that fix > (3) Revert [https://github.com/apache/beam/pull/11749] from the 2.24.0 > release branch. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-10698) SDFs broken for Dataflow runner v2 due to timestamps being out of bound
[ https://issues.apache.org/jira/browse/BEAM-10698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Madhusanka Jayalath updated BEAM-10698: - Component/s: runner-dataflow > SDFs broken for Dataflow runner v2 due to timestamps being out of bound > --- > > Key: BEAM-10698 > URL: https://issues.apache.org/jira/browse/BEAM-10698 > Project: Beam > Issue Type: Bug > Components: cross-language, io-py-kafka, runner-dataflow >Reporter: Chamikara Madhusanka Jayalath >Assignee: Boyuan Zhang >Priority: P0 > Fix For: 2.24.0 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > Seems like this is a regression introduced by > [https://github.com/apache/beam/pull/11749] > I have a short term fix here: [https://github.com/apache/beam/pull/12557] > We should either. > (1) Include [https://github.com/apache/beam/pull/12557] in the release branch > (2) Find the underlying issue that produces wrong timestamps and do that fix > (3) Revert [https://github.com/apache/beam/pull/11749] from the 2.24.0 > release branch. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-10716) PubSub related tests failing due to subscriptions-per-project exceeded limit:10000
[ https://issues.apache.org/jira/browse/BEAM-10716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17179988#comment-17179988 ] Chamikara Madhusanka Jayalath commented on BEAM-10716: -- I ran into this a number of times when running Python PreCommit. For example, [https://ci-beam.apache.org/job/beam_PreCommit_Python_Phrase/2108/]
ERROR: test_streaming_wordcount_it (apache_beam.examples.streaming_wordcount_it_test.StreamingWordCountIT)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Phrase/src/sdks/python/apache_beam/examples/streaming_wordcount_it_test.py", line 65, in setUp
    self.input_topic.name)
  File "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Phrase/src/build/gradleenv/-1734967052/lib/python3.7/site-packages/google/cloud/pubsub_v1/_gapic.py", line 40, in <lambda>
    fx = lambda self, *a, **kw: wrapped_fx(self.api, *a, **kw)  # noqa
  File "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Phrase/src/build/gradleenv/-1734967052/lib/python3.7/site-packages/google/cloud/pubsub_v1/gapic/subscriber_client.py", line 439, in create_subscription
    request, retry=retry, timeout=timeout, metadata=metadata
  File "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Phrase/src/build/gradleenv/-1734967052/lib/python3.7/site-packages/google/api_core/gapic_v1/method.py", line 145, in __call__
    return wrapped_func(*args, **kwargs)
  File "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Phrase/src/build/gradleenv/-1734967052/lib/python3.7/site-packages/google/api_core/retry.py", line 286, in retry_wrapped_func
    on_error=on_error,
  File "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Phrase/src/build/gradleenv/-1734967052/lib/python3.7/site-packages/google/api_core/retry.py", line 184, in retry_target
    return target()
  File "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Phrase/src/build/gradleenv/-1734967052/lib/python3.7/site-packages/google/api_core/timeout.py", line 214, in func_with_timeout
    return func(*args, **kwargs)
  File "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Phrase/src/build/gradleenv/-1734967052/lib/python3.7/site-packages/google/api_core/grpc_helpers.py", line 59, in error_remapped_callable
    six.raise_from(exceptions.from_grpc_error(exc), exc)
  File "<string>", line 3, in raise_from
google.api_core.exceptions.ResourceExhausted: 429 Your project has exceeded a limit: (type="subscriptions-per-project", current=10000, maximum=10000). > PubSub related tests failing due to subscriptions-per-project exceeded > limit:10000 > -- > > Key: BEAM-10716 > URL: https://issues.apache.org/jira/browse/BEAM-10716 > Project: Beam > Issue Type: Bug > Components: test-failures >Reporter: Robin Qiu >Assignee: Alan Myrvold >Priority: P1 > Labels: currently-failing > > Example: > [https://ci-beam.apache.org/job/beam_PostCommit_Python37/2729/#showFailuresLink] > -- This message was sent by Atlassian Jira (v8.3.4#803005)
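Until the quota problem is fixed at the source, leaked test subscriptions have to be garbage-collected. A hedged sketch of the selection step is below; the "wc_subscription_" prefix is a hypothetical example, not necessarily what the failing tests generate, and the actual deletion would go through something like google-cloud-pubsub's SubscriberClient, not this helper.

```python
# Illustrative helper: pick out leaked test subscriptions by name prefix.
# The prefix is a hypothetical example; confirm what the tests actually
# create before deleting anything.
def find_leaked_subscriptions(subscription_paths, prefix="wc_subscription_"):
    """Return full paths whose short name starts with the test prefix."""
    candidates = []
    for path in subscription_paths:
        # Paths look like: projects/<project>/subscriptions/<name>
        short_name = path.rsplit("/", 1)[-1]
        if short_name.startswith(prefix):
            candidates.append(path)
    return candidates
```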
[jira] [Assigned] (BEAM-9918) Cross-language transforms support for Go SDK
[ https://issues.apache.org/jira/browse/BEAM-9918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Madhusanka Jayalath reassigned BEAM-9918: --- Assignee: Kevin Sijo Puthusseri > Cross-language transforms support for Go SDK > > > Key: BEAM-9918 > URL: https://issues.apache.org/jira/browse/BEAM-9918 > Project: Beam > Issue Type: New Feature > Components: cross-language, sdk-go >Reporter: Chamikara Madhusanka Jayalath >Assignee: Kevin Sijo Puthusseri >Priority: P2 > > This is an uber issue for tasks related to cross-language transforms support > for Go SDK. We can create sub-tasks as needed. > cc: [~lostluck] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (BEAM-6485) Cross-language transform expansion protocol
[ https://issues.apache.org/jira/browse/BEAM-6485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Madhusanka Jayalath resolved BEAM-6485. - Fix Version/s: Not applicable Resolution: Fixed > Cross-language transform expansion protocol > > > Key: BEAM-6485 > URL: https://issues.apache.org/jira/browse/BEAM-6485 > Project: Beam > Issue Type: New Feature > Components: beam-model, runner-core, sdk-java-core, sdk-py-core >Reporter: Chamikara Madhusanka Jayalath >Priority: P3 > Fix For: Not applicable > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (BEAM-10271) Remove experimental annotations from stabilized GCP connector APIs
Chamikara Madhusanka Jayalath created BEAM-10271: Summary: Remove experimental annotations from stabilized GCP connector APIs Key: BEAM-10271 URL: https://issues.apache.org/jira/browse/BEAM-10271 Project: Beam Issue Type: Improvement Components: io-java-gcp, io-py-gcp Reporter: Chamikara Madhusanka Jayalath -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (BEAM-9932) Add documentation describing cross-language test pipelines
[ https://issues.apache.org/jira/browse/BEAM-9932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Madhusanka Jayalath reassigned BEAM-9932: --- Assignee: Kevin Sijo Puthusseri > Add documentation describing cross-language test pipelines > -- > > Key: BEAM-9932 > URL: https://issues.apache.org/jira/browse/BEAM-9932 > Project: Beam > Issue Type: Improvement > Components: cross-language >Reporter: Chamikara Madhusanka Jayalath >Assignee: Kevin Sijo Puthusseri >Priority: P2 > > We designed cross-language test pipelines [1][2] based on the discussion in > [3]. > Adding some pydocs and Java docs regarding rational behind each pipeline will > be helpful. > [1] > [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/validate_runner_xlang_test.py] > [2] > [https://github.com/apache/beam/blob/master/runners/core-construction-java/src/test/java/org/apache/beam/runners/core/construction/ValidateRunnerXlangTest.java] > [3] > [https://docs.google.com/document/d/1xQp0ElIV84b8OCVz8CD2hvbiWdR8w4BvWxPTZJZA6NA/edit?usp=sharing] > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-6514) Dataflow Batch Job Failure is leaving Datasets/Tables behind in BigQuery
[ https://issues.apache.org/jira/browse/BEAM-6514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17139950#comment-17139950 ] Chamikara Madhusanka Jayalath commented on BEAM-6514: - This is because there's currently no good way to clean up after failing Dataflow/Beam jobs. If the job is successful, the BQ connector will perform the cleanup; if the job fails, cleanup may have to be done manually. > Dataflow Batch Job Failure is leaving Datasets/Tables behind in BigQuery > > > Key: BEAM-6514 > URL: https://issues.apache.org/jira/browse/BEAM-6514 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Reporter: Rumeshkrishnan Mohan >Assignee: Chamikara Madhusanka Jayalath >Priority: P2 > Labels: stale-assigned > > Dataflow is leaving Datasets/Tables behind in BigQuery when the pipeline is > cancelled or when it fails. I cancelled a job or it failed at run time, and > it left behind a dataset and table in BigQuery. > # The `cleanupTempResource` method cleans up tables and datasets after the > batch job succeeds. > # If the job fails midway or is cancelled explicitly, the temporary dataset > and tables remain. I do see the table expiry period of 1 day in the > `getTableToExtract` function in BigQueryQuerySource.java. > # I can understand keeping temp tables and datasets on failure for debugging. > # Can we have optional pipeline or job parameters that clean up the temporary > dataset and tables on cancellation or failure? -- This message was sent by Atlassian Jira (v8.3.4#803005)
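For anyone doing the manual cleanup mentioned above, a minimal sketch of the selection step: list the project's dataset IDs and pick out the ones matching Beam's temporary-dataset naming. The "temp_dataset_" prefix is an assumption about Beam's naming scheme; verify it against the names your Beam version actually creates before deleting, e.g. via google-cloud-bigquery's `client.delete_dataset(dataset_id, delete_contents=True)`.

```python
# Illustrative helper: select Beam's temporary BigQuery datasets for
# manual cleanup. The prefix is an assumption about Beam's naming;
# confirm it against your project before deleting anything.
def temp_datasets_to_delete(dataset_ids, prefix="temp_dataset_"):
    """Return the dataset IDs that look like Beam temp datasets, sorted."""
    return sorted(d for d in dataset_ids if d.startswith(prefix))
```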