Re: KafkaIO metric publishing

2024-06-19 Thread XQ Hu via user
Is your job a Dataflow Template job? The error is caused by https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/util/DataflowTemplateJob.java#L55: DataflowTemplateJob simply does not support querying metrics.

Re: Apache Beam on GCP / Forcing to use py 3.11

2024-06-16 Thread XQ Hu via user
> I was using a setup.py as well, but then I commented out the usage in the Dockerfile after checking some flex templates, which said it is not needed.

Re: Apache Beam on GCP / Forcing to use py 3.11

2024-06-12 Thread XQ Hu via user
> Step #0 - "build-shareloader-template": Step #4 - "dftester-image": section = defn.get("tool", {})[tool_name]
> Step #0 - "build-shareloader-template": Step #4 - "dftester-image": ~~~~~~~~^^^^^^^

Re: Apache Beam on GCP / Forcing to use py 3.11

2024-06-11 Thread XQ Hu via user
age": > KeyError: 'setuptools_scm' > Step #0 - "build-shareloader-template": Step #4 - "dftester-image": > running bdist_wheel > > > > > It is somehow getting messed up with a toml ? > > > Could anyone advise? > > thank

Re: Apache Beam on GCP / Forcing to use py 3.11

2024-06-10 Thread XQ Hu via user
https://github.com/GoogleCloudPlatform/python-docs-samples/tree/main/dataflow/flex-templates/pipeline_with_dependencies is a great example.

Re: Beam + VertexAI

2024-06-09 Thread XQ Hu via user
If you have a Vertex AI model, try https://cloud.google.com/dataflow/docs/notebooks/run_inference_vertex_ai. If you want to use the Vertex AI model to do text embedding, try https://cloud.google.com/dataflow/docs/notebooks/vertex_ai_text_embeddings.
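For the first case, a minimal RunInference sketch with the Vertex AI model handler; the endpoint ID, project, region, and the request payload are placeholders, and the exact instance format depends on the deployed model:

import apache_beam as beam
from apache_beam.ml.inference.base import RunInference
from apache_beam.ml.inference.vertex_ai_inference import VertexAIModelHandlerJSON

# Placeholder values: substitute your own endpoint, project, and region.
model_handler = VertexAIModelHandlerJSON(
    endpoint_id="1234567890",
    project="my-gcp-project",
    location="us-central1",
)

with beam.Pipeline() as p:
    _ = (
        p
        | "CreateExamples" >> beam.Create([{"prompt": "hello"}])  # model-specific payload
        | "Predict" >> RunInference(model_handler)
        | "Print" >> beam.Map(print)
    )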

Re: Question: Pipelines Stuck with Java 21 and BigQuery Storage Write API

2024-06-03 Thread XQ Hu via user
Probably related to the strict encapsulation that is enforced with Java 21. Using `--add-opens=java.base/java.lang=ALL-UNNAMED` as a JVM flag could be a temporary workaround.
> I encountered an UnsupportedOperationException when using Java 21.

Re: Query about auto-inference of numPartitions for `JdbcIO#readWithPartitions`

2024-05-31 Thread XQ Hu via user
You should be able to configure the number of partitions like this: https://github.com/GoogleCloudPlatform/dataflow-cookbook/blob/main/Java/src/main/java/jdbc/ReadPartitionsJdbc.java#L132. The code that auto-infers the number of partitions seems to be unreachable (I haven't checked this carefully).

Re: Error handling for GCP Pub/Sub on Dataflow using Python

2024-05-25 Thread XQ Hu via user
I do not suggest handling this in beam.io.WriteToPubSub. You could add one transform to your pipeline that checks the message size: if it is beyond 10 MB, route the message to another sink or process it to reduce the size.
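For example, a rough sketch of that routing idea using tagged outputs; the 10 MB threshold is Pub/Sub's per-message limit, while the message payloads and the fallback sink are placeholders:

import apache_beam as beam
from apache_beam import pvalue

MAX_PUBSUB_BYTES = 10 * 1024 * 1024

class RouteBySize(beam.DoFn):
    # Emit small messages on the main output and oversized ones on 'too_big'.
    def process(self, msg: bytes):
        if len(msg) > MAX_PUBSUB_BYTES:
            yield pvalue.TaggedOutput("too_big", msg)
        else:
            yield msg

with beam.Pipeline() as p:
    routed = (
        p
        | "CreateMessages" >> beam.Create([b"small", b"x" * (11 * 1024 * 1024)])
        | "RouteBySize" >> beam.ParDo(RouteBySize()).with_outputs("too_big", main="ok")
    )
    _ = routed.ok | "SmallMessages" >> beam.Map(print)        # stand-in for WriteToPubSub
    _ = routed.too_big | "OversizedMessages" >> beam.Map(print)  # e.g. write to GCS instead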

Re: Question: Java Apache Beam, mock external Clients initialized in Setup

2024-05-25 Thread XQ Hu via user
I am not sure which part you want to test. If the processData part should be tested, you could refactor the code so it does not use any Beam-specific code and test the data-processing logic on its own. From your example, it seems that you are calling some APIs; we recently added a new Web API IO: https://beam.apache.org/documentation/io/built-in/webapis/

Re: Fails to deploy a python pipeline to a flink cluster

2024-05-11 Thread XQ Hu via user
Do you still have the same issue? I tried to follow your setup.sh to reproduce this but somehow I am stuck at the word_len step. I saw you also tried to use `print(kafka_kv)` to debug it. I am not sure about your current status.

Re: Pipeline gets stuck when chaining two SDFs (Python SDK)

2024-05-05 Thread XQ Hu via user
> It works with the FlinkRunner. Thank you so much! Cheers, Jaehyeon
>> Have you tried to use other runners? I think this might be caused by some gaps in the Python DirectRunner.

Re: Pipeline gets stuck when chaining two SDFs (Python SDK)

2024-05-05 Thread XQ Hu via user
> Cheers, Jaehyeon
>> I played with your example. Indeed, create_tracker in your ProcessFilesFn is never called, which is quite strange. I could not find any example that shows the chained SDFs, which makes me wonder whether the chained SDFs work.

Re: Pipeline gets stuck when chaining two SDFs (Python SDK)

2024-05-04 Thread XQ Hu via user
I played with your example. Indeed, create_tracker in your ProcessFilesFn is never called, which is quite strange. I could not find any example that shows the chained SDFs, which makes me wonder whether the chained SDFs work. @Chamikara Jayalath Any thoughts?

Re: Any recommendation for key for GroupIntoBatches

2024-04-15 Thread XQ Hu via user
> What is not clear to me is what the advantages of using it are. Is it only the error/retry handling? Anything in terms of performance? My PCollection is unbounded, but I was thinking of sending my messages in batches to the external API.

Re: Any recommendation for key for GroupIntoBatches

2024-04-15 Thread XQ Hu via user
> I was thinking of sending my messages in batches to the external API in order to gain some performance (don't expect to send 1 HTTP request per message). Thank you very much for all your responses!

Re: Any recommendation for key for GroupIntoBatches

2024-04-14 Thread XQ Hu via user
To enrich your data, have you checked https://cloud.google.com/dataflow/docs/guides/enrichment? This transform is built on top of https://beam.apache.org/documentation/io/built-in/webapis/
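For the original GroupIntoBatches question, a minimal sketch of keying the elements and batching them before an external call; the key function (sharding into 10 keys) and the batch size are illustrative assumptions rather than a recommendation for any particular workload:

import apache_beam as beam

BATCH_SIZE = 50  # illustrative

def call_external_api(batch):
    # Stand-in for one HTTP request carrying a whole batch of messages.
    print(f"sending {len(batch)} messages")
    return batch

with beam.Pipeline() as p:
    _ = (
        p
        | "CreateMessages" >> beam.Create([{"id": i} for i in range(200)])
        # Any reasonably well-distributed key works; sharding into 10 keys here
        # just allows some parallelism while still forming batches.
        | "AddKey" >> beam.WithKeys(lambda msg: msg["id"] % 10)
        | "Batch" >> beam.GroupIntoBatches(BATCH_SIZE)
        | "DropKey" >> beam.Values()
        | "CallApi" >> beam.Map(call_external_api)
    )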

Re: Is Pulsar IO Connector Officially Supported?

2024-04-10 Thread XQ Hu via user
Sounds like a good idea to add a new section. Let me chat with the team about that. Thanks.
> Pulsar IO did not change much since it was originally added in 2022. You can find out about the gaps in this presentation.

Re: Is Pulsar IO Connector Officially Supported?

2024-04-10 Thread XQ Hu via user
I think PulsarIO needs more work to be polished.
> I see that a Pulsar connector was made available as of the Beam 2.38.0 release, but I don't see Pulsar as an official connector on the page below. Is the Pulsar IO connector officially supported?

Re: how to enable debugging mode for python worker harness

2024-03-31 Thread XQ Hu via user
> ... you can share what you've changed (maybe a PR) so that I can test again on my Linux machine. Thanks so much for your help. There's someone else also pinging me on the same error when testing, and I do want to make this work for everyone. Thanks!

Re: DLQ Implementation

2024-03-27 Thread XQ Hu via user
You can check https://github.com/search?q=repo%3Ajohnjcasey%2Fbeam%20withBadRecordErrorHandler&type=code. The test code shows how to use it. More documentation will be added later.

Re: What's the current status of pattern matching with Beam SQL?

2024-03-24 Thread XQ Hu via user
> ... e.g. Apache Flink CEP. Both tickets have an associated GitHub issue and no update for more than 1 year, which means they are not likely to be completed in the near future? Cheers, Jaehyeon

Re: Specific use-case question - Kafka-to-GCS-avro-Python

2024-03-23 Thread XQ Hu via user
> ... (when deploying with DataflowRunner). So obviously, there is no global window or default trigger. That's, I believe, what's described in the issue: https://github.com/apache/beam/issues/25598

Re: What's the current status of pattern matching with Beam SQL?

2024-03-23 Thread XQ Hu via user
See https://beam.apache.org/documentation/dsls/sql/zetasql/overview/ and https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/zetasql/src/main/java/org/apache/beam/sdk/extensions/sql/zetasql/SupportedZetaSqlBuiltinFunctions.java for the supported functions.

Re: how to enable debugging mode for python worker harness

2024-03-18 Thread XQ Hu via user
> ... related to only the image, thanks. The goal for this repo is to complete my previous talk: https://www.youtube.com/watch?v=XUz90LpGAgc

Re: how to enable debugging mode for python worker harness

2024-03-17 Thread XQ Hu via user
I cloned your repo on my Linux machine, which is super useful to run. Not sure why you use Beam 2.41, but anyway, I tried this on my Linux machine:

python t.py \
  --topic test --group test-group --bootstrap-server localhost:9092 \
  --job_endpoint localhost:8099 \
  --artifact_endpoint

Re: java.lang.ClassCastException: class java.lang.String cannot be cast to class...

2024-03-17 Thread XQ Hu via user
Here is what I did, including how I set up the portable runner with Flink:
1. Start the local Flink cluster.
2. Start the Flink job server and point it to that local cluster: docker run --net=host apache/beam_flink1.16_job_server:latest --flink-master=localhost:8081
3. I use these pipeline options (see the sketch below).
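A plausible set of Python pipeline options for this portable-runner setup, as a sketch; the job and artifact endpoints are the usual job-server defaults, and everything else here is an assumption:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Assumed options for submitting to the Flink job server started above.
options = PipelineOptions([
    "--runner=PortableRunner",
    "--job_endpoint=localhost:8099",
    "--artifact_endpoint=localhost:8098",
    "--environment_type=LOOPBACK",
])

with beam.Pipeline(options=options) as p:
    _ = p | beam.Create([1, 2, 3]) | beam.Map(print)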

Re: Specific use-case question - Kafka-to-GCS-avro-Python

2024-03-13 Thread XQ Hu via user
Can you explain more about "current sinks for Avro and Parquet with the destination of GCS are not supported"? We do have AvroIO and ParquetIO (https://beam.apache.org/documentation/io/connectors/) in Python.

Re: How to change SQL dialect on beam_sql magic?

2024-03-08 Thread XQ Hu via user
I do not think the dialect argument is exposed here: https://github.com/apache/beam/blob/a391198b5a632238dc4a9298e635bb5eb0f433df/sdks/python/apache_beam/runners/interactive/sql/beam_sql_magics.py#L293. Two options: 1) create a feature request and a PR to add that; 2) switch to SqlTransform.
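A minimal sketch of the SqlTransform route; the dialect argument and the toy schema are assumptions about what the original pipeline needs, and SqlTransform needs a Java expansion service available at runtime:

import apache_beam as beam
from apache_beam.transforms.sql import SqlTransform

with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create([beam.Row(id=1, name="a"), beam.Row(id=2, name="b")])
        # Unlike the beam_sql magic, SqlTransform takes a dialect argument.
        | SqlTransform("SELECT id, name FROM PCOLLECTION", dialect="zetasql")
        | beam.Map(print)
    )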

Re: [Question] Python Streaming Pipeline Support

2024-03-08 Thread XQ Hu via user
Is this what you are looking for?

import random
import time

import apache_beam as beam
from apache_beam.transforms import trigger, window
from apache_beam.transforms.periodicsequence import PeriodicImpulse
from apache_beam.utils.timestamp import Timestamp

with beam.Pipeline() as p:
    input =
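A completed version of the same idea, as a sketch; the firing interval, run length, window size, and element payload are all illustrative values, not the original poster's settings:

import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.transforms.periodicsequence import PeriodicImpulse
from apache_beam.utils.timestamp import Timestamp

start = Timestamp.now()

with beam.Pipeline() as p:
    _ = (
        p
        # Fire one element every 10 seconds for 5 minutes.
        | "Impulse" >> PeriodicImpulse(
            start_timestamp=start,
            stop_timestamp=start + 300,
            fire_interval=10)
        | "ToDict" >> beam.Map(lambda ts: {"event_time": ts})
        | "Window" >> beam.WindowInto(window.FixedWindows(30))
        | "Print" >> beam.Map(print)
    )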

Re: Cross Language Runtime error python-Java

2024-02-24 Thread XQ Hu via user
> ... /opt/apache/beam/jars /opt/apache/beam/jars
> COPY --from=apache/beam_java11_sdk:latest /opt/java/openjdk /opt/java/openjdk
> ENV JAVA_HOME=/opt/java/openjdk
> ENV PATH="${JAVA_HOME}/bin:${PATH}"

Re: Problem in jdbc connector with autoincrement value

2024-02-24 Thread XQ Hu via user
Here is what I did:

CREATE TABLE IF NOT EXISTS test2 (
  id bigint DEFAULT nextval('sap_tm_customer_id_seq'::regclass) NOT NULL,
  name VARCHAR(10),
  load_date_time TIMESTAMP
)

Make sure id cannot be NULL (you might not need this). I then wrote my data without using the id field; a sketch of the idea follows below.
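A sketch of how that might look with the Python JDBC connector, assuming a NamedTuple row type that simply omits the autoincremented id column; the table name matches the DDL above, while the JDBC URL and credentials are placeholders:

import typing

import apache_beam as beam
from apache_beam import coders
from apache_beam.io.jdbc import WriteToJdbc

# Row type without the autoincremented id column.
class TestRow(typing.NamedTuple):
    name: str

coders.registry.register_coder(TestRow, coders.RowCoder)

with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create([TestRow(name="abc"), TestRow(name="def")])
        | WriteToJdbc(
            table_name="test2",
            driver_class_name="org.postgresql.Driver",
            jdbc_url="jdbc:postgresql://localhost:5432/mydb",  # placeholder
            username="user",      # placeholder
            password="password",  # placeholder
        )
    )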

Re: Cross Language Runtime error python-Java

2024-02-24 Thread XQ Hu via user
Does your code work without the launcher? Better to check this step by step to figure out which part causes this error.
> I have a Python pipeline that uses the BigQuery Storage Write method (cross-language with Java).

Re: Query about `JdbcIO`

2024-02-24 Thread XQ Hu via user
I did not find BEAM-13846, but this suggests String is never supported: https://github.com/apache/beam/blob/master/sdks/java/io/jdbc/src/test/java/org/apache/beam/sdk/io/jdbc/JdbcUtilTest.java#L59. However, you could use the code from the test to create your own.

Re: Does withkeys transform enforce a reshuffle?

2024-01-19 Thread XQ Hu via user
I do not think it enforces a reshuffle, judging by the doc here: https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.util.html?highlight=withkeys#apache_beam.transforms.util.WithKeys. Have you tried just adding a Reshuffle after PubsubLiteIO?
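A small sketch of that suggestion; the Create step is a stand-in for the PubsubLiteIO read, and the key function is an arbitrary assumption:

import apache_beam as beam

with beam.Pipeline() as p:
    _ = (
        p
        | "ReadMessages" >> beam.Create([b"m1", b"m2", b"m3"])  # stand-in for PubsubLiteIO
        | "AddKeys" >> beam.WithKeys(lambda msg: msg[:1])
        # WithKeys alone does not force a redistribution; add an explicit Reshuffle.
        | "Reshuffle" >> beam.Reshuffle()
        | "Process" >> beam.Map(print)
    )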

Re: Beam 2.53.0 Release

2024-01-05 Thread XQ Hu via user
Great! And thank you!
> We are happy to present the new 2.53.0 release of Beam. This release includes both improvements and new functionality. For more information on changes in 2.53.0, check out the detailed release notes.

Re: Removing old dataflow jobs

2024-01-02 Thread XQ Hu via user
You can archive jobs now: https://cloud.google.com/dataflow/docs/guides/stopping-a-pipeline#archive
> There is no requirement to delete old jobs.

Re: [Question] S3 Token Expiration during Read Step

2023-12-22 Thread XQ Hu via user
Can you share some code snippets about how you read from S3? Do you use the built-in TextIO?
> I am a developer trying to use Apache Beam, and I have a nuanced problem I need help with. I have a pipeline which has to read from S3.

Re: [Question] WaitOn for Reading Step

2023-12-22 Thread XQ Hu via user
When I search the Beam code base, there are plenty of places that use Wait.on. You could check that code for some insights. If this doesn't work, it would be better to create a small test case to reproduce the problem and open a GitHub issue. Sorry, I cannot help too much with this.

Re: Environmental variables not accessible in Dataflow pipeline

2023-12-22 Thread XQ Hu via user
You can use the same Docker image for both the template launcher and the Dataflow job. Here is one example: https://github.com/google/dataflow-ml-starter/blob/main/tensorflow_gpu.flex.Dockerfile#L60

Re: Environmental variables not accessible in Dataflow pipeline

2023-12-20 Thread XQ Hu via user
Dataflow VMs cannot see your local environment variables. I think you should use a custom container: https://cloud.google.com/dataflow/docs/guides/using-custom-containers. Here is a sample project: https://github.com/google/dataflow-ml-starter

Re: Specifying dataflow template location with Apache beam Python SDK

2023-12-18 Thread XQ Hu via user
https://github.com/google/dataflow-ml-starter/tree/main?tab=readme-ov-file#run-the-beam-pipeline-with-dataflow-flex-templates has a full example of how to create your own flex template. FYI.

Re: Beam 2.52.0 Release

2023-11-18 Thread XQ Hu via user
Thanks a lot! Great job, team!
> I am happy to announce that the 2.52.0 release of Beam has been finalized. This release includes both improvements and new functionality.

Re: [QUESTION] Why no auto labels?

2023-10-03 Thread XQ Hu via user
That suggests the default label is created that way, which indeed causes the duplication error.
> Not sure what that suggests

Re: [QUESTION] Why no auto labels?

2023-10-03 Thread XQ Hu via user
Looks like this is the current behaviour. If you have `t = beam.Filter(identity_filter)`, `t.label` is defined as `Filter(identity_filter)`.
> You don't have to specify the names if the callable you pass in is /different/ for two `beam.Map`s
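A quick illustration of that label behaviour and the usual workaround of naming the transforms explicitly; identity_filter here is just an assumed pass-through predicate:

import apache_beam as beam

def identity_filter(x):
    # Assumed stand-in for the predicate discussed in the thread.
    return True

with beam.Pipeline() as p:
    nums = p | beam.Create([1, 2, 3])
    # Two unnamed Filters over the same callable would both get the auto label
    # "Filter(identity_filter)" and collide, so give each one an explicit label.
    kept_once = nums | "KeepAll" >> beam.Filter(identity_filter)
    kept_twice = nums | "KeepAllAgain" >> beam.Filter(identity_filter)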

Re: Pandas 2 Timeline Estimate

2023-07-12 Thread XQ Hu via user
https://github.com/apache/beam/issues/27221#issuecomment-1603626880 tracks the progress.

Re: Tour of Beam - an interactive Apache Beam learning guide

2023-06-15 Thread XQ Hu via user
We already have a Beam Overview there. https://beam.apache.org/get-started/tour-of-beam/ contains some good Colab notebooks, which are mainly just for Python. I suggest we link this to https://tour.beam.apache.org/ but move the current content under the Python Quickstart.

Re: [Error] Unable to submit job to Dataflow Runner V2

2023-05-27 Thread XQ Hu via user
Can you check whether your code has any options that contain any of [disable_runner_v2, disable_prime_runner_v2, disable_prime_streaming_engine]?
> I have a pipeline built using the Apache Beam Java SDK version 2.46.0.

Re: Apache beam

2023-05-06 Thread XQ Hu via user
You could create a batch pipeline that reads from GCS and writes to BigQuery, and you can use this template: https://cloud.google.com/dataflow/docs/guides/templates/provided/cloud-storage-to-bigquery
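If a custom pipeline is preferred over the template, a bare-bones Python sketch; the bucket path, table, schema, and the one-column parsing are all placeholders:

import apache_beam as beam
from apache_beam.io.gcp.bigquery import WriteToBigQuery

with beam.Pipeline() as p:
    _ = (
        p
        | "ReadFromGCS" >> beam.io.ReadFromText("gs://my-bucket/input/*.csv")  # placeholder path
        | "Parse" >> beam.Map(lambda line: {"raw": line})  # replace with real parsing
        | "WriteToBQ" >> WriteToBigQuery(
            "my-project:my_dataset.my_table",  # placeholder table
            schema="raw:STRING",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )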

Re: Losing records when using BigQuery IO Connector

2023-05-03 Thread XQ Hu via user
https://github.com/apache/beam/issues/26515 tracks this issue. The fix was merged. Thanks a lot for reporting this issue, Binh!
> I tested with streaming insert and file load, and they all worked as expected.

Re: Losing records when using BigQuery IO Connector

2023-04-17 Thread XQ Hu via user
Does FILE_LOADS (https://beam.apache.org/releases/javadoc/2.46.0/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.Write.Method.html#FILE_LOADS) work for your case? STORAGE_WRITE_API has been actively improved. If the latest SDK still has this issue, I highly recommend you create a Google