Re: Issue with growing state/checkpoint size

2023-09-06 Thread Wiśniowski Piotr
this in generic form). Best Wiśniowski Piotr  On 1.09.2023 15:56, Byron Ellis via user wrote: Depends on why you're using a fan-out approach in the first place. You might actually be better off doing all the work at the same time. On Fri, Sep 1, 2023 at 6:43 AM Ruben Vargas wrote: Ohh I

Re: UDF/UADF over complex structures

2023-10-02 Thread Wiśniowski Piotr
like to take a look at them? Best Wiśniowski Piotr On 28.09.2023 19:31, Balogh, György wrote: Sorry I was not specific enough. I ment using the SqlTransform registerUdf and registerUdaf. I use a lot of SQL in my pipeline and I would prefer using SQL UDFs in many cases over writing beam

Deployment approach for stateful streaming jobs

2023-10-03 Thread Wiśniowski Piotr
Hi, Any typical patterns for deployment of stateful streaming pipelines from Beam? Targeting Dataflow and Python SDK with significant usage of stateful processing with long windows (typically days long). From our current practice of maintaining pipelines we did identify 3 typical scenarios:

Fwd: [Python SDK] PyArrow Critical Vulnerability

2023-11-10 Thread Wiśniowski Piotr
ons of pyarrow? 2. Is there any planned effort on updating this to `14.0.1`? Is it possible to push the update to `2.52.0` beam release? I know the beam release is almost there. Best Wiśniowski Piotr

Re: Backup event from Kafka to S3 in parquet format every minute

2023-02-20 Thread Wiśniowski Piotr
. If You remember that It is possible to read or write to partitioned Parquet files as just `PTransform` that's great! I probably must have made some minor mistake in my trials. But eager to learn what was the mistake. Best regards Wiśniowski Piotr I did find solution like this one: https

Re: Backup event from Kafka to S3 in parquet format every minute

2023-02-17 Thread Wiśniowski Piotr
. Best Wiśniowski Piotr On 17.02.2023 02:00, Lydian wrote: I want to make a simple Beam pipeline which will store the events from kafka to S3 in parquet format every minute. Here's a simplified version of my pipeline: |def add_timestamp(event: Any) -> Any: from datetime import datetime f

Beam shell sql with zeta

2023-04-20 Thread Wiśniowski Piotr
Hi, I have a question regarding usage of Zeta with SQL extensions in SQL shell. I try to: ``` SET runner = DirectRunner; SET tempLocation = `/tmp/test/`; SET streaming=`True`; SET plannerName = `org.apache.beam.sdk.extensions.sql.zetasql.ZetaSQLQueryPlanner`; CREATE EXTERNAL TABLE

Beam shell sql with zeta

2023-04-20 Thread Wiśniowski Piotr
Hi, I have a question regarding usage of Zeta with SQL extensions in SQL shell. I try to: ``` SET runner = DirectRunner; SET tempLocation = `/tmp/test/`; SET streaming=`True`; SET plannerName = `org.apache.beam.sdk.extensions.sql.zetasql.ZetaSQLQueryPlanner`; CREATE EXTERNAL TABLE

Re: Beam shell sql with zeta

2023-04-20 Thread Wiśniowski Piotr
SQL? Support for struts is somewhat limited. I know there are bugs around nested structs and structs with single values. Andrew On Thu, Apr 20, 2023 at 9:26 AM Wiśniowski Piotr wrote: Hi, I have a question regarding usage of Zeta with SQL extensions in SQL shell. I

Re: [java] Trouble with gradle and using ParquetIO

2023-04-21 Thread Wiśniowski Piotr
that maven suggest this. Best Wiśniowski Piotr On 21.04.2023 08:30, Moritz Mack wrote: Hi Evan, Not sure why maven suggests using “compileOnly”. That’s certainly wrong, make sure to use “implementation” in your case. Cheers, Moritz On 21.04.23, 01:52, "Evan Galpin" wrote: Hi all

Re: Can I batch data when i use JDBC write operation?

2023-04-21 Thread Wiśniowski Piotr
://stackoverflow.com/questions/1109061/insert-on-duplicate-update-in-postgresql/8702291#8702291 4. some mix of p.2 and p.3 Hopefully this helps You in breaking the problem. Best regards Wiśniowski Piotr On 21.04.2023 01:34, Juan Romero wrote: Hi. Can someone help me with this? El mié, 19 abr 2023

Fwd: Beam SQL Extensions Windowing question/problem

2023-02-13 Thread Wiśniowski Piotr
ogle.com/dataflow/docs/reference/sql/streaming-extensions) that is present also in Calcite (https://calcite.apache.org/docs/reference.html#tumble) but Beam SQL does not parse it. Any idea what is the roadmap for this functionalities? Really appreciate quick reply. Best regards Wiśniowski Piotr

Re: Beam SQL found limitations

2023-05-26 Thread Wiśniowski Piotr
depending on my team decision - but would like to get like idea how hard this task is. Will create tickets for the rest of issues when I will have some spare time. Best regards Wiśniowski Piotr On 22.05.2023 18:28, Alexey Romanenko wrote: Hi Piotr, Thanks for details! I cross-post this to dev

Re: Beam SQL found limitations

2023-05-31 Thread Wiśniowski Piotr
. Kenn On Fri, May 26, 2023 at 11:06 AM Wiśniowski Piotr wrote: Hi Alexey, Thank You for reference to that discussion I do actually have pretty similar thoughts on what Beam SQL needs. Update from my side: Actually did find a workaround for issue with windowing function

Re: Delete window information

2023-08-01 Thread Wiśniowski Piotr
ode to tell beam when to emit elements for the aggregated keys. Hope this helps. If not please provide a bit more details what exactly You are trying to do so that I can try to help. Also You could try to get more familiar with the abstraction with the help of https://tour.beam.apache.org/.

Beam SQL found limitations

2023-05-18 Thread Wiśniowski Piotr
pe present in the schema - 1. `CROSS JOIN` between two unrelated tables is not supported - even if one is a static number table - 2. ROW construction not supported. It is not possible to nest data Below queries tat I use to testing this scenarios. Thank You for looking at this topics! Best Wiśnio

Re: [Dataflow][Java][2.52.0] Upgrading to 2.52.0 Surfaces Pubsub Coder Error

2024-01-03 Thread Wiśniowski Piotr
it seems that pipeline state on DataFlow depends somehow on the order in which steps are added to pipeline since some recent versions (as I do recall this was working correctly ~2.50?). Anyone knows if this is intended? If yes would like to know some explanation. Best regards Wiśniowski

Re: [Python SDK] PyArrow Critical Vulnerability

2023-11-12 Thread Wiśniowski Piotr
Hi Valentyn, Thank You for information and details. All make sense! I think we can wait for 2.53.0 release and meantime apply hotfix. Best Wiśniowski Piotr On 10.11.2023 20:27, Valentyn Tymofieiev via user wrote: From https://pypi.org/project/pyarrow-hotfix/ : pyarrow_hotfix must

Re: Roadmap of Calcite support on Beam SQL?

2024-03-04 Thread Wiśniowski Piotr
GROUP BY TUMBLE(cte.ts, INTERVAL '1'MINUTE) ; ``` Maybe it would be useful for you. Note that I am not up to date with recent versions of Beam SQL, but I will need to catch up (the syntax for defining window on table is cool). Best Wiśniowski Piotr On 4.03.2024 05:27, Jaehyeon Kim wrote

Updating streaming DataFlow pipeline

2024-03-06 Thread Wiśniowski Piotr
for every `PCollection`. What else should I look into, which could result in such behavior? Best Wiśniowski Piotr

Re: [Question] Does Apache SnowflakeIO support Apache Parquet?

2024-03-05 Thread Wiśniowski Piotr
Hi Xinmin, After quick look at the connection in java it seems that the csv format was an explicit decision. As this is entirely transparent to user I am just curious why would You like to use parquet instead? What do you want to achieve? Regards, Wiśniowski Piotr On 5.03.2024 05:47

Re: [Question] Python Streaming Pipeline Support

2024-03-08 Thread Wiśniowski Piotr
me few days to grasp how to handle this with multiple input streams, but I do plan to create a post with findings. I think I had similar days long investigation few months ago. Let me know if this is helpful. Best Wiśniowski Piotr On 9.03.2024 03:29, Puertos tavares, Jose J (Canada) via

Watermark progress halt in Python streaming pipelines

2024-04-23 Thread Wiśniowski Piotr
processors, shutdown logic can overaggressively hold the lock, blocking acceptance of new work` in Beam release docs as known issue. What is the status of this? Can this potentially be related? Really appreciate any help, clues or hints how to debug this issue. Best regards Wiśniowski Piotr

Re: Watermark progress halt in Python streaming pipelines

2024-04-24 Thread Wiśniowski Piotr
confirmation. BTW great job on the investigation of bug that you mentioned. Impressive. Seems like a nasty one. Best, Wiśniowski Piotr On 24.04.2024 00:31, Valentyn Tymofieiev via user wrote: You might be running into https://github.com/apache/beam/issues/30867. Among the error messages you

Re: Hot update in dataflow without lossing messages

2024-04-28 Thread Wiśniowski Piotr
and drain old pipelines. Note all above is just hypothesis, but hopefully it might be helpful. Best Wiśniowski Piotr On 16.04.2024 05:15, Juan Romero wrote: The deployment in the job is made by terraform. I verified and seems that terraform do it incorrectly under the hood because it stop the cu

Re: Any recomendation for key for GroupIntoBatches

2024-04-28 Thread Wiśniowski Piotr
of your runner) - shuffling would be required when creating artificial random-looking keys. Note that above is Python, but I do bet there is Java counterpart (or at least easy to implement). Best Wiśniowski Piotr On 15.04.2024 19:14, Reuven Lax via user wrote: There are various strategies

Re: Watermark progress halt in Python streaming pipelines

2024-04-30 Thread Wiśniowski Piotr
Hi, Getting back with update. After updating `grpcio` the issues are gone. Thank you for the solution and investigation. Feels like I own you a beer :) Best Wiśniowski Piotr On 24.04.2024 22:11, Valentyn Tymofieiev wrote: On Wed, Apr 24, 2024 at 12:40 PM Wiśniowski Piotr wrote