Re: Backup event from Kafka to S3 in parquet format every minute

2023-02-22 Thread Lydian
I've tested the same window, trigger, and allowed_lateness; nothing seems to work. I think the main issue is in the WriteImpl which I linked earlier: https://github.com/apache/beam/blob/v2.41.0/sdks/python/apache_beam/io/iobase.py#L1033 According to the doc: > Currently only batch workflows support
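If WriteImpl's default sink path is indeed batch-only, as the linked doc string suggests, one common streaming workaround is to group elements per window and write each grouped pane yourself in a DoFn. A hypothetical sketch (the actual Parquet write is stubbed out, the bucket/path layout is made up, and the factory/lazy-import shape is only there to keep Beam optional at module load):

```python
def make_window_writer():
    # Hypothetical workaround sketch: group per window upstream (e.g. via
    # GroupByKey after WindowInto), then write each batch explicitly here
    # instead of relying on WriteImpl.
    import apache_beam as beam

    class WriteWindowedBatch(beam.DoFn):
        def process(self, keyed_batch, window=beam.DoFn.WindowParam):
            _key, rows = keyed_batch
            # One output path per window, named after the window start.
            path = (f"s3://my-bucket/"
                    f"{window.start.to_utc_datetime():%Y%m%d%H%M}.parquet")
            # ... write `rows` to `path` with pyarrow here (stubbed) ...
            yield path

    return beam.ParDo(WriteWindowedBatch())
```

This sidesteps the batch-only write machinery entirely, at the cost of handling file naming and the Parquet encoding yourself.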

Re: Backup event from Kafka to S3 in parquet format every minute

2023-02-20 Thread Wiśniowski Piotr
Hi Alexey, I am just learning Beam and doing a POC that requires fetching stream data from PubSub and partitioning it on GCS as Parquet files with a constant window. The thing is, I have an additional requirement to use ONLY SQL. I did not manage to do it. My solutions either worked indefinitely or

Re: Backup event from Kafka to S3 in parquet format every minute

2023-02-17 Thread Alexey Romanenko
Piotr, > On 17 Feb 2023, at 09:48, Wiśniowski Piotr > wrote: > Does this mean that Parquet IO does not support partitioning, and we need to > do some workarounds? Like explicitly mapping each window to a separate > Parquet file? > Could you elaborate a bit more on this? IIRC, we used to

Re: Backup event from Kafka to S3 in parquet format every minute

2023-02-17 Thread Pavel Solomin
For me this use-case worked with the following window definition, which was a bit of trial and error, and I can't claim I got a 100% understanding of the windowing logic. Here's my Java code for Kinesis -> Parquet files which worked:
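The Java snippet itself is cut off in the archive preview. As a rough Python equivalent of the kind of window definition that tends to work for streaming file writes (a sketch only; the exact trigger and lateness settings are assumptions, not Pavel's actual values):

```python
def minute_windowing():
    # Lazy import so this module loads without Beam installed; sketch only.
    from apache_beam import WindowInto
    from apache_beam.transforms import trigger, window

    return WindowInto(
        window.FixedWindows(60),                 # 1-minute fixed windows
        trigger=trigger.AfterWatermark(),        # fire when the watermark passes window end
        accumulation_mode=trigger.AccumulationMode.DISCARDING,
        allowed_lateness=0)                      # drop late data
```

Applied as `events | minute_windowing()`, this gives one pane per minute once the watermark advances past each window's end.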

Re: Backup event from Kafka to S3 in parquet format every minute

2023-02-17 Thread Wiśniowski Piotr
Hi, Sounds like the exact problem that I had a few emails before - https://lists.apache.org/thread/q929lbwp8ylchbn8ngypfqlbvrwpfzph Does this mean that Parquet IO does not support partitioning, and we need to do some workarounds? Like explicitly mapping each window to a separate Parquet file?
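Mapping each window to a separate Parquet file mostly comes down to deriving a deterministic path from the window's start time. A pure-Python sketch of that mapping (bucket name and path layout are made up for illustration):

```python
from datetime import datetime, timezone


def window_start(event_ts: float, size_sec: int = 60) -> float:
    """Start of the fixed window containing event_ts (aligned to the epoch, Beam-style)."""
    return event_ts - (event_ts % size_sec)


def window_path(event_ts: float, size_sec: int = 60,
                bucket: str = "my-bucket") -> str:
    # One file (or prefix) per window, named after the window start,
    # so every event in the same minute lands in the same file.
    start = datetime.fromtimestamp(window_start(event_ts, size_sec),
                                   tz=timezone.utc)
    return f"s3://{bucket}/events/{start:%Y/%m/%d/%H%M}.parquet"
```

For example, two events with timestamps 90.0 and 119.0 both fall in the 00:01 window and map to the same path.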

Backup event from Kafka to S3 in parquet format every minute

2023-02-16 Thread Lydian
I want to make a simple Beam pipeline which will store the events from Kafka to S3 in Parquet format every minute. Here's a simplified version of my pipeline: def add_timestamp(event: Any) -> Any: from datetime import datetime from apache_beam import window return
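The archive preview cuts the pipeline off mid-function. A minimal sketch of what such an `add_timestamp` plus a fixed-window read might look like (the `timestamp` field name, broker address, and topic are assumptions, not from the original mail, and the Parquet write step is left as a comment):

```python
from datetime import datetime, timezone
from typing import Any, Dict


def parse_event_time(event: Dict[str, Any]) -> float:
    # Assumes events carry an ISO-8601 'timestamp' field (hypothetical schema).
    dt = datetime.fromisoformat(event["timestamp"])
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.timestamp()


def add_timestamp(event: Any) -> Any:
    # Beam is imported lazily, as in the original snippet, so this module
    # also loads where apache_beam is not installed.
    from apache_beam import window
    return window.TimestampedValue(event, parse_event_time(event))


if __name__ == "__main__":
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions
    from apache_beam.transforms import trigger, window

    opts = PipelineOptions(streaming=True)
    with beam.Pipeline(options=opts) as p:
        (
            p
            | "Read" >> beam.io.ReadFromKafka(
                consumer_config={"bootstrap.servers": "broker:9092"},
                topics=["events"])          # broker/topic are placeholders
            | "Stamp" >> beam.Map(add_timestamp)  # assumes decoded dict events
            | "Window" >> beam.WindowInto(
                window.FixedWindows(60),
                trigger=trigger.AfterWatermark(),
                accumulation_mode=trigger.AccumulationMode.DISCARDING)
            # ... per-window Parquet write to S3 goes here ...
        )
```

As the rest of this thread discusses, the hard part is the final write: the default Python file sinks are batch-oriented, so the per-window Parquet output usually needs an explicit workaround.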