Backup event from Kafka to S3 in parquet format every minute

2023-02-16 Thread Lydian
I want to make a simple Beam pipeline which will store the events from
kafka to S3 in parquet format every minute.

Here's a simplified version of my pipeline:

def add_timestamp(event: Any) -> Any:
from datetime import datetime
from apache_beam import window

return window.TimestampedValue(event,
datetime.timestamp(event[1].timestamp))
# Actual Pipeline
(
  pipeline
  | "Read from Kafka" >> ReadFromKafka(consumer_config, topics,
with_metadata=False)
  | "Transformed" >> beam.Map(my_transform)
  | "Add timestamp" >> beam.Map(add_timestamp)
  | "window" >> beam.WindowInto(window.FixedWindows(60))  # 1 mins
  | "writing to parquet" >>
beam.io.WriteToParquet('s3://test-bucket/', pyarrow_schema)
)

However, the pipeline failed with

GroupByKey cannot be applied to an unbounded PCollection with global
windowing and a default trigger

This seems to be coming from
https://github.com/apache/beam/blob/v2.41.0/sdks/python/apache_beam/io/iobase.py#L1145-L1146
which
always add a GlobalWindows and thus causing this error. Wondering what I
should do to correctly backup the event from Kafka (Unbounded) to S3.
Thanks!

btw, I am running with portableRunner with Flink. Beam Version is 2.41.0
(the latest version seems to have the same code) and Flink version is 1.14.5



Sincerely,
Lydian Lee


Re: OpenJDK8 / OpenJDK11 container deprecation

2023-02-16 Thread Luke Cwik via user
I upgraded the docker version on Jenkins workers and the tests passed.
(also installed Python 3.11 so we are ready for that)

On Tue, Feb 14, 2023 at 3:21 PM Kenneth Knowles  wrote:

> SGTM. I asked on the PR if this could impact users, but having read the
> docker release calendar I am not concerned. The last update to the old
> version was in 2019, and the introduction of compatible versions was 2020.
>
> On Tue, Feb 14, 2023 at 3:01 PM Byron Ellis via user 
> wrote:
>
>> FWIW I am Team Upgrade Docker :-)
>>
>> On Tue, Feb 14, 2023 at 2:53 PM Luke Cwik via user 
>> wrote:
>>
>>> I made some progress in testing the container and did hit an issue where
>>> Ubuntu 22.04 "Jammy" is dependent on the version of Docker installed. It
>>> turns out that our boot.go crashes with "runtime/cgo: pthread_create
>>> failed: Operation not permitted" because the Ubuntu 22.04 is using new
>>> syscalls that Docker 18.09.4 doesn't have a seccomp policy for (and uses a
>>> default of deny). We have a couple of choices here:
>>> 1) upgrade the version of docker on Jenkins and require users to
>>> similarly use a new enough version of Docker so that this isn't an issue
>>> for them
>>> 2) use Ubuntu 20.04 "Focal" as the docker container
>>>
>>> I was using Docker 20.10.21 which is why I didn't hit this issue when
>>> testing the change locally.
>>>
>>> We could also do these but they same strictly worse then either of the
>>> two options discussed above:
>>> A) disable the seccomp policy on Jenkins
>>> B) use a custom seccomp policy on Jenkins
>>>
>>> My suggestion is to upgrade Docker versions on Jenkins and use Ubuntu
>>> 22.04 as it will have LTS releases till 2027 and then security patches till
>>> 2032 which gives everyone the longest runway till we need to swap OS
>>> versions again for users of Apache Beam. Any concerns or ideas?
>>>
>>>
>>>
>>> On Thu, Feb 9, 2023 at 10:20 AM Luke Cwik  wrote:
>>>
 Our current container java 8 container is 262 MiBs and layers on top of
 openjdk:8-bullseye which is 226 MiBs compressed while eclipse-temurin:8 is
 92 MiBs compressed and eclipse-temurin:8-alpine is 65 MiBs compressed.

 I would rather not get into issues with C library differences caused by
 the alpine project so I would stick with the safer option and let users
 choose alpine when building their custom container if they feel it provides
 a large win for them. We can always swap to alpine in the future as well if
 the C library differences become a non-issue.

 So swapping to eclipse-temurin will save us a bunch on the container
 size which should help with container transfer and hopefully for startup
 times as well.

 On Tue, Feb 7, 2023 at 5:41 PM Andrew Pilloud 
 wrote:

> This sounds reasonable to me as well.
>
> I've made swaps like this in the past, the base image of each is
> probably a bigger factor than the JDK. The openjdk images were based on
> Debian 11. The default eclipse-temurin images are based on Ubuntu 22.04
> with an alpine option. Ubuntu is a Debian derivative but the versions and
> package names aren't exact matches and Ubuntu tends to update a little
> faster. For most users I don't think this will matter but users building
> custom containers may need to make minor changes. The alpine option will 
> be
> much smaller (which could be a significant improvement) but would be a 
> more
> significant change to the environment.
>
> On Tue, Feb 7, 2023 at 5:18 PM Robert Bradshaw via dev <
> d...@beam.apache.org> wrote:
>
>> Seams reasonable to me.
>>
>> On Tue, Feb 7, 2023 at 4:19 PM Luke Cwik via user <
>> user@beam.apache.org> wrote:
>> >
>> > As per [1], the JDK8 and JDK11 containers that Apache Beam uses
>> have stopped being built and supported since July 2022. I have filed [2] 
>> to
>> track the resolution of this issue.
>> >
>> > Based upon [1], almost everyone is swapping to the eclipse-temurin
>> container[3] as their base based upon the linked issues from the
>> deprecation notice[1]. The eclipse-temurin container is released under
>> these licenses:
>> > Apache License, Version 2.0
>> > Eclipse Distribution License 1.0 (BSD)
>> > Eclipse Public License 2.0
>> > 一 (Secondary) GNU General Public License, version 2 with OpenJDK
>> Assembly Exception
>> > 一 (Secondary) GNU General Public License, version 2 with the GNU
>> Classpath Exception
>> >
>> > I propose that we swap all our containers to the eclipse-temurin
>> containers[3].
>> >
>> > Open to other ideas and also would be great to hear about your
>> experience in any other projects that you have had to make a similar
>> decision.
>> >
>> > 1: https://github.com/docker-library/openjdk/issues/505
>> > 2: https://github.com/apache/beam/issues/25371
>> > 3: